Decide on entailment strategy for merged KG

cmungall commented 4 years ago

See full doc: https://docs.google.com/document/d/1nUVnQ90MHMofOFrko5j_uUo72EEWfEOwRxILf-G5SNs/edit

Assuming KG is largely in ABox

Direct
- Symmetry (e.g. A interacts with B)
- InverseOf (e.g. A part-of B => B has-part A)
Indirect
- Transitivity
  - SubPropertyOf
  - Type/Category + SubClassOf
- Property Chain

We should have some strategy. I want the same strategy for all our KGs.

Do we assert entailments in the final KG? Do we make a separate KG with inferences? Do we stratify inferences by direct vs indirect? Which entailments is it necessary to assert, and where? Do ingest modules take care of trivial symmetry entailment assertion?

Do we use Arachne? (Remember the KG is mostly in the ABox, we can even put the ontologies in the ABox). Will this scale over the whole KG?

What are the requirements for downstream consumers? E.g. embiggen is fine with interacts-with asserted in one direction, as it assumes an undirected graph for walking (correct, @justaddcoffee?). Cypher consumers may be happier with no symmetry entailment assertion, since it's easier to write queries that are direction neutral. It's also not so hard with SPARQL (interacts_with|^interacts_with), but it gets a little awkward and runs counter to user expectations?
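To make the direction-neutral option concrete, here is a minimal sketch of a client that follows a symmetric predicate in both directions without any symmetry entailments being asserted. The edge data, the `SYMMETRIC` set, and the helper names are all hypothetical; the sketch also assumes the client has somehow been told which predicates are symmetric, which is exactly the open question here.

```python
from collections import defaultdict

# Hypothetical toy edges (subject, predicate, object); each interaction is
# asserted in one direction only, as an ingest might leave it.
EDGES = [
    ("gene:A", "interacts_with", "gene:B"),
    ("gene:C", "interacts_with", "gene:A"),
    ("gene:A", "part_of", "complex:X"),
]

# Assumption: the client knows which predicates are declared symmetric.
SYMMETRIC = {"interacts_with"}


def build_indexes(edges):
    """Index edges by (subject, predicate) and by (object, predicate)."""
    forward, backward = defaultdict(set), defaultdict(set)
    for s, p, o in edges:
        forward[(s, p)].add(o)
        backward[(o, p)].add(s)
    return forward, backward


def neighbors(node, predicate, forward, backward):
    """Follow `predicate` from `node`, also in reverse if it is symmetric."""
    found = set(forward.get((node, predicate), ()))
    if predicate in SYMMETRIC:
        found |= backward.get((node, predicate), set())
    return found


forward, backward = build_indexes(EDGES)
print(sorted(neighbors("gene:A", "interacts_with", forward, backward)))
print(sorted(neighbors("complex:X", "part_of", forward, backward)))
```

The second lookup returns nothing because `part_of` is not in `SYMMETRIC`, so the reverse index is never consulted. A client that lacks the `SYMMETRIC` set would silently miss `gene:C` in the first lookup.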

What is our strategy for subProperty entailment? Historically in Monarch we have relied on a Cypher extension to query the subPropertyOf reflexive closure, but this is problematic. Do we add additional edges? This would confuse many applications. Or do we do this with a new edge property, e.g. entailed_edge_label?
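As a rough sketch of the edge-property option: compute the reflexive transitive closure of subPropertyOf once, then attach it to each edge instead of adding new edges. The hierarchy below is made up for illustration, and `edge_label` is assumed to be the name of the asserted property that `entailed_edge_label` would sit beside.

```python
# Hypothetical property hierarchy: child -> direct parents.
SUB_PROPERTY_OF = {
    "physically_interacts_with": ["interacts_with"],
    "interacts_with": ["related_to"],
    "part_of": ["related_to"],
}


def super_properties(prop, hierarchy):
    """Return prop plus all its ancestors (reflexive transitive closure)."""
    seen, stack = {prop}, [prop]
    while stack:
        for parent in hierarchy.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def annotate(edge, hierarchy):
    """Copy the edge and add an entailed_edge_label list; no new edges."""
    labels = super_properties(edge["edge_label"], hierarchy)
    return {**edge, "entailed_edge_label": sorted(labels)}


edge = {
    "subject": "gene:A",
    "edge_label": "physically_interacts_with",  # assumed asserted-label key
    "object": "gene:B",
}
annotated = annotate(edge, SUB_PROPERTY_OF)
print(annotated["entailed_edge_label"])
```

The edge count is unchanged and the asserted label stays distinguishable from the entailed ones, so a query for `interacts_with` can match on `entailed_edge_label` while embedding pipelines keep reading `edge_label`.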

What is our strategy for subClassOf within the data model space i.e. biolink? Do we use bl:category or a new property for bl:entailed_category? What about more detailed classification, e.g. to a mondo class? Do we make edges to all ancestors?

Specific tickets on other repos:

https://github.com/Knowledge-Graph-Hub/kg-covid-19/issues/328

cmungall commented 4 years ago

Relevant paper (h/t) @matentzn @balhoff https://arxiv.org/abs/2009.00318

we investigate the effect of materializing implicit A-box axioms induced by subproperties, as well as symmetric and transitive properties. While it might be a reasonable assumption that such a materialization before computing embeddings might lead to better embeddings, we conduct a set of experiments on DBpedia which demonstrate that the materialization actually has a negative effect on the performance of RDF2vec. In our analysis, we argue that despite the huge body of work devoted on completing missing information in knowledge graphs, such missing implicit information is actually a signal, not a defect, and we show examples illustrating that assumption

Comment paraphrased from slack:

@matentzn: seems the message is that materializing transitive entailments leads to worse performance; not necessarily the case for other entailments

matentzn commented 4 years ago

I would intuitively say that materialising transitive closure generates noise.. I love the question this paper is raising, but I would not just base any decisions on it alone. We need use-case driven experiments! Exciting stuff!

cmungall commented 4 years ago

I would intuitively say that materialising transitive closure generates noise..

If you are talking about the context of naive node2vec-style algorithms, I share similar intuitions: the sampling is biased and more noise than signal is introduced. My intuition is that this also holds for other entailments, e.g. symmetry and inverseOf (depending on the walk algorithm).

This ticket is not about one particular use case (KG embedding using random graph walks) though. Given a variety of use cases, some of which we can anticipate, some of which we can't, what is our strategy?

The default is no strategy, no entailments are materialized, and we punt responsibility to the client. The default client behavior here is no OWL semantics, and clients walking graphs blind to edge label, object property characteristics. This seems unsatisfying.

Even in trivial cases, how would clients know whether to traverse over interacts_with or (interacts_with|^interacts_with)?

matentzn commented 4 years ago

The default is no strategy, no entailments are materialized, and we punt responsibility to the client. The default client behavior here is no OWL semantics, and clients walking graphs blind to edge label, object property characteristics. This seems unsatisfying.

Not sure, given our lack of understanding of the consequences of just materialising role hierarchies, chains and characteristics blindly, I still think the default should be isomorphic to the raw data ingest. But I am a bit on the fence (55% on the side of raw data shape).

Even in trivial cases, how would clients know whether to traverse over interacts_with or (interacts_with|^interacts_with)?

Probably a more complicated discussion, but I would say that this problem is the same for the source data and the graph; so the answer is: they don't; they can query what the source data providers intended. We are not even sure if seemingly symmetric relations are actually symmetric (friendship?); what I am trying to say is that at ingest time we mapped a property in the source data to an RO relation due to some terminological resemblance and then applied RO semantics to the data, which IMO could be risky, i.e. not reflecting the intention of the data.

I would say: Let's start from raw, and then enumerate a set of profiles (we don't even need to call it OWL, we could call it OBO!) that can be passed as parameters:

- materialisation profile:
  - id: OBO
  - atomic transitive reduct: true
  - atomic transitive closure: false
  - symmetry: true
  - reflexivity: false
  - sub-property: direct # as opposed to indirect, none
  - equivalent-property: true
  - property-chain: true # materialises all role chains directly (not necessarily transitive closure)
  - inverses: true # as you suggest
  - domains: true
  - ranges: true

We could then, for our graphs, resort to that profile "by default". Something along these lines.
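A minimal sketch of how such a profile could drive materialisation, assuming the profile is loaded as a plain dict and only three of the switches above are implemented. The property declarations (`SYMMETRIC`, `INVERSE_OF`, `TRANSITIVE`), the function name, and the toy edges are hypothetical; in practice the declarations would come from the ontology rather than being hard-coded.

```python
# Hypothetical profile: a subset of the keys listed above, as a plain dict.
PROFILE = {"symmetry": True, "inverses": True, "atomic transitive closure": False}

# Assumed property declarations; stand-ins for what the ontology would state.
SYMMETRIC = {"interacts_with"}
INVERSE_OF = {"part_of": "has_part", "has_part": "part_of"}
TRANSITIVE = {"part_of"}


def materialise(edges, profile):
    """Return asserted edges plus the entailments the profile switches on."""
    result = set(edges)
    if profile.get("atomic transitive closure"):
        changed = True
        while changed:  # naive fixpoint; fine for a toy graph only
            changed = False
            for s, p, o in list(result):
                if p not in TRANSITIVE:
                    continue
                for s2, p2, o2 in list(result):
                    if p2 == p and s2 == o and (s, p, o2) not in result:
                        result.add((s, p, o2))
                        changed = True
    if profile.get("symmetry"):
        result |= {(o, p, s) for s, p, o in result if p in SYMMETRIC}
    if profile.get("inverses"):
        result |= {(o, INVERSE_OF[p], s) for s, p, o in result if p in INVERSE_OF}
    return result


asserted = {
    ("gene:A", "interacts_with", "gene:B"),
    ("cell:1", "part_of", "tissue:2"),
    ("tissue:2", "part_of", "organ:3"),
}
# Keep entailed edges separate so they can be written out or compared apart.
entailed = materialise(asserted, PROFILE) - asserted
print(sorted(entailed))
```

Passing an empty profile returns the asserted edges unchanged, which corresponds to the raw shape, so the same function can produce both sides of a raw-versus-profile comparison.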

cmungall commented 3 years ago

Not sure, given our lack of understanding of the consequences of just materialising role hierarchies, chains and characteristics blindly, I still think the default should be isomorphic to the raw data ingest.

We should be able to determine the consequences

We are not even sure if seemingly symmetric relations are actually symmetric (friendship?)

Of course we would not guess. The property would be declared or not declared symmetric.

what I am trying to say is that at ingest time we mapped property in the source data to a RO relation due to some terminological resemblance and then apply RO semantics to the data which IMO could be risky, i.e. not reflecting the intention of the data

Totally not following. No one should do anything based on resemblance to anything. The person writing an ingest would choose the most appropriate relationship to use. They would look at properties such as symmetry and transitivity, and at sub- and super-relations, and they would pick something appropriate.

would say: Lets start from raw, and then enumerate a set of profiles

Still not following what you mean by raw, but I like the idea of having a simple YAML for profiles, and being very explicit about (a) what is materialized at ingest time, (b) what is done as some kind of post-processing reasoning step by a reasoner, and (c) what the client is intended to do.

matentzn commented 3 years ago

I added this issue to our Monday agenda.

cmungall commented 3 years ago

Related: OWL2Vec* https://arxiv.org/abs/2009.14654

We also analyzed the impact of using reasoning (provided by OWL 2 reasoner HermiT) before the ontology is transformed into an RDF graph, as shown in Table 5. We can see that reasoning has a limited impact in the conducted experiments; the MRR results with and without reasoning are quite close w.r.t. all four methods tested.

My notes: I suspect this is very dependent on the ontologies and the particular training task

matentzn commented 3 years ago

In our last conversation you convinced me to "start from" the default entailment profile rather than from raw (no materialisation). However, I still think this discussion is moot, because from an analysis perspective, we obviously want to compare the suggested entailment profile with the raw ingest (no materialisation at all). Raw means no materialisation whatsoever.

TranslatorSRI / reference-kg

Decide on entailment strategy for merged KG #9