RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License
38 stars 8 forks source link

subclass_of cycles in KG2 #63

Open amykglen opened 3 years ago

amykglen commented 3 years ago

for reference: the issue about subclass_of cycles in KG2c (vs. KG2 itself) is here: https://github.com/RTXteam/RTX/issues/1367

some examples for you:

match p=(n)-[:`biolink:subclass_of` *1..3]->(n) return count(p)

returns 1,240 in KG2.6.4

(I only counted up to 3 hops here as it takes quite a while to look for longer paths, but I think there are many more cycles with 4+ hops)

it seems many of them involve SEMMEDDB edges, but a good chunk don't... e.g., doing the same query as above but filtering out SEMMEDDB edges:

match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by) return count(p)

returns 894 in KG2.6.4.

here's a specific example - all edges in this cycle are from UMLS MTH:

match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by) return p limit 1

Screen Shot 2021-06-01 at 11 43 45 AM

and an example of longer cycles - "Histamine" is apparently part of 90 subclass_of cycles up to 10 hops long:

match p=(n {id:'UMLS:C0019588'})-[:`biolink:subclass_of` *1..10]->(n) return count(p)
saramsey commented 3 years ago

OK I have committed an attempt at a partial fix for this. I believe that UMLS:RB relations are a source of the problem.

saramsey commented 3 years ago

Thank you @amykglen for the very helpful example!

saramsey commented 3 years ago

Hi @ericawood and @acevedol and @kvarforl please feel free to revert my change if I am taking the wrong tack in trying to address this issue.

kvarforl commented 3 years ago

I think that's a good start! To try to see how many cycles Steve's changes would fix, I ran the following cypher:

match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by and "UMLS:RB" = rel.relation) return count(p)

And it unfortunately only returned 12 (out of the 894 non semmed cycles Amy's query found). So I think its definitely a good start but there's more to track down!

Edit: I think my cypher query is wonky. hold please :)

amykglen commented 3 years ago

is maybe this the cypher query you'd want?

match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by) AND any(rel in relationships(p) WHERE "UMLS:RB" = rel.relation) return count(p)

looks like it still returns 12 though, interestingly!

kvarforl commented 3 years ago

Ah yes thanks @amykglen ! That one is better. I got psyched out because, as you mentioned, it also returns 12 😅

kvarforl commented 3 years ago

looks like LOINC:class_of and rdfs:subClassOf might also be part of the problem, but I'm not entirely sure (my cypher skills are somewhat lacking, I'll admit ha).

looking at rdfs:subClassOf, this cypher query returns 882.

match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by) AND any(rel in relationships(p) WHERE "rdfs:subClassOf" = rel.relation) return count(p)

And for LOINC:class_of, we get 600.

match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by) AND any(rel in relationships(p) WHERE "LOINC:class_of" = rel.relation) return count(p)

My skepticism comes in since 600 + 882 > 894, but I suppose that might make sense since subclass of cycles probably don't consist of relations from all the same source.

ecwood commented 3 years ago

Some data points from KG2.6.5:

Subclasses cycles with 1-3 hops

match p=(n)-[:`biolink:subclass_of` *1..3]->(n) return count(p)

returns 948

Subclasses cycles with 1-3 hops without SEMMED

match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by) return count(p)

returns 882

Subclasses cycles with 1-4 hops

match p=(n)-[:`biolink:subclass_of` *1..4]->(n) return count(p)

returns 30996

Subclasses cycles with 1-3 hops by original relation

match p=(n)-[:`biolink:subclass_of`*1..3]->(n) unwind(relationships(p)) as rels return rels.relation, count(p)
rels.relation count(p)
"SEMMEDDB:isa" 154
"rdfs:subClassOf" 1766
"LOINC:class_of" 600
"LOINC:has_class" 270

Subclasses cycles with 1-5 hops

match p=(n)-[:`biolink:subclass_of` *1..5]->(n) return count(p)

returns 129436

In conclusion, it is still a big problem in KG2.6.5.

edeutsch commented 3 years ago

Seems like a near-impossible task to clean all this up 100%? so it seems like some ARAX/Expander tolerance of this is inescapable? Especially if the same problem probably occurs in other KPs that we cannot clean up?

amykglen commented 3 years ago

yeah, I'm in favor of selecting a few sources within KG2 that should be very reliable and having our KP reasoning only consider subclass_of edges from those.

(and thanks for the nice write-up, @ericawood!)

ecwood commented 3 years ago

Do you think it would be helpful to break down rdfs:subClassOf into an subclass relation for each individual source to do that?

ecwood commented 3 years ago
match p=(n)-[:`biolink:subclass_of`*1..3]->(n) unwind(relationships(p)) as rels return rels.relation, rels.provided_by, count(p)
rels.relation rels.provided_by count(p)
"SEMMEDDB:isa" ["SEMMEDDB:"] 154
"rdfs:subClassOf" ["OBO:go/extensions/go-plus.owl"] 4
"rdfs:subClassOf" ["umls_source:GO"] 2
"rdfs:subClassOf" ["umls_source:HL7"] 2
"rdfs:subClassOf" ["OBO:hp.owl", "OBO:mondo.owl"] 2
"rdfs:subClassOf" ["umls_source:HPO"] 2
"rdfs:subClassOf" ["umls_source:LNC"] 1740
"LOINC:class_of" ["umls_source:LNC"] 600
"LOINC:has_class" ["umls_source:LNC"] 270
"rdfs:subClassOf" ["umls_source:NCBITAXON"] 2
"rdfs:subClassOf" ["OBO:ncbitaxon/subsets/taxslim.owl"] 2
"rdfs:subClassOf" ["OBO:chebi.owl"] 2
"rdfs:subClassOf" ["OBO:hp.owl", "OBO:nbo.owl"] 2
"rdfs:subClassOf" ["OBO:uberon/ext.owl"] 2
"rdfs:subClassOf" ["EFO:efo.owl"] 2
"rdfs:subClassOf" ["ORPHANET:"] 2
ecwood commented 3 years ago

For a short term solution, it might be sufficient to avoid using LOINC subclass relationships (and avoid using SemMed subclass relationships):

match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by and not 'umls_source:LNC' in rel.provided_by) return count(p)

count(p) --| 12

The downside of this solution is that you would lose about 8% of the subclass relationships (in KG2.6.7).

Even without taking out SemMed, it's a lot better:

match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not 'umls_source:LNC' in rel.provided_by) return count(p)

count(p) --| 78

This does not produce as good of results in KG2.6.7c:

match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by and not 'umls_source:LNC' in rel.provided_by) return count(p)

count(p) --| 4344