Open amykglen opened 3 years ago
OK I have committed an attempt at a partial fix for this. I believe that UMLS:RB
relations are a source of the problem.
Thank you @amykglen for the very helpful example!
Hi @ericawood and @acevedol and @kvarforl please feel free to revert my change if I am taking the wrong tack in trying to address this issue.
I think that's a good start! To try to see how many cycles Steve's changes would fix, I ran the following cypher:
match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by and "UMLS:RB" = rel.relation) return count(p)
And it unfortunately only returned 12 (out of the 894 non semmed cycles Amy's query found). So I think its definitely a good start but there's more to track down!
Edit: I think my cypher query is wonky. hold please :)
is maybe this the cypher query you'd want?
match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by) AND any(rel in relationships(p) WHERE "UMLS:RB" = rel.relation) return count(p)
looks like it still returns 12 though, interestingly!
Ah yes thanks @amykglen ! That one is better. I got psyched out because, as you mentioned, it also returns 12 😅
looks like LOINC:class_of
and rdfs:subClassOf
might also be part of the problem, but I'm not entirely sure (my cypher skills are somewhat lacking, I'll admit ha).
looking at rdfs:subClassOf
, this cypher query returns 882.
match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by) AND any(rel in relationships(p) WHERE "rdfs:subClassOf" = rel.relation) return count(p)
And for LOINC:class_of
, we get 600.
match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by) AND any(rel in relationships(p) WHERE "LOINC:class_of" = rel.relation) return count(p)
My skepticism comes in since 600 + 882 > 894, but I suppose that might make sense since subclass of cycles probably don't consist of relations from all the same source.
Some data points from KG2.6.5:
Subclasses cycles with 1-3 hops
match p=(n)-[:`biolink:subclass_of` *1..3]->(n) return count(p)
returns 948
Subclasses cycles with 1-3 hops without SEMMED
match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by) return count(p)
returns 882
Subclasses cycles with 1-4 hops
match p=(n)-[:`biolink:subclass_of` *1..4]->(n) return count(p)
returns 30996
Subclasses cycles with 1-3 hops by original relation
match p=(n)-[:`biolink:subclass_of`*1..3]->(n) unwind(relationships(p)) as rels return rels.relation, count(p)
rels.relation | count(p) |
---|---|
"SEMMEDDB:isa" | 154 |
"rdfs:subClassOf" | 1766 |
"LOINC:class_of" | 600 |
"LOINC:has_class" | 270 |
Subclasses cycles with 1-5 hops
match p=(n)-[:`biolink:subclass_of` *1..5]->(n) return count(p)
returns 129436
In conclusion, it is still a big problem in KG2.6.5.
Seems like a near-impossible task to clean all this up 100%? so it seems like some ARAX/Expander tolerance of this is inescapable? Especially if the same problem probably occurs in other KPs that we cannot clean up?
yeah, I'm in favor of selecting a few sources within KG2 that should be very reliable and having our KP reasoning only consider subclass_of
edges from those.
(and thanks for the nice write-up, @ericawood!)
Do you think it would be helpful to break down rdfs:subClassOf
into an subclass relation for each individual source to do that?
match p=(n)-[:`biolink:subclass_of`*1..3]->(n) unwind(relationships(p)) as rels return rels.relation, rels.provided_by, count(p)
rels.relation | rels.provided_by | count(p) |
---|---|---|
"SEMMEDDB:isa" | ["SEMMEDDB:"] | 154 |
"rdfs:subClassOf" | ["OBO:go/extensions/go-plus.owl"] | 4 |
"rdfs:subClassOf" | ["umls_source:GO"] | 2 |
"rdfs:subClassOf" | ["umls_source:HL7"] | 2 |
"rdfs:subClassOf" | ["OBO:hp.owl", "OBO:mondo.owl"] | 2 |
"rdfs:subClassOf" | ["umls_source:HPO"] | 2 |
"rdfs:subClassOf" | ["umls_source:LNC"] | 1740 |
"LOINC:class_of" | ["umls_source:LNC"] | 600 |
"LOINC:has_class" | ["umls_source:LNC"] | 270 |
"rdfs:subClassOf" | ["umls_source:NCBITAXON"] | 2 |
"rdfs:subClassOf" | ["OBO:ncbitaxon/subsets/taxslim.owl"] | 2 |
"rdfs:subClassOf" | ["OBO:chebi.owl"] | 2 |
"rdfs:subClassOf" | ["OBO:hp.owl", "OBO:nbo.owl"] | 2 |
"rdfs:subClassOf" | ["OBO:uberon/ext.owl"] | 2 |
"rdfs:subClassOf" | ["EFO:efo.owl"] | 2 |
"rdfs:subClassOf" | ["ORPHANET:"] | 2 |
For a short term solution, it might be sufficient to avoid using LOINC subclass relationships (and avoid using SemMed subclass relationships):
match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by and not 'umls_source:LNC' in rel.provided_by) return count(p)
count(p) --| 12
The downside of this solution is that you would lose about 8% of the subclass relationships (in KG2.6.7).
Even without taking out SemMed, it's a lot better:
match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not 'umls_source:LNC' in rel.provided_by) return count(p)
count(p) --| 78
This does not produce as good of results in KG2.6.7c:
match p=(n)-[:`biolink:subclass_of`*1..3]->(n) WHERE all(rel in relationships(p) WHERE not "SEMMEDDB:" in rel.provided_by and not 'umls_source:LNC' in rel.provided_by) return count(p)
count(p) --| 4344
for reference: the issue about
subclass_of
cycles in KG2c (vs. KG2 itself) is here: https://github.com/RTXteam/RTX/issues/1367some examples for you:
returns 1,240 in KG2.6.4
(I only counted up to 3 hops here as it takes quite a while to look for longer paths, but I think there are many more cycles with 4+ hops)
it seems many of them involve SEMMEDDB edges, but a good chunk don't... e.g., doing the same query as above but filtering out SEMMEDDB edges:
returns 894 in KG2.6.4.
here's a specific example - all edges in this cycle are from UMLS MTH:
and an example of longer cycles - "Histamine" is apparently part of 90
subclass_of
cycles up to 10 hops long: