Open amykglen opened 3 years ago
Agree this is a problem that we need to address.
I'm kind of thinking #1342 will be critical to addressing it. What do you think?
> for example, should we only trust biolink:subclass_of edges from certain provided_bys? (e.g., maybe don't trust such edges from SEMMED?)
Seems like a promising idea. I also wonder if we should schedule a KG2 hackathon to work on this.
> I'm kind of thinking #1342 will be critical to addressing it. What do you think?
yeah, totally agree! would be a bit of a nightmare trying to address it without that.. 😂
this issue is now unblocked with KG2c.6.3 (http://kg2c-6-3.rtx.ai:7474/browser/), thanks to #1342
also worth noting - some of these are in KG2 itself (vs. just KG2c):
match p=(n)-[:`biolink:subclass_of` *1..3]->(n) return count(p)
returns 1,240 in KG2.6.3
(only counted up to 3 hops as it takes quite a while to look for longer paths)
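Since enumerating cycle paths gets slow as the hop count grows, one alternative is to find every node that sits on *any* directed cycle, regardless of length, by computing strongly connected components over an exported edge list. The sketch below is only an illustration, not code from KG2 or Plover: the function name `nodes_in_cycles`, the `(subject, object)` tuple format, and the toy IDs are all assumptions. It uses Kosaraju's two-pass algorithm with explicit stacks so that deep hierarchies don't hit Python's recursion limit.

```python
from collections import defaultdict


def nodes_in_cycles(edges):
    """Return the set of nodes that lie on at least one directed cycle.

    `edges` is an iterable of (subject, object) pairs, one per subclass_of
    edge (hypothetical export format, assumed for this sketch).
    """
    graph = defaultdict(list)    # subject -> objects
    reverse = defaultdict(list)  # object -> subjects
    nodes = set()
    cyclic = set()
    for child, parent in edges:
        graph[child].append(parent)
        reverse[parent].append(child)
        nodes.update((child, parent))
        if child == parent:
            cyclic.add(child)  # a self-loop is a one-edge cycle

    # Pass 1: iterative DFS on the forward graph, recording finish order.
    visited = set()
    finish_order = []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbors explored: this node is finished.
                finish_order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph in decreasing finish order.
    # Each search tree is one strongly connected component (SCC).
    assigned = set()
    for start in reversed(finish_order):
        if start in assigned:
            continue
        assigned.add(start)
        component = [start]
        stack = [start]
        while stack:
            node = stack.pop()
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.append(prev)
                    stack.append(prev)
        # An SCC with more than one node always contains a cycle.
        if len(component) > 1:
            cyclic.update(component)
    return cyclic


# Toy edge list with made-up IDs: A -> B -> C -> A is a cycle, E is a self-loop.
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("E", "E")]
cycle_nodes = nodes_in_cycles(toy_edges)
```

This runs in time linear in the number of nodes plus edges, so it reports which nodes are involved in cycles but not how many distinct cycles there are.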
it seems many of them involve SEMMEDDB edges, but not all do... here's one example (in KG2.6.3) - one edge is from OBO:go/extensions/go-plus.owl and the other edge is from umls_source:GO:
match p=(n)-[e1:`biolink:subclass_of`]->(m)-[e2:`biolink:subclass_of`]->(n) where not "SEMMEDDB:" in e1.provided_by and not "SEMMEDDB:" in e2.provided_by return p limit 2
(not sure if there should be a separate issue for these KG2 cycles, or if they'll just be addressed as part of work on this issue?)
one more example to highlight how crazy the subclass_of situation is :) (which would also provide a good test case for whatever solution is worked up):
in KG2c, if you look for nodes that are subclass_of diabetes (MONDO:0005015) and go up to 6 levels deep, you wind up with over 250,000 distinct nodes:
match p=(n {id:"MONDO:0005015"})<-[:`biolink:subclass_of` *1..6]-(m) return count(distinct m)
returns 263,647 (on KG2c.6.3)
here's a random sample of some of these 263k nodes deemed 'subclasses' of diabetes:
match p=(n {id:"MONDO:0005015"})<-[:`biolink:subclass_of` *1..6]-(m) return distinct m.id, m.name order by rand() limit 200
m.id | m.name
-- | --
"OMIM:MTHU019953" | "Long phalanges"
"NCBITaxon:557599" | "Mycobacterium kansasii ATCC 12478"
"UMLS:C2881015" | "Bilateral acute angle-closure glaucoma"
"CHEMBL.COMPOUND:CHEMBL1561505" | "SID26666821"
"OMIM:MTHU032492" | "Defects in executive function"
"UniProtKB:P17098" | "ZNF8"
"MONDO:0001482" | "testicular leukemia"
"PR:O89110" | "caspase-8 (mouse)"
"UMLS:C3862265" | "Tendonitis of right wrist"
"CHEBI:165052" | "Tyr-Glu-Ala"
"UMLS:C0334601" | "Undifferentiated Retinoblastoma"
"MESH:D018092" | "Receptors, Kainic Acid"
"PR:P22725-1" | "protein Wnt-5a isoform 1 (mouse)"
"UMLS:C2228234" | "Episcleritis of left eye"
"UMLS:C3665458" | "Hypertensive heart AND chronic kidney disease with congestive heart failure"
"OMIM:MTHU037851" | "Short limbs (in some patients)"
"PR:P13405" | "retinoblastoma-associated protein (mouse)"
"OMIM:MTHU018855" | "Most remit by 6 weeks (1-6 months)"
"CHEMBL.COMPOUND:CHEMBL598951" | "BRAZILIN"
"UMLS:C3554724" | "Complete duplication of thumb phalanx"
"OMIM:MTHU005989" | "Progressive disorder due to secondary myopathy"
"VANDF:4023749" | "Fungi nail"
"UMLS:C2987267" | "Esophageal Synovial Sarcoma"
@amykglen @saramsey is this still relevant?
@amykglen should we close out this issue, or transfer it to the PloverDB project area, or transfer it to the RTX-KG2 project area?
hmm, I suppose we should probably keep this open. we have RTXteam/RTX-KG2#63 for tracking this problem in KG2pre, but this one is to track the issue in KG2c, whose code still lives in the RTX repo (and this isn't a Plover issue).
it might be relevant to entity resolution work as well (I suspect KG2c has some cycles that KG2pre does not, due to incorrect merging of concepts)
in working on KP reasoning requirement 1) in #1268, I went to build an index for Plover that recursively finds all nodes that are `biolink:subclass_of` a given node in KG2c. (so that if someone is looking for 'diabetes' in their query graph, the query will also effectively consider 'type 2 diabetes', as well as anything that might be a subclass of 'type 2 diabetes', and so on...) but it quickly became apparent that there are a lot of (directed) 'subclass_of' cycles in KG2c. a couple examples from http://kg2c-5-2.rtx.ai:7474/browser/:
and there seem to be many such cycles - apparently 26,000 with up to 3 edges in them, but there are also larger cycles (they take forever to count, so I don't have a number, but I suspect it's large). for example, here's a 7-edge one involving Acetaminophen:
(acetaminophen alone is apparently part of about 700 subclass_of cycles with up to 7 edges)
this definitely makes my task much harder, though I'm sure I can find a way to work around it... but in the bigger picture, should we worry about reasoning using this data? it's seeming like a bit of the wild west... I'm guessing most of these are the result of little bugs in KG2/its upstream sources and/or the node synonymizer (haven't dived in to investigate)... and I imagine they'll be quite hard to totally eradicate.
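One possible workaround for the cycles when building a descendant index is a breadth-first traversal that tracks visited nodes, so each node is expanded at most once and a cycle can never cause an infinite loop. This is a minimal sketch and not Plover's actual implementation: the helper names `build_children_map` and `descendants`, the `(subject, object)` tuple format, and the toy IDs are assumptions made for illustration.

```python
from collections import defaultdict, deque


def build_children_map(edges):
    """Map each parent node to the set of its direct subclass_of children.

    `edges` is an iterable of (subject, object) pairs, meaning
    "subject subclass_of object" (assumed format for this sketch).
    """
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)
    return children


def descendants(children, root, max_depth=None):
    """Return all nodes reachable from `root` by following subclass_of
    edges backwards, optionally stopping after `max_depth` hops."""
    seen = {root}  # the visited set is what makes cycles harmless
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # don't expand nodes at the depth limit
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    seen.discard(root)  # a node is not reported as its own descendant
    return seen


# Toy data with made-up IDs; the last edge closes a cycle back to the root.
toy_edges = [
    ("type 2 diabetes", "diabetes"),
    ("MODY", "type 2 diabetes"),
    ("diabetes", "MODY"),
    ("some subtype", "MODY"),
]
children = build_children_map(toy_edges)
all_desc = descendants(children, "diabetes")
one_hop = descendants(children, "diabetes", max_depth=1)
```

The traversal terminates even with cycles, but it doesn't fix the data: every node in a cycle still ends up as a 'descendant' of every other node in that cycle, so the index would inherit whatever bad edges are in the graph.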
maybe there's some cleaner way of getting 'subclass_of' info for this purpose (KP reasoning)? for example, should we only trust `biolink:subclass_of` edges from certain `provided_by`s? (e.g., maybe don't trust such edges from SEMMED?)
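To make the trusted-sources idea concrete, here is a rough sketch of filtering subclass_of edges by `provided_by` before doing any reasoning over them. It rests on several assumptions: that edges are available as dicts with `subject`, `object`, `predicate`, and a list-valued `provided_by` field, that `"SEMMEDDB:"` is how that source appears in the list, and that an edge should be kept when at least one of its sources is not distrusted. The function name and the toy IDs are hypothetical.

```python
# Sources whose subclass_of assertions we choose not to rely on (assumption).
DISTRUSTED_SOURCES = frozenset({"SEMMEDDB:"})


def trusted_subclass_edges(edges, distrusted=DISTRUSTED_SOURCES):
    """Return (subject, object) pairs for subclass_of edges that are backed
    by at least one source outside the `distrusted` set."""
    kept = []
    for edge in edges:
        if edge["predicate"] != "biolink:subclass_of":
            continue  # only subclass_of edges are relevant here
        sources = set(edge.get("provided_by", []))
        # Keep the edge only if some remaining source vouches for it.
        # Edges with no provided_by at all are dropped as well.
        if sources - set(distrusted):
            kept.append((edge["subject"], edge["object"]))
    return kept


# Toy edges with made-up IDs and sources.
toy_edges = [
    {"subject": "X:1", "object": "X:2", "predicate": "biolink:subclass_of",
     "provided_by": ["SEMMEDDB:"]},
    {"subject": "X:2", "object": "X:3", "predicate": "biolink:subclass_of",
     "provided_by": ["SEMMEDDB:", "umls_source:GO"]},
    {"subject": "X:3", "object": "X:4", "predicate": "biolink:related_to",
     "provided_by": ["umls_source:GO"]},
]
trusted = trusted_subclass_edges(toy_edges)
```

A stricter variant would drop any edge that lists a distrusted source at all. Either way this wouldn't remove every cycle, since some cycles involve no SEMMEDDB edges.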