RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

'subclass_of' cycles in KG2c #1367

Open amykglen opened 3 years ago

amykglen commented 3 years ago

while working on KP reasoning requirement 1) in #1268, I went to build an index for Plover that recursively finds all nodes that are biolink:subclass_of a given node in KG2c (so that if someone is looking for 'diabetes' in their query graph, the query will also effectively consider 'type 2 diabetes', as well as anything that might be a subclass of 'type 2 diabetes', and so on).
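A minimal sketch of the kind of descendant index described here, assuming the `subclass_of` edges have already been pulled out of KG2c as `(child, parent)` ID pairs. The function names and the `EX:` IDs are hypothetical and are not Plover's actual code. The `seen` set is what keeps the traversal finite if the edges loop back on themselves.

```python
from collections import defaultdict, deque

def build_child_index(edges):
    """Map each parent ID to the set of its direct subclasses."""
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)
    return children

def get_descendants(children, root):
    """Return every node that is a (transitive) subclass of root."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:  # skip nodes already visited, so cycles terminate
                seen.add(child)
                queue.append(child)
    seen.discard(root)  # a node is not reported as its own subclass
    return seen

# Hypothetical toy edges (child, parent); the third edge closes a cycle.
edges = [
    ("EX:t2d", "EX:diabetes"),
    ("EX:mody", "EX:t2d"),
    ("EX:diabetes", "EX:mody"),
    ("EX:other", "EX:unrelated"),
]
children = build_child_index(edges)
descendants = get_descendants(children, "EX:diabetes")
```

Note that a visited set only makes the traversal terminate. Every node on a cycle still ends up as a 'subclass' of every other node on it, so the results are only as good as the edges.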

but it quickly became apparent that there are a lot of (directed) 'subclass_of' cycles in KG2c. a couple examples from http://kg2c-5-2.rtx.ai:7474/browser/:

match p=(n)-[:`biolink:subclass_of` *3..4]->(n) return p limit 3

[Screenshot: Screen Shot 2021-04-11 at 12 13 49 PM]

and there seem to be many such cycles - apparently 26,000 with up to 3 edges in them, but there are also larger cycles (they take forever to count, so I don't have a number, but I suspect it's large). for example, here's a 7-edge one involving Acetaminophen:

match p=(n {id:'CHEMBL.COMPOUND:CHEMBL112'})<-[:`biolink:subclass_of` *1..7]-(n) return p limit 1

[Screenshot: Screen Shot 2021-04-11 at 5 09 02 PM]

(acetaminophen alone is apparently part of about 700 subclass_of cycles with up to 7 edges)
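Since enumerating cycles path by path takes so long, one alternative is to look for strongly connected components instead: every node that sits on a directed cycle belongs to a component with more than one node (or has a self-loop), and components can be found in time linear in the number of edges. The sketch below uses Kosaraju's algorithm on `(child, parent)` pairs. It is an illustration under that assumed input format, not something that exists in the RTX code, and it reports which nodes are involved in cycles rather than how many distinct cycles there are.

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Kosaraju's algorithm over (child, parent) edge pairs."""
    graph, reverse = defaultdict(list), defaultdict(list)
    nodes = set()
    for child, parent in edges:
        graph[child].append(parent)
        reverse[parent].append(child)
        nodes.update((child, parent))

    # Pass 1: record nodes in order of DFS completion.
    # Done iteratively so large graphs do not hit Python's recursion limit.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbors = stack[-1]
            advanced = False
            for nxt in neighbors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    advanced = True
                    break
            if not advanced:
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in reverse[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components

def cyclic_components(edges):
    """Components that contain at least one directed cycle."""
    self_loops = {c for c, p in edges if c == p}
    return [c for c in strongly_connected_components(edges)
            if len(c) > 1 or c & self_loops]

# Hypothetical toy graph: A -> B -> C -> A is a cycle, E has a self-loop.
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("E", "E")]
cyclic = sorted(sorted(c) for c in cyclic_components(toy_edges))
```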

this definitely makes my task much harder, though I'm sure I can find a way to work around it... but in the bigger picture, should we worry about reasoning using this data? it seems like a bit of the wild west... I'm guessing most of these are the result of little bugs in KG2/its upstream sources and/or the node synonymizer (I haven't dived in to investigate)... and I imagine they'll be quite hard to totally eradicate.

maybe there's some cleaner way of getting 'subclass_of' info for this purpose (KP reasoning)? for example, should we only trust biolink:subclass_of edges from certain provided_bys? (e.g., maybe don't trust such edges from SEMMED?)
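A sketch of what such a source filter could look like, assuming each edge is a dict with `subject`, `object`, `predicate` and a `provided_by` list. That layout, the function names, and the `EX_SOURCE:` value are assumptions for illustration. Only SEMMEDDB is treated as untrusted here, since it is the only source named as a candidate.

```python
# Assumption: sources whose subclass_of assertions should not be used on their own.
UNTRUSTED_PREFIXES = ("SEMMEDDB:",)

def has_trusted_source(edge, untrusted_prefixes=UNTRUSTED_PREFIXES):
    """True if at least one provided_by entry is not from an untrusted source."""
    sources = edge.get("provided_by", [])
    return any(not s.startswith(untrusted_prefixes) for s in sources)

def trusted_subclass_pairs(edges):
    """Return (child, parent) pairs for subclass_of edges backed by a trusted source."""
    return [
        (e["subject"], e["object"])
        for e in edges
        if e["predicate"] == "biolink:subclass_of" and has_trusted_source(e)
    ]

# Hypothetical edges for illustration only.
example_edges = [
    {"subject": "EX:t2d", "object": "EX:diabetes",
     "predicate": "biolink:subclass_of", "provided_by": ["EX_SOURCE:ontology"]},
    {"subject": "EX:diabetes", "object": "EX:t2d",
     "predicate": "biolink:subclass_of", "provided_by": ["SEMMEDDB:"]},
    {"subject": "EX:mody", "object": "EX:diabetes",
     "predicate": "biolink:subclass_of", "provided_by": ["SEMMEDDB:", "EX_SOURCE:ontology"]},
    {"subject": "EX:drug", "object": "EX:t2d",
     "predicate": "biolink:treats", "provided_by": ["EX_SOURCE:ontology"]},
]
pairs = trusted_subclass_pairs(example_edges)
```

The choice made here is to keep an edge when any one of its sources is trusted, which matters in KG2c where merged edges can carry several `provided_by` values. A stricter policy would drop any edge that SEMMEDDB contributed to at all.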

saramsey commented 3 years ago

Agree this is a problem that we need to address.

I'm kind of thinking #1342 will be critical to addressing it. What do you think?

saramsey commented 3 years ago

> for example, should we only trust biolink:subclass_of edges from certain provided_bys? (e.g., maybe don't trust such edges from SEMMED?)

Seems like a promising idea. I also wonder if we should schedule a KG2 hackathon to work on this.

amykglen commented 3 years ago

> I'm kind of thinking #1342 will be critical to addressing it. What do you think?

yeah, totally agree! would be a bit of a nightmare trying to address it without that.. 😂

amykglen commented 3 years ago

this issue is now unblocked with KG2c.6.3 (http://kg2c-6-3.rtx.ai:7474/browser/), thanks to #1342

amykglen commented 3 years ago

also worth noting - some of these are in KG2 itself (vs. just KG2c):

match p=(n)-[:`biolink:subclass_of` *1..3]->(n) return count(p)

returns 1,240 in KG2.6.3

(only counted up to 3 hops as it takes quite a while to look for longer paths)

it seems many of them involve SEMMEDDB edges, but not all do... here's one example (in KG2.6.3) - one edge is from OBO:go/extensions/go-plus.owl and the other edge is from umls_source:GO:

match p=(n)-[e1:`biolink:subclass_of`]->(m)-[e2:`biolink:subclass_of`]->(n) where not "SEMMEDDB:" in e1.provided_by and not "SEMMEDDB:" in e2.provided_by return p limit 2

[Screenshot: Screen Shot 2021-05-13 at 11 36 00 AM]

(not sure if there should be a separate issue for these KG2 cycles, or if they'll just be addressed as part of work on this issue?)

amykglen commented 3 years ago

one more example to highlight how crazy the subclass_of situation is :) (which would also provide a good test case for whatever solution is worked up):

in KG2c, if you look for nodes that are subclass_of diabetes (MONDO:0005015) and go up to 6 levels deep, you wind up with over 250,000 distinct nodes:

match p=(n {id:"MONDO:0005015"})<-[:`biolink:subclass_of` *1..6]-(m) return count(distinct m)

returns 263,647 (on KG2c.6.3)
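To use this as a test case outside Neo4j, a depth-limited breadth-first search over a parent-to-children index gives a comparable measurement and also shows at which depth the count blows up. This is a sketch: the `children` mapping and the toy IDs are hypothetical. Unlike the Cypher query, it never counts the root itself, even if a cycle leads back to it.

```python
from collections import Counter, deque

def descendants_within(children, root, max_depth):
    """Map each subclass of root found within max_depth hops to its shortest depth."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand past the depth limit
        for child in children.get(node, ()):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    del depth[root]  # the root is not its own subclass
    return depth

# Hypothetical toy index (parent -> direct subclasses); "d" loops back to the root.
toy_children = {"ROOT": {"a", "b"}, "a": {"c"}, "c": {"d"}, "d": {"ROOT"}}
shallow = descendants_within(toy_children, "ROOT", 2)
deep = descendants_within(toy_children, "ROOT", 6)
per_depth = Counter(deep.values())  # number of new nodes first reached at each depth
```

A regression check could then assert that the number of descendants of MONDO:0005015 within 6 hops stays under some agreed threshold once a fix is in place.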

here's a random sample of some of these 263k nodes deemed 'subclasses' of diabetes:

match p=(n {id:"MONDO:0005015"})<-[:`biolink:subclass_of` *1..6]-(m) return distinct m.id, m.name order by rand() limit 200

m.id | m.name
-- | --
"OMIM:MTHU019953" | "Long phalanges"
"NCBITaxon:557599" | "Mycobacterium kansasii ATCC 12478"
"UMLS:C2881015" | "Bilateral acute angle-closure glaucoma"
"CHEMBL.COMPOUND:CHEMBL1561505" | "SID26666821"
"OMIM:MTHU032492" | "Defects in executive function"
"UniProtKB:P17098" | "ZNF8"
"MONDO:0001482" | "testicular leukemia"
"PR:O89110" | "caspase-8 (mouse)"
"UMLS:C3862265" | "Tendonitis of right wrist"
"CHEBI:165052" | "Tyr-Glu-Ala"
"UMLS:C0334601" | "Undifferentiated Retinoblastoma"
"MESH:D018092" | "Receptors, Kainic Acid"
"PR:P22725-1" | "protein Wnt-5a isoform 1 (mouse)"
"UMLS:C2228234" | "Episcleritis of left eye"
"UMLS:C3665458" | "Hypertensive heart AND chronic kidney disease with congestive heart failure"
"OMIM:MTHU037851" | "Short limbs (in some patients)"
"PR:P13405" | "retinoblastoma-associated protein (mouse)"
"OMIM:MTHU018855" | "Most remit by 6 weeks (1-6 months)"
"CHEMBL.COMPOUND:CHEMBL598951" | "BRAZILIN"
"UMLS:C3554724" | "Complete duplication of thumb phalanx"
"OMIM:MTHU005989" | "Progressive disorder due to secondary myopathy"
"VANDF:4023749" | "Fungi nail"
"UMLS:C2987267" | "Esophageal Synovial Sarcoma"

finnagin commented 2 years ago

@amykglen @saramsey is this still relevant?

saramsey commented 1 year ago

@amykglen should we close out this issue, or transfer it to the PloverDB project area, or transfer it to the RTX-KG2 project area?

amykglen commented 1 year ago

hmm, I suppose we should probably keep this open. we have RTXteam/RTX-KG2#63 for tracking this problem in KG2pre, but this one is to track the issue in KG2c, whose code still lives in the RTX repo (and this isn't a Plover issue).

it might be relevant to entity resolution work as well (I suspect KG2c has some cycles that KG2pre does not, due to incorrect merging of concepts)
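To illustrate how merging alone can introduce a cycle, here is a small sketch with made-up IDs. Before merging, the two edges form a simple chain. If `EX:x1` and `EX:x2` are (incorrectly) mapped to the same canonical node, the remapped edges point in both directions between two nodes. The `canonical` mapping and function names are hypothetical stand-ins, not the node synonymizer's real interface.

```python
def remap_edges(edges, canonical):
    """Rewrite (child, parent) edges onto canonical node IDs, dropping duplicates."""
    return {(canonical.get(c, c), canonical.get(p, p)) for c, p in edges}

def two_cycles(edges):
    """Node pairs that are subclass_of each other; a self-loop shows up as (n, n)."""
    edge_set = set(edges)
    return {tuple(sorted(e)) for e in edge_set if (e[1], e[0]) in edge_set}

# Hypothetical pre-merge edges: x1 -> y -> x2 is an acyclic chain.
pre_merge = [("EX:x1", "EX:y"), ("EX:y", "EX:x2")]

# Hypothetical merge decision that folds x1 and x2 into one canonical node.
canonical = {"EX:x1": "EX:x", "EX:x2": "EX:x"}

post_merge = remap_edges(pre_merge, canonical)
cycles_before = two_cycles(pre_merge)
cycles_after = two_cycles(post_merge)
```

Comparing the cyclic node sets before and after canonicalization in this way would separate cycles inherited from KG2pre from those created by merging.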