RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

idea for fancier Expand via NGD #1346

Open edeutsch opened 3 years ago

edeutsch commented 3 years ago

Breaking out from #1345

I probably should know about this, but I confess I don't: are we now able to Expand based purely on NGD? Is there a database of pre-computed top relationships? I recall we mused about this like a year ago, but I don't recall how it ended up. I would think all the predicates in NGD expansion would be just something bland like "associated_with" if not our made-up "biolink:has_normalized_google_distance_with"

ah, that's true, I forgot that the only accepted predicate should be some simple thing like you suggest. but the accepted node categories will be the same as for KG2.

yeah, we can expand using ngd (in some capacity) - the issue for that was #975. it doesn't use a database of pre-computed relationships at the moment; it's currently limited to connections in KG2c. so it first looks for neighbors of the input node in KG2c, and then computes ngd between that node and all its neighbors, and drops those with an NGD worse than a particular value.
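The neighbor-then-filter procedure described above can be sketched roughly as follows. This is only a minimal illustration, not the actual ARAX Expand code: the `neighbors` and `pmids` dictionaries, the `total_pubs` corpus size, and the `max_ngd` cutoff are all hypothetical stand-ins, and the NGD function is the standard normalized Google distance formula applied to sets of PMIDs.

```python
import math


def ngd(pmids_x, pmids_y, total_pubs):
    """Normalized Google distance between two concepts, from their PMID sets.

    Returns math.inf when the concepts never co-occur (or either has no PMIDs).
    """
    fx, fy = len(pmids_x), len(pmids_y)
    fxy = len(pmids_x & pmids_y)
    if fxy == 0:
        return math.inf
    log_fx, log_fy = math.log(fx), math.log(fy)
    denominator = math.log(total_pubs) - min(log_fx, log_fy)
    if denominator <= 0:
        return math.inf  # total_pubs is assumed to exceed both counts
    return (max(log_fx, log_fy) - math.log(fxy)) / denominator


def expand_via_ngd(node, neighbors, pmids, total_pubs, max_ngd):
    """Score each graph neighbor of `node` and keep those with NGD <= max_ngd."""
    kept = {}
    for nbr in neighbors.get(node, ()):
        score = ngd(pmids.get(node, set()), pmids.get(nbr, set()), total_pubs)
        if score <= max_ngd:
            kept[nbr] = score
    return kept


# Toy data (hypothetical): "A" co-occurs with "B" in two PMIDs, never with "C".
pmids = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {9}}
neighbors = {"A": ["B", "C"]}
result = expand_via_ngd("A", neighbors, pmids, total_pubs=1000, max_ngd=0.5)
print(result)
```

In this toy case only "B" survives the cutoff; "C" has no shared PMIDs with "A", so its NGD is treated as infinite and it is dropped.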

I'm thinking a fun project for someone in the future:

?

dkoslicki commented 3 years ago

@chunyuma is this anything you would be interested in? It’s similar in spirit to your DTD database/expander

chunyuma commented 3 years ago

Thanks @dkoslicki and @edeutsch, I'm quite interested in this, and I think it would also help the explainable DTD model by reducing the memory load when the size of KG2c is reduced.

chunyuma commented 3 years ago

One problem I'm concerned about: would this affect other databases (e.g. DTD), since we would only consider the most important nodes? Some drugs or diseases might be rare and might not be in the list of most important nodes.

dkoslicki commented 3 years ago

I don’t think this would affect DTD: the idea of “important nodes” is, I think, just to reduce the total number of pairs that need to be computed. @edeutsch can clarify if he was thinking otherwise, but basically we would not be removing nodes from KG2/KG2C, but rather computing NGD on a subset of KG2/KG2C, storing the results in a database, and intelligently using that database as an expander.

chunyuma commented 3 years ago

Ah, I see! Thanks @dkoslicki

edeutsch commented 3 years ago

Yes, that's correct. I think in our 6 million nodes the vast majority will never appear in a query and I think we can ignore them for a first attempt at this. For example, do a search for ibuprofen and RXNORM in our KG and you get:

RXNORM:368840   Ibuprofen Oral Tablet [Genpril]     biolink:Drug
RXNORM:368823   Ibuprofen Oral Tablet [Ibu]     biolink:Drug
RXNORM:637192   Ibuprofen 10 MG/ML      biolink:Drug
RXNORM:637195   Ibuprofen 10 MG/ML [Neoprofen]      biolink:Drug
RXNORM:1722333  Ibuprofen 200 MG / Phenylephrine Hydrochloride 10 MG Oral Tablet [Advil Sinus Congestion and Pain]      biolink:Drug
RXNORM:1722329  Ibuprofen 200 MG / Phenylephrine Hydrochloride 10 MG [Advil Sinus Congestion and Pain]      biolink:Drug
RXNORM:1722330  Ibuprofen / Phenylephrine Oral Tablet [Advil Sinus Congestion and Pain]     biolink:Drug
RXNORM:373693   Ibuprofen / Pseudoephedrine Oral Capsule        biolink:Drug
RXNORM:314047   Ibuprofen 50 MG Chewable Tablet     biolink:Drug
RXNORM:577191   Chlorpheniramine / Ibuprofen / Pseudoephedrine Oral Suspension      biolink:Drug
RXNORM:1300267  Ibuprofen 200 MG Oral Tablet [Proprinal]        biolink:Drug
RXNORM:142102   Ibuprofen 50 MG/ML Topical Spray        biolink:Drug
RXNORM:372455   Codeine / Ibuprofen Oral Tablet     biolink:Drug
RXNORM:372456   Codeine / Ibuprofen Extended Release Oral Tablet        biolink:Drug
RXNORM:372449   Ibuprofen Extended Release Oral Tablet      biolink:Drug
RXNORM:201126   Ibuprofen 200 MG Oral Tablet [Motrin]       biolink:Drug
RXNORM:36761    ibuprofen lysine        biolink:Drug
RXNORM:484259   Ibuprofen / Oxycodone       biolink:Drug
RXNORM:1300263  Ibuprofen 200 MG [Proprinal]        biolink:Drug
RXNORM:1300264  Ibuprofen Oral Tablet [Proprinal]       biolink:Drug
RXNORM:392668   Ibuprofen 0.05 MG/MG / LEVOMENTHOL 0.03 MG/MG Topical Gel       biolink:Drug
RXNORM:392617   Ibuprofen / Menthol     biolink:Drug
RXNORM:1297369  Chlorpheniramine Maleate 0.2 MG/ML / Ibuprofen 20 MG/ML / Pseudoephedrine Hydrochloride 3 MG/ML Oral Suspension     biolink:Drug
RXNORM:380819   Ibuprofen Topical Foam      biolink:Drug
RXNORM:1297390  Chlorpheniramine Maleate 2 MG / Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG Oral Tablet      biolink:Drug
RXNORM:367939   Ibuprofen / Pseudoephedrine Oral Tablet [Advil Cold and Sinus]      biolink:Drug
RXNORM:333683   Ibuprofen 40 MG/ML      biolink:Drug
RXNORM:1295502  Ibuprofen Chewable Product      biolink:Drug
RXNORM:1090449  Ibuprofen / Pseudoephedrine Oral Tablet [Wal-Profen Cold and Sinus]     biolink:Drug
RXNORM:1158493  Famotidine / Ibuprofen Oral Product     biolink:Drug
RXNORM:5640 Ibuprofen       biolink:Drug
RXNORM:202098   Ibuprofen 800 MG Oral Tablet [Motrin]       biolink:Drug
RXNORM:643059   Diphenhydramine / Ibuprofen Oral Tablet     biolink:Drug
RXNORM:637197   2 ML Ibuprofen 10 MG/ML Injection [Neoprofen]       biolink:Drug
RXNORM:544393   Ibuprofen 20 MG/ML Oral Suspension [Motrin]     biolink:Drug
RXNORM:544391   Ibuprofen 20 MG/ML [Motrin]     biolink:Drug
RXNORM:544392   Ibuprofen Oral Suspension [Motrin]      biolink:Drug
RXNORM:1007410  Carisoprodol / Ibuprofen        biolink:Drug
RXNORM:1007329  Ibuprofen / Phenylephrine       biolink:Drug
RXNORM:1007373  Ibuprofen / Vitamin B 12        biolink:Drug
RXNORM:1007917  Hydroxocobalamin / Ibuprofen        biolink:Drug
RXNORM:1007482  Ibuprofen / Lidocaine       biolink:Drug
RXNORM:1369775  Ibuprofen 200 MG / Phenylephrine Hydrochloride 10 MG Oral Tablet        biolink:Drug
RXNORM:1007823  cyclonium / Ibuprofen       biolink:Drug
RXNORM:2045474  Ibuprofen Oral Tablet [Dragon Tabs]     biolink:Drug
RXNORM:2045473  Ibuprofen 200 MG [Dragon Tabs]      biolink:Drug
RXNORM:2045477  Ibuprofen 200 MG Oral Tablet [Dragon Tabs]      biolink:Drug
RXNORM:567707   Ibuprofen 400 MG [Ibu]      biolink:Drug
RXNORM:567715   Ibuprofen 600 MG [Ibu]      biolink:Drug
RXNORM:567719   Ibuprofen 800 MG [Ibu]      biolink:Drug
RXNORM:1299021  Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG Oral Tablet      biolink:Drug
RXNORM:1299022  Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG Oral Tablet [Advil Cold and Sinus]       biolink:Drug
RXNORM:1299020  Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG Oral Capsule [Advil Cold and Sinus]      biolink:Drug
RXNORM:1299018  Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG Oral Capsule     biolink:Drug
RXNORM:1299019  Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG [Advil Cold and Sinus]       biolink:Drug
RXNORM:643063   Diphenhydramine / Ibuprofen Oral Tablet [Advil PM]      biolink:Drug
RXNORM:814985   Ibuprofen / Tolperisone     biolink:Drug
RXNORM:643100   Ibuprofen 200 MG [Wal-Profen]       biolink:Drug
RXNORM:643101   Ibuprofen Oral Tablet [Wal-Profen]      biolink:Drug
RXNORM:643102   Ibuprofen 200 MG Oral Tablet [Wal-Profen]       biolink:Drug
RXNORM:393432   Ibuprofen 0.1 MG/MG     biolink:Drug
RXNORM:393550   Ibuprofen / LEVOMENTHOL Topical Gel     biolink:Drug
RXNORM:368308   Hydrocodone / Ibuprofen Oral Tablet [Vicoprofen]        biolink:Drug
RXNORM:1159018  Famotidine / Ibuprofen Pill     biolink:Drug
RXNORM:565689   Ibuprofen 200 MG [Motrin]       biolink:Drug
RXNORM:854761   Ibuprofen 40 MG/ML [Motrin]     biolink:Drug
RXNORM:854762   Ibuprofen 40 MG/ML Oral Suspension [Motrin]     biolink:Drug
RXNORM:795911   Ibuprofen / Pseudoephedrine Oral Capsule [Advil Cold and Sinus]     biolink:Drug
RXNORM:335000   Ibuprofen 50 MG/ML      biolink:Drug
RXNORM:645634   Diphenhydramine / Ibuprofen Oral Capsule        biolink:Drug
RXNORM:1429044  Ibuprofen, Sodium Salt      biolink:Drug
RXNORM:565143   Ibuprofen 200 MG [Advil]        biolink:Drug
RXNORM:854183   8 ML Ibuprofen 100 MG/ML Injection      biolink:Drug
RXNORM:854182   Ibuprofen 100 MG/ML     biolink:Drug
RXNORM:854185   Ibuprofen 100 MG/ML [Caldolor]      biolink:Drug
RXNORM:854187   8 ML Ibuprofen 100 MG/ML Injection [Caldolor]       biolink:Drug
RXNORM:197803   Ibuprofen 20 MG/ML Oral Suspension      biolink:Drug
RXNORM:197806   Ibuprofen 600 MG Oral Tablet        biolink:Drug
RXNORM:197805   Ibuprofen 400 MG Oral Tablet        biolink:Drug
RXNORM:197807   Ibuprofen 800 MG Oral Tablet        biolink:Drug
RXNORM:993798   Ibuprofen / Phenylephrine Oral Tablet       biolink:Drug
RXNORM:566095   Ibuprofen 800 MG [Motrin]       biolink:Drug
RXNORM:1008079  homatropine / Ibuprofen     biolink:Drug
RXNORM:820465   Carisoprodol / Dexamethasone / Ibuprofen        biolink:Drug
RXNORM:1008170  Ibuprofen / Niacin      biolink:Drug
RXNORM:380845   Ibuprofen 0.05 MG/MG        biolink:Drug
RXNORM:198405   Ibuprofen 100 MG Oral Tablet        biolink:Drug
RXNORM:380813   Ibuprofen 300 MG Extended Release Oral Capsule      biolink:Drug
RXNORM:380812   Ibuprofen Extended Release Oral Capsule     biolink:Drug
RXNORM:821036   Chlorzoxazone / Ibuprofen       biolink:Drug
RXNORM:1008502  Ibuprofen / pseudoisocytidine       biolink:Drug
RXNORM:1165305  Ibuprofen / Oxycodone Oral Product      biolink:Drug
RXNORM:1165307  Ibuprofen / Phenylephrine Oral Product      biolink:Drug
RXNORM:1165306  Ibuprofen / Oxycodone Pill      biolink:Drug
RXNORM:1165309  Ibuprofen / Pseudoephedrine Oral Liquid Product     biolink:Drug
RXNORM:1165308  Ibuprofen / Phenylephrine Pill      biolink:Drug
RXNORM:1165310  Ibuprofen / Pseudoephedrine Oral Product        biolink:Drug
RXNORM:1165311  Ibuprofen / Pseudoephedrine Pill        biolink:Drug
RXNORM:316074   Ibuprofen 200 MG        biolink:Drug
RXNORM:316073   Ibuprofen 20 MG/ML      biolink:Drug
RXNORM:316076   Ibuprofen 50 MG     biolink:Drug
RXNORM:316075   Ibuprofen 300 MG        biolink:Drug
RXNORM:316078   Ibuprofen 800 MG        biolink:Drug
RXNORM:316077   Ibuprofen 600 MG        biolink:Drug
RXNORM:316072   Ibuprofen 100 MG        biolink:Drug
RXNORM:1008440  Ibuprofen / Scopolamine     biolink:Drug
RXNORM:1165299  Ibuprofen / LEVOMENTHOL Topical Product     biolink:Drug
RXNORM:1940584  Ibuprofen / Phenylephrine Oral Tablet [Wal-Profen Congestion Relief and Pain]       biolink:Drug
RXNORM:1940583  Ibuprofen 200 MG / Phenylephrine Hydrochloride 10 MG [Wal-Profen Congestion Relief and Pain]        biolink:Drug
RXNORM:1940587  Ibuprofen 200 MG / Phenylephrine Hydrochloride 10 MG Oral Tablet [Wal-Profen Congestion Relief and Pain]        biolink:Drug
RXNORM:389244   Ibuprofen 0.1 MG/MG Topical Gel     biolink:Drug
RXNORM:644895   Diphenhydramine / Ibuprofen     biolink:Drug
RXNORM:900434   Ibuprofen 200 MG Oral Tablet [Addaprin]     biolink:Drug
RXNORM:900433   Ibuprofen Oral Tablet [Addaprin]        biolink:Drug
RXNORM:900432   Ibuprofen 200 MG [Addaprin]     biolink:Drug
RXNORM:644386   Ibuprofen 200 MG Oral Capsule [Wal-Profen]      biolink:Drug
RXNORM:644385   Ibuprofen Oral Capsule [Wal-Profen]     biolink:Drug
RXNORM:93574    Ibuprofen Oral Tablet [Nuprin]      biolink:Drug
RXNORM:1009128  Caffeine / Ergotamine / Ibuprofen       biolink:Drug
RXNORM:1009037  Ibuprofen / Methocarbamol       biolink:Drug
RXNORM:204442   Ibuprofen 40 MG/ML Oral Suspension      biolink:Drug
RXNORM:1152222  Diphenhydramine / Ibuprofen Oral Product        biolink:Drug
RXNORM:1152223  Diphenhydramine / Ibuprofen Pill        biolink:Drug
RXNORM:606989   Ibuprofen Oral Capsule [Motrin]     biolink:Drug
RXNORM:606990   Ibuprofen 200 MG Oral Capsule [Motrin]      biolink:Drug
RXNORM:317388   Ibuprofen 400 MG        biolink:Drug
RXNORM:724134   Hydrocodone / Ibuprofen Oral Tablet [Reprexain]     biolink:Drug
RXNORM:206917   Ibuprofen 800 MG Oral Tablet [Ibu]      biolink:Drug
RXNORM:206913   Ibuprofen 600 MG Oral Tablet [Ibu]      biolink:Drug
RXNORM:206905   Ibuprofen 400 MG Oral Tablet [Ibu]      biolink:Drug
RXNORM:2178275  200 ML Ibuprofen 4 MG/ML Injection [Caldolor]       biolink:Drug
RXNORM:2178273  200 ML Ibuprofen 4 MG/ML Injection      biolink:Drug
RXNORM:2178274  Ibuprofen 4 MG/ML [Caldolor]        biolink:Drug
RXNORM:2178272  Ibuprofen 4 MG/ML       biolink:Drug
RXNORM:377956   Ibuprofen Topical Gel       biolink:Drug
RXNORM:758973   Hydrocodone / Ibuprofen Oral Tablet [Ibudone]       biolink:Drug
RXNORM:901814   Diphenhydramine Hydrochloride 25 MG / Ibuprofen 200 MG Oral Capsule     biolink:Drug
RXNORM:901817   Diphenhydramine Citrate 38 MG / Ibuprofen 200 MG [Advil PM]     biolink:Drug
RXNORM:901818   Diphenhydramine Citrate 38 MG / Ibuprofen 200 MG Oral Tablet [Advil PM]     biolink:Drug
RXNORM:817356   Acetaminophen / Codeine / Ibuprofen     biolink:Drug
RXNORM:1049589  Ibuprofen 400 MG / Oxycodone Hydrochloride 5 MG Oral Tablet     biolink:Drug
RXNORM:1299088  Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG [Wal-Profen Cold and Sinus]      biolink:Drug
RXNORM:1299089  Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG Oral Tablet [Wal-Profen Cold and Sinus]      biolink:Drug
RXNORM:567695   Ibuprofen 200 MG [Nuprin]       biolink:Drug
RXNORM:567680   Ibuprofen 20 MG/ML [Advil]      biolink:Drug
RXNORM:567688   Ibuprofen 200 MG [Genpril]      biolink:Drug
RXNORM:710303   Codeine / Ibuprofen     biolink:Drug
RXNORM:401976   Ibuprofen 300 MG / Pseudoephedrine 45 MG Oral Capsule       biolink:Drug
RXNORM:1310487  Ibuprofen 20 MG/ML / Pseudoephedrine Hydrochloride 3 MG/ML Oral Suspension      biolink:Drug
RXNORM:1310499  Chlorpheniramine / Ibuprofen / Phenylephrine Oral Product       biolink:Drug
RXNORM:895658   Diphenhydramine / Ibuprofen Oral Tablet [Motrin PM]     biolink:Drug
RXNORM:895666   Diphenhydramine Citrate 38 MG / Ibuprofen 200 MG Oral Tablet [Motrin PM]        biolink:Drug
RXNORM:895664   Diphenhydramine Citrate 38 MG / Ibuprofen 200 MG Oral Tablet        biolink:Drug
RXNORM:895665   Diphenhydramine Citrate 38 MG / Ibuprofen 200 MG [Motrin PM]        biolink:Drug
RXNORM:1310502  Chlorpheniramine / Ibuprofen / Phenylephrine        biolink:Drug
RXNORM:1310503  Chlorpheniramine Maleate 4 MG / Ibuprofen 200 MG / Phenylephrine Hydrochloride 10 MG Oral Tablet        biolink:Drug
RXNORM:1310500  Chlorpheniramine / Ibuprofen / Phenylephrine Pill       biolink:Drug
RXNORM:1310501  Chlorpheniramine / Ibuprofen / Phenylephrine Oral Tablet        biolink:Drug
RXNORM:377325   Ibuprofen Topical Spray     biolink:Drug
RXNORM:250418   Ibuprofen 800 MG Extended Release Oral Tablet       biolink:Drug
RXNORM:1100064  Famotidine / Ibuprofen Oral Tablet      biolink:Drug
RXNORM:1100065  Famotidine / Ibuprofen      biolink:Drug
RXNORM:1100068  Famotidine 26.6 MG / Ibuprofen 800 MG [Duexis]      biolink:Drug
RXNORM:1100069  Famotidine / Ibuprofen Oral Tablet [Duexis]     biolink:Drug
RXNORM:1100066  Famotidine 26.6 MG / Ibuprofen 800 MG Oral Tablet       biolink:Drug
RXNORM:1100070  Famotidine 26.6 MG / Ibuprofen 800 MG Oral Tablet [Duexis]      biolink:Drug
RXNORM:483322   Ibuprofen / Oxycodone Oral Tablet       biolink:Drug
RXNORM:226617   Ibuprofen 50 MG/ML Topical Foam     biolink:Drug
RXNORM:214652   Ibuprofen / Pseudoephedrine     biolink:Drug
RXNORM:792241   Ibuprofen Chewable Tablet [Motrin]      biolink:Drug
RXNORM:792240   Ibuprofen 100 MG [Motrin]       biolink:Drug
RXNORM:792242   Ibuprofen 100 MG Chewable Tablet [Motrin]       biolink:Drug
RXNORM:214627   Hydrocodone / Ibuprofen     biolink:Drug
RXNORM:902632   Diphenhydramine / Ibuprofen Oral Capsule [Advil PM Liqui Gels]      biolink:Drug
RXNORM:902633   Diphenhydramine Hydrochloride 25 MG / Ibuprofen 200 MG Oral Capsule [Advil PM Liqui Gels]       biolink:Drug
RXNORM:902631   Diphenhydramine Hydrochloride 25 MG / Ibuprofen 200 MG [Advil PM Liqui Gels]        biolink:Drug
RXNORM:153008   Ibuprofen 200 MG Oral Tablet [Advil]        biolink:Drug
RXNORM:377732   Ibuprofen Topical Cream     biolink:Drug
RXNORM:370674   Ibuprofen Oral Tablet       biolink:Drug
RXNORM:370673   Ibuprofen Chewable Tablet       biolink:Drug
RXNORM:370672   Ibuprofen Oral Suspension       biolink:Drug
RXNORM:370678   Ibuprofen / Pseudoephedrine Oral Tablet     biolink:Drug
RXNORM:370677   Ibuprofen / Pseudoephedrine Oral Suspension     biolink:Drug
RXNORM:370676   Hydrocodone / Ibuprofen Oral Tablet     biolink:Drug
RXNORM:370675   Ibuprofen Oral Capsule      biolink:Drug
RXNORM:1359097  Ibuprofen 200 MG Oral Tablet [Ibutab]       biolink:Drug
RXNORM:1359093  Ibuprofen 200 MG [Ibutab]       biolink:Drug
RXNORM:1359094  Ibuprofen Oral Tablet [Ibutab]      biolink:Drug
RXNORM:818102   Acetaminophen / Ibuprofen       biolink:Drug
RXNORM:206878   Ibuprofen 20 MG/ML Oral Suspension [Advil]      biolink:Drug
RXNORM:206886   Ibuprofen 200 MG Oral Tablet [Genpril]      biolink:Drug
RXNORM:206893   Ibuprofen 200 MG Oral Tablet [Nuprin]       biolink:Drug
RXNORM:404789   Chlorpheniramine / Ibuprofen / Pseudoephedrine      biolink:Drug
RXNORM:1154775  Chlorpheniramine / Ibuprofen / Pseudoephedrine Oral Liquid Product      biolink:Drug
RXNORM:1154776  Chlorpheniramine / Ibuprofen / Pseudoephedrine Oral Product     biolink:Drug
RXNORM:1154777  Chlorpheniramine / Ibuprofen / Pseudoephedrine Pill     biolink:Drug
RXNORM:1154818  Codeine / Ibuprofen Oral Product        biolink:Drug
RXNORM:1154819  Codeine / Ibuprofen Pill        biolink:Drug
RXNORM:1791362  Ibuprofen Injection [Caldolor]      biolink:Drug
RXNORM:1791366  Ibuprofen Injection [Neoprofen]     biolink:Drug
RXNORM:859331   Hydrocodone Bitartrate 10 MG / Ibuprofen 200 MG Oral Tablet [Reprexain]     biolink:Drug
RXNORM:859330   Hydrocodone Bitartrate 10 MG / Ibuprofen 200 MG [Reprexain]     biolink:Drug
RXNORM:859315   Hydrocodone Bitartrate 10 MG / Ibuprofen 200 MG Oral Tablet     biolink:Drug
RXNORM:859317   Hydrocodone Bitartrate 10 MG / Ibuprofen 200 MG Oral Tablet [Ibudone]       biolink:Drug
RXNORM:859316   Hydrocodone Bitartrate 10 MG / Ibuprofen 200 MG [Ibudone]       biolink:Drug
RXNORM:310965   Ibuprofen 200 MG Oral Tablet        biolink:Drug
RXNORM:310963   Ibuprofen 100 MG Chewable Tablet        biolink:Drug
RXNORM:310964   Ibuprofen 200 MG Oral Capsule       biolink:Drug
RXNORM:1101917  Ibuprofen 200 MG [Counteract IB]        biolink:Drug
RXNORM:1101918  Ibuprofen Oral Tablet [Counteract IB]       biolink:Drug
RXNORM:1101919  Ibuprofen 200 MG Oral Tablet [Counteract IB]        biolink:Drug
RXNORM:731528   Ibuprofen Chewable Tablet [Advil]       biolink:Drug
RXNORM:731529   Ibuprofen 50 MG Chewable Tablet [Advil]     biolink:Drug
RXNORM:731527   Ibuprofen 50 MG [Advil]     biolink:Drug
RXNORM:731535   Ibuprofen 100 MG Oral Tablet [Advil]        biolink:Drug
RXNORM:731536   Ibuprofen 100 MG Chewable Tablet [Advil]        biolink:Drug
RXNORM:731533   Ibuprofen 200 MG Oral Capsule [Advil]       biolink:Drug
RXNORM:731534   Ibuprofen 100 MG [Advil]        biolink:Drug
RXNORM:731531   Ibuprofen 40 MG/ML Oral Suspension [Advil]      biolink:Drug
RXNORM:731532   Ibuprofen Oral Capsule [Advil]      biolink:Drug
RXNORM:731530   Ibuprofen 40 MG/ML [Advil]      biolink:Drug
RXNORM:227159   Ibuprofen 200 MG Extended Release Oral Capsule      biolink:Drug
RXNORM:858798   Hydrocodone Bitartrate 7.5 MG / Ibuprofen 200 MG Oral Tablet        biolink:Drug
RXNORM:858783   Hydrocodone Bitartrate 5 MG / Ibuprofen 200 MG [Reprexain]      biolink:Drug
RXNORM:858780   Hydrocodone Bitartrate 5 MG / Ibuprofen 200 MG Oral Tablet [Ibudone]        biolink:Drug
RXNORM:858784   Hydrocodone Bitartrate 5 MG / Ibuprofen 200 MG Oral Tablet [Reprexain]      biolink:Drug
RXNORM:858772   Hydrocodone Bitartrate 2.5 MG / Ibuprofen 200 MG Oral Tablet [Reprexain]        biolink:Drug
RXNORM:858771   Hydrocodone Bitartrate 2.5 MG / Ibuprofen 200 MG [Reprexain]        biolink:Drug
RXNORM:858770   Hydrocodone Bitartrate 2.5 MG / Ibuprofen 200 MG Oral Tablet        biolink:Drug
RXNORM:858779   Hydrocodone Bitartrate 5 MG / Ibuprofen 200 MG [Ibudone]        biolink:Drug
RXNORM:858778   Hydrocodone Bitartrate 5 MG / Ibuprofen 200 MG Oral Tablet      biolink:Drug
RXNORM:1292323  Diphenhydramine Citrate 38 MG / Ibuprofen 200 MG Oral Capsule       biolink:Drug
RXNORM:541713   Ibuprofen 800 MG Oral Tablet [Samson 8]     biolink:Drug
RXNORM:541712   Ibuprofen Oral Tablet [Samson 8]        biolink:Drug
RXNORM:541711   Ibuprofen 800 MG [Samson 8]     biolink:Drug
RXNORM:93358    Ibuprofen Oral Tablet [Motrin]      biolink:Drug
RXNORM:1542984  Hydrocodone Bitartrate 10 MG / Ibuprofen 200 MG [Xylon]     biolink:Drug
RXNORM:1542988  Hydrocodone Bitartrate 10 MG / Ibuprofen 200 MG Oral Tablet [Xylon]     biolink:Drug
RXNORM:1542985  Hydrocodone / Ibuprofen Oral Tablet [Xylon]     biolink:Drug
RXNORM:1747293  Ibuprofen Injection     biolink:Drug
RXNORM:1747294  2 ML Ibuprofen 10 MG/ML Injection       biolink:Drug
RXNORM:687386   Ibuprofen / LEVOMENTHOL     biolink:Drug
RXNORM:858838   Hydrocodone Bitartrate 7.5 MG / Ibuprofen 200 MG Oral Tablet [Vicoprofen]       biolink:Drug
RXNORM:858837   Hydrocodone Bitartrate 7.5 MG / Ibuprofen 200 MG [Vicoprofen]       biolink:Drug
RXNORM:379847   Ibuprofen 3 MG/ML       biolink:Drug
RXNORM:850424   Ibuprofen 200 MG Oral Tablet [Ibuprohm]     biolink:Drug
RXNORM:850423   Ibuprofen Oral Tablet [Ibuprohm]        biolink:Drug
RXNORM:850422   Ibuprofen 200 MG [Ibuprohm]     biolink:Drug
RXNORM:2184152  Ibuprofen 200 MG / Phenylephrine Hydrochloride 5 MG Oral Tablet     biolink:Drug
RXNORM:997280   Codeine Phosphate 20 MG / Ibuprofen 300 MG Extended Release Oral Tablet     biolink:Drug
RXNORM:1156280  Ibuprofen Topical Product       biolink:Drug
RXNORM:1156275  Ibuprofen Injectable Product        biolink:Drug
RXNORM:1156278  Ibuprofen Pill      biolink:Drug
RXNORM:1156277  Ibuprofen Oral Product      biolink:Drug
RXNORM:1156276  Ibuprofen Oral Liquid Product       biolink:Drug
RXNORM:997165   Codeine Phosphate 12.8 MG / Ibuprofen 200 MG Oral Tablet        biolink:Drug
RXNORM:997164   Codeine Phosphate 12.5 MG / Ibuprofen 200 MG Oral Tablet        biolink:Drug
RXNORM:365861   Ibuprofen Oral Suspension [Advil]       biolink:Drug
RXNORM:806013   Ibuprofen 100 MG Oral Tablet [Motrin]       biolink:Drug
RXNORM:1597118  Chondroitin Sulfates / Glucosamine / Ibuprofen      biolink:Drug
RXNORM:91703    Ibuprofen Oral Tablet [Advil]       biolink:Drug
RXNORM:141998   Ibuprofen 50 MG/ML Topical Cream        biolink:Drug
RXNORM:141997   Ibuprofen 0.05 MG/MG Topical Gel        biolink:Drug
RXNORM:141993   Ibuprofen 3 MG/ML Oral Suspension       biolink:Drug
RXNORM:851211   60 (caffeine 65 MG / riboflavin 6.25 MG / thiamine 25 MG / vitamin B 12 0.125 MG / vitamin B6 25 MG Oral Capsule) / 60 (ibuprofen 800 MG Oral Tablet) Pack  biolink:Drug
RXNORM:1162789  Hydrocodone / Ibuprofen Pill        biolink:Drug
RXNORM:1162788  Hydrocodone / Ibuprofen Oral Product        biolink:Drug
RXNORM:405928   Chlorpheniramine / Ibuprofen / Pseudoephedrine Oral Tablet      biolink:Drug

I wonder if we can simplify this list further so we only compute on a handful of these rather than the huge list.

chunyuma commented 3 years ago

I'm not sure if this method can remove some of the generic concepts in KG2c, but I just want to point out the problem here. I think some nodes in KG2c (please see the list below) have a generic semantic meaning and might also never appear in a query (e.g. MONDO:0004992, which is cancer, and SO:0001217, which is protein_coding_gene). These nodes normally have an extremely high in-degree.

Please ignore the accuracy of the category column below, because the table is summarized from my local version of KG2c, which excluded some node types (e.g. biolink:NamedThing, biolink:MolecularEntity) and caused NodeSynonymizer to assign some wrong categories.

| curie_id | name | category | indegree | outdegree |
|---|---|---|---|---|
| SO:0001217 | protein_coding_gene | biolink:Gene | 97419 | 0 |
| LOINC:LP208893-0 | Pt | biolink:Procedure | 83179 | 1 |
| CHEMBL.COMPOUND:CHEMBL87852 | Hexadecanoic acid (S)-2-hexadecanoyloxy-1-hydr... | biolink:ChemicalSubstance | 59922 | 20571 |
| UMLS:C0025255 | Membrane | biolink:GrossAnatomicalStructure | 59623 | 2243 |
| CHEMBL.COMPOUND:CHEMBL307679 | Phosphoric acid mono-[5-(4-amino-2-oxo-2H-pyri... | biolink:ChemicalSubstance | 57431 | 35842 |
| CHEMBL.COMPOUND:CHEMBL1623949 |  | biolink:ChemicalSubstance | 54698 | 51477 |
| CHEMBL.COMPOUND:CHEMBL2286758 | 1-palmitoyl-2-(3-trans)-hexadecenoyl-sn-glycer... | biolink:ChemicalSubstance | 50788 | 31647 |
| KEGG:C00269 | CDP-diacylglycerol | biolink:Metabolite | 42988 | 30127 |
| LOINC:LP7753-9 | Qn | biolink:Procedure | 41873 | 0 |
| CHEMBL.COMPOUND:CHEMBL3343985 | Trilinolein | biolink:ChemicalSubstance | 39900 | 11874 |
| DRUGBANK:DB03429 | Tetrastearoyl cardiolipin | biolink:ChemicalSubstance | 38460 | 20150 |
| MONDO:0000001 | disease or disorder | biolink:Disease | 26125 | 9246 |
| UMLS:C0007634 | Cell | biolink:Cell | 25666 | 8887 |
| LOINC:LP7751-3 | Ord | biolink:Procedure | 24643 | 0 |
| LOINC:LP7567-3 | Ser | biolink:Procedure | 21673 | 0 |
| MONDO:0004992 | cancer | biolink:Disease | 21311 | 10623 |
| CHEBI:15378 | hydron | biolink:ChemicalSubstance | 21017 | 54538 |
| CHEBI:36080 | protein | biolink:Protein | 20927 | 1032 |
| CHEMBL.COMPOUND:CHEMBL1098659 | WATER | biolink:ChemicalSubstance | 19740 | 60653 |
| PR:000029067 | Homo sapiens protein | biolink:Protein | 19108 | 1 |
| PR:000029032 | Mus musculus protein | biolink:Protein | 17115 | 1 |
| LOINC:LA4634-7 | Patient | biolink:Procedure | 16877 | 0 |
| UMLS:C0040300 | Portion of tissue | biolink:GrossAnatomicalStructure | 16106 | 2477 |
| PR:000029045 | Arabidopsis thaliana protein | biolink:Protein | 15834 | 1 |
| CHEMBL.COMPOUND:CHEMBL1488784 | SID11113658 | biolink:ChemicalSubstance | 15825 | 16530 |
| OMIM:MTHU000046 | Growth | biolink:PhenotypicFeature | 15342 | 2644 |
| CHEMBL.COMPOUND:CHEMBL3321993 | TF | biolink:ChemicalSubstance | 14334 | 12663 |
| LOINC:LP20667-9 | Ab | biolink:Procedure | 14307 | 0 |
| UMLS:C0006104 | Brain | biolink:GrossAnatomicalStructure | 13475 | 686 |
| VANDF:4017451 | Liver | biolink:ChemicalSubstance | 13091 | 833 |
| LOINC:MTHU000096 | Microbiology | biolink:Procedure | 12785 | 1 |

dkoslicki commented 3 years ago

Here’s an oddball idea: if a bioentity never shows up in any PubMed abstract, it’s probably not “too important.” That wouldn’t get rid of terms like “Microbiology” and “Brain”, but it would get rid of things like “1-palmitoyl-2-(3-trans)-hexadecenoyl-sn-glycer...” And just a side note: I think some care will be needed for the generic terms. I have seen SME queries that ask things like “which genes are expressed in the liver?” So we would want that generic term.

edeutsch commented 3 years ago

That is an interesting question for the FastNGDers (@finnagin, @amykglen?): of the 6.1 million nodes in KG2.5.2C, how many have at least one PMID associated with them in our database? That alone may chop the list down substantially, although probably not enough. One thing doesn't seem to make sense to me. KG2.5.2 has 10 million nodes, while KG2.5.2C has 6 million nodes. Not a big drop. Yet nearly every concept in KG2C that I've cared about has had at least a dozen nodes in its cluster. So this suggests that there are millions of nodes that probably have no friends, and I wonder if they're useful.

As an example, I do notice that we have 1.78 million nodes that are just NCBITaxons. I wonder if this is really a useful thing. I wonder if we could remove 1.77 million NCBITaxon nodes without sacrificing any practical query capability.

amykglen commented 3 years ago

yeah, I believe only 1.6 million KG2c nodes have one or more PMIDs in the fast NGD database. helps quite a bit for sure, though 1.6m * 1.6m is probably still too much. :)

(and indeed I think the majority of nodes in KG2c are almost never returned in ARAX queries. the nodes that do get returned in ARAX queries are overwhelmingly the ones with PMIDs; that's why the fastNGD 'hit rate' is in the 99% range, even though only about a quarter of the KG2c nodes have any PMIDs in the fastNGD database.)
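To make the size of the problem concrete, here is a small sketch of the "keep only nodes with PMIDs, then count pairs" reasoning. The `curie_to_pmids` dictionary is a hypothetical stand-in for the fast NGD database, not its real interface. One observation that follows from the NGD formula itself: NGD is symmetric in its two arguments, so only unordered pairs strictly need to be computed, which is roughly half of n squared.

```python
def nodes_with_pmids(curie_to_pmids):
    """Return the curies that have at least one PMID attached."""
    return [curie for curie, pmid_set in curie_to_pmids.items() if pmid_set]


def ordered_pairs(n):
    """Every node against every node, as in the n * n estimate."""
    return n * n


def unordered_pairs(n):
    """Distinct pairs only; enough when the score is symmetric."""
    return n * (n - 1) // 2


# Toy stand-in for the curie -> PMIDs lookup (hypothetical data).
curie_to_pmids = {"X:1": {11, 12}, "X:2": set(), "X:3": {12}}
candidates = nodes_with_pmids(curie_to_pmids)

n = 1_600_000
print(len(candidates), ordered_pairs(n), unordered_pairs(n))
```

For 1.6 million nodes this gives 2.56 trillion ordered pairs and about 1.28 trillion unordered pairs.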

edeutsch commented 3 years ago

I have some fanatical programming friends who insist that the smallest possible program that can still do the job is the best one. I wonder if some element of this ethos can be applied to KG2C? What is the smallest possible number of nodes we can have without sacrificing much at all?

amykglen commented 3 years ago

good question. :) I believe @timsyoon found that there are about 1.9 million isolated nodes in KG2c. those will of course never be returned in Translator queries, since they're not connected to anything. that's a good chunk right there we could probably get rid of with zero impact!
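As a rough illustration of what "isolated" means here, the sketch below finds nodes that never appear as the subject or object of any edge. The node list and edge tuples are hypothetical toy data, not the KG2c export format, and a node whose only edge is a self-loop would count as connected under this definition.

```python
def isolated_nodes(node_ids, edges):
    """Return the nodes that are not an endpoint of any (subject, object) edge."""
    connected = set()
    for subject, obj in edges:
        connected.add(subject)
        connected.add(obj)
    return set(node_ids) - connected


# Toy graph (hypothetical): "n3" and "n4" have no edges at all.
nodes = ["n1", "n2", "n3", "n4"]
edges = [("n1", "n2")]
print(sorted(isolated_nodes(nodes, edges)))
```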

edeutsch commented 3 years ago

of course on the flip side, one could argue that those are exactly the kind of nodes that we want to look for edges for. So that they become connected!

Just not the ones that are "ibuprofen 21 mg", "ibuprofen 37 mg" etc.

dkoslicki commented 3 years ago

1.6M^2 = 2.56 trillion shouldn't be too much ;) if we start with those at least and keep track of the hit rate, I think that could work. Just need more silicon to throw at the problem

chunyuma commented 3 years ago

> 1.6M^2 = 2.56 trillion shouldn't be too much

@dkoslicki, based on my investigation, computing 2.56 trillion pairs in parallel on our server would probably take 139 days, and even longer if we want to avoid affecting other users' jobs. We might need more computational resources.

dkoslicki commented 3 years ago

NCATS has provided us funds to do such large scale computations (and thankfully, as opposed to DTD, this database will rarely if ever need to be updated, and even then, the whole thing will not need to be updated, just new entries).

Let me know approximately how many core hours this would take, and I can see what ACI can do for us.

chunyuma commented 3 years ago

@dkoslicki, I basically used the same approach as I did for building the DTD probability database. For each of the 1.6M nodes, I submitted a job that calculates the NGD score between this node and all the other 1.6M nodes. Each job uses only one process, via the map function in Python (for some reason, I found that using map runs even faster than multiprocessing). Each job consumes:

User time (seconds): 99.06
System time (seconds): 17.46
Percent of CPU this job got: 107%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:48.18
Maximum resident set size (kbytes): 17342956 (~17GB)

So I can only run around 25 jobs at a time, which consumes around 400-500 GB of RAM. In theory, each job uses only one core and around 17 GB, but multiple jobs running at the same time on the same server might affect each other. So I think it would be better if we could get computational resources from ACI that can automatically assign jobs to different cores, each with ~17 GB of RAM available.

@dkoslicki, if you remember, we previously purchased a virtual cluster from ACI, but so far they haven't been able to help us resolve the job allocation problem, which means that we can't submit too many jobs at the same time. Let's say we submit 1000 jobs at the same time: it might cause some problems with job allocation for the vcores.

chunyuma commented 3 years ago

Let's assume the jobs don't affect each other. Each job takes around 2 minutes to calculate the NGD scores between one curie and the other 1.6M nodes. Since we have 1,672,684 nodes in total, we could finish all the computations in about a week (1,672,684 / (30 jobs per hour x 24 hours per day x 300 concurrent jobs) = 7.74 days) if we can run 300 jobs at the same time.
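The estimate above can be reproduced with a few lines of arithmetic. This sketch simply restates the numbers from this thread (about 2 minutes per job, 1,672,684 jobs, 300 concurrent jobs) and additionally assumes one core per job, which is roughly what the measured 107% CPU suggests; the function names are made up for illustration.

```python
def estimated_days(n_jobs, minutes_per_job, concurrent_jobs):
    """Wall-clock days, assuming jobs never slow each other down."""
    jobs_per_hour_per_slot = 60 / minutes_per_job
    jobs_per_day = jobs_per_hour_per_slot * 24 * concurrent_jobs
    return n_jobs / jobs_per_day


def estimated_core_hours(n_jobs, minutes_per_job, cores_per_job=1):
    """Total core-hours, assuming each job occupies `cores_per_job` cores."""
    return n_jobs * minutes_per_job / 60 * cores_per_job


days = estimated_days(1_672_684, minutes_per_job=2, concurrent_jobs=300)
core_hours = estimated_core_hours(1_672_684, minutes_per_job=2)
print(f"{days:.2f} days, {core_hours:,.0f} core-hours")
```

Under those assumptions this works out to about 7.74 days of wall-clock time and roughly 55,756 core-hours, with each concurrent job also needing about 17 GB of RAM.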

edeutsch commented 3 years ago

I'm thinking it would be sensible to try a small-scale experiment to see if the approach yields useful results before starting thousands of hours of computation. Is there already a pilot? Perhaps run just the MONDOs against all 20k Swiss-Prot reviewed proteins? Can we reproduce some known connections? Can we generate some plausible new ones that we would want to report?

chunyuma commented 3 years ago

> Perhaps run just the MONDOs against all 20k Swiss-Prot reviewed proteins?

Great idea, @edeutsch. I can have a try.

chunyuma commented 3 years ago

> Can we reproduce some known connections? Can we generate some plausible new ones that we would want to report?

Hi @edeutsch, I have already computed the ngd scores of all MONDOs against all 20k Swiss-Prot reviewed proteins. How can we know if it can reproduce some known connections? Or generate some plausible new ones? Is there a threshold to filter them for checking some known connections?

dkoslicki commented 3 years ago

Can you post a plot and some summary statistics of the NGDs that you calculated @chunyuma? That will help in determining what constitutes a meaningfully "small" NGD score
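As background for judging what counts as "small": NGD is commonly defined from occurrence counts as (max(log f(x), log f(y)) - log f(x,y)) / (log N - min(log f(x), log f(y))). The sketch below applies that formula to PMID sets. The function name, the choice to return no score when two curies share no PMIDs, and the corpus size passed as total_pmids are assumptions for illustration, not necessarily how the ARAX code does it.

```python
import math


def ngd(pmids_x: set, pmids_y: set, total_pmids: int):
    """Normalized Google distance from PMID co-occurrence, or None if undefined."""
    fx, fy = len(pmids_x), len(pmids_y)
    fxy = len(pmids_x & pmids_y)
    if fx == 0 or fy == 0 or fxy == 0:
        return None  # assumption: no shared publications means no valid score
    log_fx, log_fy = math.log(fx), math.log(fy)
    denominator = math.log(total_pmids) - min(log_fx, log_fy)
    if denominator <= 0:
        return None  # degenerate case: a term occurs in the whole corpus
    return (max(log_fx, log_fy) - math.log(fxy)) / denominator
```

Two curies with identical PMID sets get a score of 0, and the score grows as the overlap shrinks relative to the individual counts, so lower values indicate a stronger literature association.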

edeutsch commented 3 years ago

Agreed, and I also think that looking at a few examples would be useful.

Example 1: MONDO:0013989

{
   "edges": {
      "e00": {
         "subject":   "n00",
         "object":    "n01"
      }
   },
   "nodes": {
      "n00": {
         "ids":        ["MONDO:0013989"]
      },
      "n01": {
         "categories":  ["biolink:Protein"]
      }
   }
}

Current ARAX is returning 69 results, but only the top 5 have NGDs. The rest, no NGDs. What are the top 50 proteins for MONDO:0013989 based on your calculation? Do they overlap with the current answer?

2) Example of a current case where we have nothing: MONDO:0014001:

{
   "edges": {
      "e00": {
         "subject":   "n00",
         "object":    "n01"
      }
   },
   "nodes": {
      "n00": {
         "ids":        ["MONDO:0014001"]
      },
      "n01": {
         "categories":  ["biolink:Protein"]
      }
   }
}

This returns nothing. What are the top 50 NGD links from your computation? Are there any?
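One way to make this comparison, sketched under assumptions: the computed scores are available as (MONDO, protein, NGD) rows, and a lower NGD means a stronger association. The function names and the toy curies below are hypothetical.

```python
import heapq


def top_k_proteins(scores, mondo, k=50):
    """Return the k proteins with the lowest NGD for one MONDO curie, best first."""
    rows = [(ngd, protein) for m, protein, ngd in scores if m == mondo]
    return [protein for ngd, protein in heapq.nsmallest(k, rows)]


def overlap(ngd_top, arax_results):
    """Proteins from the NGD ranking that also appear in the ARAX answer."""
    arax = set(arax_results)
    return [protein for protein in ngd_top if protein in arax]


# Toy data only; real rows would come from the precomputed NGD table.
toy_scores = [
    ("MONDO:X", "UniProtKB:A", 0.30),
    ("MONDO:X", "UniProtKB:B", 0.12),
    ("MONDO:X", "UniProtKB:C", 0.21),
    ("MONDO:Y", "UniProtKB:A", 0.05),
]
top = top_k_proteins(toy_scores, "MONDO:X", k=2)
shared = overlap(top, ["UniProtKB:C", "UniProtKB:D"])
print(top, shared)
```

An empty result from top_k_proteins corresponds to the "are there any?" case: the MONDO curie has no valid NGD score with any protein.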

chunyuma commented 3 years ago

Thanks @dkoslicki and @edeutsch. Based on the curie_to_pmids_v1.0_KG2.6.3.sqlite database, there are 22,464 UniProtKB proteins and 13,689 MONDO curies in total.

Here are statistics of the NGD calculation: Only 11,743 MONDO curies have at least one valid ngd score. Only 20,467 UniProtKB curies have at least one valid ngd score.

count    4.156227e+07
mean     3.728171e-01
std      1.346087e-01
min      2.382131e-03
25%      2.755543e-01
50%      3.507656e-01
75%      4.456183e-01
max      1.204312e+00

Here is the distribution of all NGD scores for all MONDOs against all 20k Swiss-Prot reviewed proteins

Screen Shot 2021-06-07 at 2 16 35 PM

For example 1: MONDO:0013989, here are the top 50 proteins:

MONDO protein ngd_score
MONDO:0013989 UniProtKB:Q6UVM3 0.135096
MONDO:0013989 UniProtKB:Q15822 0.139354
MONDO:0013989 UniProtKB:Q9H936 0.156994
MONDO:0013989 UniProtKB:P17787 0.161801
MONDO:0013989 UniProtKB:Q9P2E7 0.175193
MONDO:0013989 UniProtKB:O43526 0.180483
MONDO:0013989 UniProtKB:O76039 0.184613
MONDO:0013989 UniProtKB:O43307 0.203182
MONDO:0013989 UniProtKB:P61764 0.203552
MONDO:0013989 UniProtKB:Q86Y07 0.207103
MONDO:0013989 UniProtKB:Q07699 0.215191
MONDO:0013989 UniProtKB:Q8N7X2 0.215684
MONDO:0013989 UniProtKB:Q96MP8 0.216447
MONDO:0013989 UniProtKB:Q5RIA9 0.218315
MONDO:0013989 UniProtKB:Q9H1X3 0.218315
MONDO:0013989 UniProtKB:Q9H2S1 0.218374
MONDO:0013989 UniProtKB:Q9P2G4 0.220513
MONDO:0013989 UniProtKB:Q96H35 0.220513
MONDO:0013989 UniProtKB:Q96MA6 0.222343
MONDO:0013989 UniProtKB:Q13303 0.223599
MONDO:0013989 UniProtKB:Q9NX38 0.223663
MONDO:0013989 UniProtKB:Q9BS92 0.224072
MONDO:0013989 UniProtKB:Q5VVW2 0.224072
MONDO:0013989 UniProtKB:Q3KQV9 0.224072
MONDO:0013989 UniProtKB:Q5JVG2 0.224072
MONDO:0013989 UniProtKB:Q96LW7 0.224072
MONDO:0013989 UniProtKB:Q86W47 0.225216
MONDO:0013989 UniProtKB:Q8NBV4 0.225563
MONDO:0013989 UniProtKB:Q5THR3 0.225563
MONDO:0013989 UniProtKB:O75121 0.225563
MONDO:0013989 UniProtKB:Q5VXU9 0.225563
MONDO:0013989 UniProtKB:Q6ZW05 0.226913
MONDO:0013989 UniProtKB:Q14929 0.226913
MONDO:0013989 UniProtKB:Q8N228 0.226913
MONDO:0013989 UniProtKB:Q96K62 0.226913
MONDO:0013989 UniProtKB:A2A3K4 0.226913
MONDO:0013989 UniProtKB:Q9P2F6 0.226913
MONDO:0013989 UniProtKB:Q6NUM6 0.226913
MONDO:0013989 UniProtKB:Q56UQ5 0.227683
MONDO:0013989 UniProtKB:Q8N0Z9 0.227683
MONDO:0013989 UniProtKB:Q6ZMW2 0.227683
MONDO:0013989 UniProtKB:Q96NJ1 0.227683
MONDO:0013989 UniProtKB:Q8NFD4 0.227683
MONDO:0013989 UniProtKB:Q6P2C0 0.227683
MONDO:0013989 UniProtKB:Q5T011 0.227914
MONDO:0013989 UniProtKB:Q5VTE6 0.228149
MONDO:0013989 UniProtKB:Q6PF06 0.228149
MONDO:0013989 UniProtKB:Q8N4T4 0.228149
MONDO:0013989 UniProtKB:Q9Y2H8 0.228149
MONDO:0013989 UniProtKB:Q6ZSA7 0.228149

Of the top 5 with NGDs returned by ARAX, only UniProtKB:Q6UVM3 is matched. For some reason, UniProtKB:P78508 and UniProtKB:Q9NS40 are not in the curie_to_pmids_v1.0_KG2.6.3.sqlite database. I guess ARAX is probably still using an older version of KG2 rather than 2.6.3.

For example 2: MONDO:0014001 also doesn't have ngd scores with any proteins. ARAX also reports the error "No paths were found in {'BTE', 'RTX-KG2'} satisfying qedge e00" when I ran:

{
   "edges": {
      "e00": {
         "subject":   "n00",
         "object":    "n01"
      }
   },
   "nodes": {
      "n00": {
         "ids":        ["MONDO:0014001"]
      },
      "n01": {
         "categories":  ["biolink:Protein"]
      }
   }
}
edeutsch commented 3 years ago

@chunyuma would you generate the histogram with 0.01 NGD score resolution?
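For reference, binning at 0.01 resolution can be sketched with the standard library alone. The function name and the sample values are illustrative, and this is not the code that produced the screenshots in this thread.

```python
import math
from collections import Counter


def ngd_histogram(scores, bin_width=0.01):
    """Count scores per bin; bin i covers [i * bin_width, (i + 1) * bin_width)."""
    counts = Counter()
    for score in scores:
        # round() guards against floating-point error for values on a bin edge
        counts[math.floor(round(score / bin_width, 9))] += 1
    return dict(sorted(counts.items()))


# Toy values only.
example = ngd_histogram([0.135096, 0.139354, 0.156994, 0.29])
for index, count in example.items():
    print(f"[{index * 0.01:.2f}, {(index + 1) * 0.01:.2f}): {count}")
```

Keeping integer bin indices as keys avoids accumulating float error in the bin labels; the labels are only formatted at print time.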

chunyuma commented 3 years ago

@edeutsch, here is the histogram with 0.01 resolution:

Screen Shot 2021-06-07 at 2 42 55 PM
edeutsch commented 3 years ago

For some reasons, UniProtKB:P78508 and UniProtKB:Q9NS40 are not in curie_to_pmids_v1.0_KG2.6.3.sqlite database. I guess probably ARAX is still using the old version of kg2 rather than 2.6.3.

ARAX is still using 2.5.2 since there are still too many issues with the 2.6.x series to deploy I think.

but I'm concerned about P78508. Are you saying that P78508 is not in KG2.6.3? Or there are no PMIDs associated with it?

Either way, this seems concerning and something we should follow up on? P78508 is a classic reviewed UniProtKB/Swiss-Prot protein, available since 1997 with many publications associated with it in UniProtKB. If we lost it, we should figure out why.

https://www.uniprot.org/uniprot/P78508

chunyuma commented 3 years ago

Are you saying that P78508 is not in KG2.6.3? Or there are no PMIDs associated with it?

I think v2.6.3 Nodesynonymizer clustered UniProtKB:P78508 with MONDO:0010134. And it seems like MONDO:0010134 also doesn't have PMIDs.

n.id n.category n.equivalent_curies n.publications
"MONDO:0010134" "biolink:Disease" ["CHEMBL.TARGET:CHEMBL2146348", "DOID:0060744", "ENSEMBL:ENSG00000091137", "ENSEMBL:ENSG00000168269", "ENSEMBL:ENSG00000177807", "HGNC:3815", "HGNC:6256", "HGNC:8818", "LOINC:LP35578-1", "MEDDRA:10080398", "MESH:C536648", "MONDO:0010134", "NCBIGene:2299", "NCBIGene:3766", "NCBIGene:5172", "NCIT:C121745", "OMIM:274600", "OMIM:601093", "OMIM:602208", "OMIM:605646", "ORPHANET:231422", "ORPHANET:705", "PR:000001979", "PR:000007625", "PR:P78508", "PR:Q12951", "REACT:R-HSA-425403", "REACT:R-HSA-5627850", "REACT:R-HSA-5627857", "REACT:R-HSA-5627860", "REACT:R-HSA-5627865", "REACT:R-HSA-5627873", "REACT:R-HSA-975290", "SNOMED:70348004", "UMLS:C0271829", "UMLS:C1414682", "UMLS:C1416577", "UMLS:C1418445", "UMLS:C3551785", "UniProtKB:O43511", "UniProtKB:P78508", "UniProtKB:Q12951"] ["2-r", "DOI:10.1001/jamaoto.2013.4185", "DOI:10.1002/(sici)1096-8628(20000103)90:1<38::aid-ajmg8>3.0.co", "DOI:10.1002/ajmg.a.20272", "DOI:10.1002/humu.1116", "DOI:10.1002/humu.1238", "DOI:10.1002/humu.20884", "DOI:10.1002/humu.23335", "DOI:10.1002/humu.9043", "DOI:10.1002/j.1460-2075.1994.tb06827.x"]
edeutsch commented 3 years ago

hmm, I suggest doing your experiment with KG2.5.2 because otherwise we will keep bumping into these KG2.6.x problems when we try to poke a little deeper. and it will be hard to compare what ARAX can currently produce to understand if we're getting an improvement.

chunyuma commented 3 years ago

ok, I can do it and should have results tomorrow or the day after tomorrow.

chunyuma commented 3 years ago

Based on the KG2.5.2 NGD database, there are 24,424 UniProtKB proteins and 11,732 MONDO curies in total.

Here are statistics of the NGD calculation: Only 9,375 MONDO curies have at least one valid ngd score. Only 22,136 UniProtKB curies have at least one valid ngd score.

count    1.936320e+07
mean     3.185257e-01
std      1.480961e-01
min      2.114451e-03
25%      2.111894e-01
50%      2.895769e-01
75%      3.901364e-01
max      1.222913e+00

Here is the distribution of all NGD scores for all MONDOs against all 20k Swiss-Prot reviewed proteins

Screen Shot 2021-06-07 at 10 23 24 PM

For example 1: MONDO:0013989, here are the top 50 proteins:

MONDO protein ngd_score
MONDO:0013989 UniProtKB:Q6UVM3 0.131265
MONDO:0013989 UniProtKB:Q96H35 0.161792
MONDO:0013989 UniProtKB:Q8N7X2 0.161792
MONDO:0013989 UniProtKB:Q5RIA9 0.161792
MONDO:0013989 UniProtKB:Q9H1X3 0.163765
MONDO:0013989 UniProtKB:Q8N9H8 0.165414
MONDO:0013989 UniProtKB:Q9P2G4 0.165414
MONDO:0013989 UniProtKB:Q96GE9 0.165414
MONDO:0013989 UniProtKB:Q8IYX7 0.165414
MONDO:0013989 UniProtKB:Q14929 0.165414
MONDO:0013989 UniProtKB:Q96J77 0.165414
MONDO:0013989 UniProtKB:Q9Y2H8 0.165414
MONDO:0013989 UniProtKB:Q86YN1 0.166834
MONDO:0013989 UniProtKB:Q8NE28 0.166834
MONDO:0013989 UniProtKB:Q5W0U4 0.166834
MONDO:0013989 UniProtKB:Q9UGQ2 0.166834
MONDO:0013989 UniProtKB:Q5VVW2 0.166834
MONDO:0013989 UniProtKB:Q9BS92 0.166834
MONDO:0013989 UniProtKB:Q9P2J8 0.166834
MONDO:0013989 UniProtKB:Q96E40 0.166834
MONDO:0013989 UniProtKB:Q5JVG2 0.166834
MONDO:0013989 UniProtKB:Q5VXU9 0.166834
MONDO:0013989 UniProtKB:A2A3K4 0.166834
MONDO:0013989 UniProtKB:Q9P2F6 0.166834
MONDO:0013989 UniProtKB:Q6PF06 0.166834
MONDO:0013989 UniProtKB:Q3KQV9 0.166834
MONDO:0013989 UniProtKB:Q8NBV4 0.166834
MONDO:0013989 UniProtKB:Q5T6V5 0.168084
MONDO:0013989 UniProtKB:Q86XA9 0.168084
MONDO:0013989 UniProtKB:Q8TF39 0.168084
MONDO:0013989 UniProtKB:Q9P2P1 0.168084
MONDO:0013989 UniProtKB:Q5TYW1 0.169202
MONDO:0013989 UniProtKB:Q96LW7 0.169202
MONDO:0013989 UniProtKB:Q9Y6Q3 0.169202
MONDO:0013989 UniProtKB:Q6IPU0 0.169202
MONDO:0013989 UniProtKB:Q8N4T4 0.169202
MONDO:0013989 UniProtKB:Q9NVG8 0.169202
MONDO:0013989 UniProtKB:Q5VST6 0.169202
MONDO:0013989 UniProtKB:Q8N5N7 0.169202
MONDO:0013989 UniProtKB:O94769 0.170215
MONDO:0013989 UniProtKB:Q96GR4 0.170215
MONDO:0013989 UniProtKB:Q9P2D6 0.170215
MONDO:0013989 UniProtKB:Q9P2N2 0.170215
MONDO:0013989 UniProtKB:Q8NCR6 0.170215
MONDO:0013989 UniProtKB:Q9NVS9 0.170739
MONDO:0013989 UniProtKB:Q4ADV7 0.171006
MONDO:0013989 UniProtKB:Q712K3 0.171142
MONDO:0013989 UniProtKB:Q8N539 0.171142
MONDO:0013989 UniProtKB:Q9Y614 0.171142
MONDO:0013989 UniProtKB:Q6ZV29 0.171142

We can match the top 3 proteins out of the top 5 with NGDs returned by ARAX.

For example 2: MONDO:0014001, it also doesn't have any ngd scores with any proteins.

edeutsch commented 3 years ago

Only 9,375 MONDO curies have at least one valid ngd score.

1) So of these 9375, can you determine how many of these have 0, 1, 2, 3+ known KG2C edges to proteins?

2) Of the 9375 MONDO curies that have 1 or 2 known KG2C edges to proteins, for how many (and which ones) does the NGD method reproduce them?

3) How many (and which) of the 9375 have 0 known KG2C edges to proteins?

4) Can you point to an example where this method finds a MONDO to UniProtKB NGD association that does not exist in KG2C, but that can be verified as reasonable by reading one of the implicated papers or by some other means? i.e., can you find an example that demonstrates that this approach really finds something valuable?

thanks!

chunyuma commented 3 years ago
  1. So of these 9375, can you determine how many of these have 0, 1, 2, 3+ known KG2C edges to proteins?

None of these 9375 has 0 known KG2C edges to proteins
765 out of 9375 have 1
667 out of 9375 have 2
5686 have 3+

  2. Of the 9375 MONDO curies that have 1 or 2 known KG2C edges to proteins, for how many (and which ones) does the NGD method reproduce them?

Of the 9375 MONDO curies that have 1 or 2 known KG2C edges to proteins, there are 1217 MONDO curies that have at least one MONDO-protein pair that is in KG2c and can be reproduced by the NGD method. For these MONDO curies, there are 1428 MONDO-protein pairs in total. There are too many to list here.

  3. How many (and which) of the 9375 have 0 known KG2C edges to proteins?

I guess you're asking how many of these 9375 have 0 known KG2C edges to proteins that the NGD method reproduces? Otherwise, it would be the same as question 1. There are 346 out of 9375 that have 0 known KG2C edges to proteins that the NGD method produces.

Here is the list of them:

['MONDO:0008824',
 'MONDO:0013493',
 'MONDO:0017449',
 'MONDO:0002411',
 'MONDO:0008117',
 'MONDO:0016032',
 'MONDO:0018543',
 'MONDO:0007361',
 'MONDO:0009309',
 'MONDO:0006688',
 'MONDO:0003197',
 'MONDO:0015675',
 'MONDO:0010657',
 'MONDO:0011818',
 'MONDO:0019214',
 'MONDO:0002027',
 'MONDO:0008537',
 'MONDO:0017426',
 'MONDO:0002158',
 'MONDO:0016368',
 'MONDO:0016242',
 'MONDO:0011842',
 'MONDO:0012237',
 'MONDO:0012081',
 'MONDO:0014226',
 'MONDO:0011224',
 'MONDO:0008269',
 'MONDO:0018448',
 'MONDO:0013714',
 'MONDO:0008482',
 'MONDO:0002967',
 'MONDO:0003147',
 'MONDO:0024519',
 'MONDO:0001053',
 'MONDO:0013385',
 'MONDO:0016567',
 'MONDO:0016707',
 'MONDO:0004848',
 'MONDO:0000115',
 'MONDO:0020124',
 'MONDO:0011313',
 'MONDO:0054698',
 'MONDO:0013573',
 'MONDO:0017593',
 'MONDO:0019807',
 'MONDO:0008148',
 'MONDO:0000754',
 'MONDO:0024463',
 'MONDO:0024456',
 'MONDO:0006696',
 'MONDO:0008990',
 'MONDO:0010020',
 'MONDO:0009433',
 'MONDO:0006821',
 'MONDO:0003633',
 'MONDO:0036591',
 'MONDO:0001235',
 'MONDO:0008679',
 'MONDO:0006008',
 'MONDO:0010367',
 'MONDO:0006771',
 'MONDO:0006850',
 'MONDO:0016991',
 'MONDO:0002523',
 'MONDO:0009368',
 'MONDO:0019725',
 'MONDO:0009970',
 'MONDO:0007001',
 'MONDO:0007636',
 'MONDO:0020204',
 'MONDO:0005743',
 'MONDO:0010780',
 'MONDO:0019371',
 'MONDO:0002839',
 'MONDO:0021804',
 'MONDO:0014711',
 'MONDO:0004112',
 'MONDO:0011870',
 'MONDO:0015009',
 'MONDO:0004666',
 'MONDO:0011342',
 'MONDO:0056795',
 'MONDO:0004845',
 'MONDO:0005640',
 'MONDO:0010997',
 'MONDO:0006616',
 'MONDO:0006996',
 'MONDO:0017304',
 'MONDO:0020542',
 'MONDO:0007122',
 'MONDO:0004633',
 'MONDO:0004866',
 'MONDO:0008939',
 'MONDO:0009588',
 'MONDO:0011018',
 'MONDO:0013343',
 'MONDO:0020381',
 'MONDO:0004672',
 'MONDO:0007723',
 'MONDO:0005731',
 'MONDO:0002920',
 'MONDO:0011162',
 'MONDO:0005624',
 'MONDO:0021169',
 'MONDO:0001074',
 'MONDO:0002688',
 'MONDO:0019078',
 'MONDO:0001404',
 'MONDO:0015304',
 'MONDO:0016979',
 'MONDO:0021020',
 'MONDO:0012173',
 'MONDO:0018456',
 'MONDO:0015053',
 'MONDO:0009054',
 'MONDO:0003741',
 'MONDO:0007781',
 'MONDO:0021366',
 'MONDO:0015522',
 'MONDO:0011891',
 'MONDO:0013099',
 'MONDO:0010302',
 'MONDO:0003182',
 'MONDO:0016426',
 'MONDO:0007105',
 'MONDO:0007543',
 'MONDO:0007662',
 'MONDO:0000741',
 'MONDO:0018631',
 'MONDO:0032644',
 'MONDO:0001797',
 'MONDO:0018466',
 'MONDO:0008547',
 'MONDO:0005909',
 'MONDO:0019497',
 'MONDO:0017160',
 'MONDO:0018170',
 'MONDO:0006534',
 'MONDO:0008263',
 'MONDO:0005460',
 'MONDO:0011462',
 'MONDO:0010490',
 'MONDO:0012731',
 'MONDO:0020298',
 'MONDO:0013400',
 'MONDO:0020300',
 'MONDO:0003701',
 'MONDO:0011806',
 'MONDO:0006481',
 'MONDO:0010571',
 'MONDO:0020944',
 'MONDO:0001854',
 'MONDO:0000750',
 'MONDO:0008292',
 'MONDO:0015048',
 'MONDO:0009537',
 'MONDO:0020507',
 'MONDO:0020713',
 'MONDO:0012277',
 'MONDO:0011907',
 'MONDO:0014084',
 'MONDO:0013843',
 'MONDO:0014070',
 'MONDO:0006629',
 'MONDO:0007878',
 'MONDO:0014937',
 'MONDO:0009595',
 'MONDO:0006891',
 'MONDO:0000426',
 'MONDO:0012651',
 'MONDO:0019448',
 'MONDO:0001935',
 'MONDO:0014684',
 'MONDO:0019967',
 'MONDO:0019780',
 'MONDO:0008954',
 'MONDO:0007709',
 'MONDO:0007798',
 'MONDO:0018214',
 'MONDO:0005969',
 'MONDO:0008953',
 'MONDO:0011452',
 'MONDO:0007796',
 'MONDO:0019951',
 'MONDO:0019642',
 'MONDO:0011921',
 'MONDO:0013150',
 'MONDO:0005667',
 'MONDO:0012157',
 'MONDO:0009415',
 'MONDO:0004139',
 'MONDO:0018690',
 'MONDO:0024610',
 'MONDO:0001431',
 'MONDO:0045019',
 'MONDO:0009624',
 'MONDO:0021839',
 'MONDO:0007791',
 'MONDO:0006605',
 'MONDO:0008593',
 'MONDO:0005945',
 'MONDO:0020366',
 'MONDO:0007415',
 'MONDO:0004349',
 'MONDO:0013577',
 'MONDO:0060690',
 'MONDO:0001834',
 'MONDO:0007722',
 'MONDO:0001600',
 'MONDO:0011413',
 'MONDO:0004638',
 'MONDO:0019374',
 'MONDO:0005910',
 'MONDO:0011546',
 'MONDO:0014219',
 'MONDO:0012497',
 'MONDO:0007946',
 'MONDO:0017825',
 'MONDO:0008693',
 'MONDO:0015273',
 'MONDO:0007454',
 'MONDO:0002354',
 'MONDO:0011866',
 'MONDO:0001915',
 'MONDO:0008666',
 'MONDO:0011139',
 'MONDO:0011374',
 'MONDO:0008637',
 'MONDO:0000859',
 'MONDO:0002102',
 'MONDO:0011932',
 'MONDO:0006995',
 'MONDO:0018045',
 'MONDO:0013288',
 'MONDO:0020352',
 'MONDO:0006711',
 'MONDO:0010142',
 'MONDO:0012611',
 'MONDO:0014255',
 'MONDO:0005753',
 'MONDO:0000966',
 'MONDO:0018198',
 'MONDO:0008334',
 'MONDO:0015748',
 'MONDO:0019804',
 'MONDO:0016418',
 'MONDO:0009870',
 'MONDO:0019677',
 'MONDO:0001479',
 'MONDO:0009728',
 'MONDO:0008332',
 'MONDO:0008722',
 'MONDO:0007990',
 'MONDO:0044768',
 'MONDO:0001801',
 'MONDO:0020356',
 'MONDO:0009424',
 'MONDO:0006447',
 'MONDO:0008102',
 'MONDO:0060593',
 'MONDO:0000158',
 'MONDO:0008105',
 'MONDO:0001830',
 'MONDO:0014178',
 'MONDO:0007867',
 'MONDO:0014592',
 'MONDO:0016256',
 'MONDO:0007377',
 'MONDO:0018604',
 'MONDO:0020843',
 'MONDO:0010704',
 'MONDO:0021941',
 'MONDO:0016489',
 'MONDO:0017086',
 'MONDO:0005190',
 'MONDO:0009953',
 'MONDO:0008230',
 'MONDO:0013049',
 'MONDO:0019758',
 'MONDO:0013781',
 'MONDO:0009428',
 'MONDO:0008371',
 'MONDO:0014984',
 'MONDO:0014945',
 'MONDO:0004838',
 'MONDO:0007007',
 'MONDO:0008004',
 'MONDO:0016225',
 'MONDO:0022963',
 'MONDO:0010779',
 'MONDO:0005829',
 'MONDO:0010149',
 'MONDO:0016557',
 'MONDO:0015275',
 'MONDO:0006986',
 'MONDO:0002962',
 'MONDO:0015128',
 'MONDO:0016001',
 'MONDO:0002332',
 'MONDO:0018597',
 'MONDO:0022236',
 'MONDO:0013930',
 'MONDO:0013824',
 'MONDO:0018521',
 'MONDO:0024337',
 'MONDO:0007677',
 'MONDO:0005912',
 'MONDO:0018784',
 'MONDO:0007741',
 'MONDO:0005787',
 'MONDO:0008635',
 'MONDO:0017413',
 'MONDO:0001301',
 'MONDO:0043310',
 'MONDO:0024546',
 'MONDO:0010137',
 'MONDO:0009628',
 'MONDO:0009733',
 'MONDO:0012839',
 'MONDO:0003127',
 'MONDO:0006569',
 'MONDO:0005774',
 'MONDO:0012399',
 'MONDO:0001297',
 'MONDO:0008705',
 'MONDO:0008736',
 'MONDO:0014160',
 'MONDO:0010880',
 'MONDO:0006638',
 'MONDO:0010884',
 'MONDO:0012723',
 'MONDO:0006766',
 'MONDO:0021334',
 'MONDO:0013564',
 'MONDO:0020722',
 'MONDO:0005757',
 'MONDO:0002516',
 'MONDO:0044740',
 'MONDO:0015534',
 'MONDO:0002968',
 'MONDO:0017560',
 'MONDO:0004348',
 'MONDO:0007338',
 'MONDO:0008180',
 'MONDO:0021140',
 'MONDO:0015016']
  4. Can you point to an example where this method finds a MONDO to UniProtKB NGD association that does not exist in KG2C, but that can be verified as reasonable by reading one of the implicated papers or by some other means? i.e., can you find an example that demonstrates that this approach really finds something valuable?

This might need more time to investigate.

edeutsch commented 3 years ago

great, thanks, this looks promising!

    So of these 9375, can you determine how many of these have 0, 1, 2, 3+ known KG2C edges to proteins?

None of these 9375 has 0 known KG2C edges to proteins
765 out of 9375 have 1
667 out of 9375 have 2
5686 have 3+

hmm, but 765 + 667 + 5686 = 7118. Where are the other (9375 - 7118) = 2257?

chunyuma commented 3 years ago

@edeutsch, sorry, my mistake. The Cypher query didn't return the MONDO curies with 0 known connected proteins in KG2c. So the remaining 2257 don't have any known KG2C edges to proteins.
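A plain pattern match in Cypher only returns rows for nodes that have at least one matching edge, so a zero bucket has to be added explicitly. A minimal sketch of the bucketing done on the Python side instead, where every MONDO curie starts at zero (the function name and toy data are hypothetical):

```python
from collections import Counter


def bucket_by_edge_count(mondo_curies, kg2c_pairs):
    """Bucket MONDO curies by how many distinct proteins they connect to in KG2c."""
    proteins_per_mondo = {mondo: set() for mondo in mondo_curies}  # everyone starts at 0
    for mondo, protein in kg2c_pairs:
        if mondo in proteins_per_mondo:
            proteins_per_mondo[mondo].add(protein)
    buckets = Counter()
    for proteins in proteins_per_mondo.values():
        n = len(proteins)
        buckets["3+" if n >= 3 else str(n)] += 1
    return buckets


# Toy data only: MONDO:C has no edges and must still be counted.
buckets = bucket_by_edge_count(
    ["MONDO:A", "MONDO:B", "MONDO:C"],
    [("MONDO:A", "UniProtKB:P1"), ("MONDO:B", "UniProtKB:P1"), ("MONDO:B", "UniProtKB:P2")],
)
assert sum(buckets.values()) == 3  # buckets must add up to the number of curies
```

The same sanity check applies to the corrected numbers in this thread: 2257 + 765 + 667 + 5686 = 9375.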

edeutsch commented 3 years ago

ah, yes, that seems closer to what I expected. 0 would have been (was) very surprising.

So then question 3 is still relevant. The first part of the answer is 2257. The second part is which ones (I suppose that's a very long list). So more importantly, can you find examples in the 2257 that are demonstrably good? Or demonstrably bad? I suppose it would be useful to pick ~5 at random, and examine them carefully by looking at the returned PMIDs. Are the NGD results A: good, B: apparently bad, or C: can't tell?

chunyuma commented 3 years ago

@edeutsch, I think some of them still have many PMIDs. I randomly picked 50 and here is a table summarizing their PMIDs:

curie pmids num_pmids
MONDO:0024503 [28490723, 29970437, 30373866, 30595757, 23091... 7
MONDO:0002433 [23240704, 13795331, 9232390, 7700489, 9773066... 3955
MONDO:0017385 [30525185, 31054119, 25524840, 29037447, 26784... 12
MONDO:0016668 [27423233, 3750401, 871430, 23218697, 7573002,... 195
MONDO:0019079 [31516794, 25497877, 30061431] 3
MONDO:0016736 [25432191, 30783393, 28904578, 31250151, 23163... 16
MONDO:0007426 [9164800, 7632899, 23681028, 8534023, 19976200... 927
MONDO:0009105 [31132033, 29383842, 26526116, 27050310, 28944... 9
MONDO:0015986 [23573507, 7198213, 27123211, 3920396, 7304721... 147
MONDO:0024674 [4414979, 14290436, 7303684, 28913160, 2963457... 154
MONDO:0008641 [29322432, 30627749, 29386495] 3
MONDO:0005949 [18821121, 21450753, 23955460, 16918534, 17147... 1090
MONDO:0002518 [25351203, 20446223, 24106418, 9244117, 294947... 6
MONDO:0016362 [29169633, 22933892, 27753051, 23891684] 4
MONDO:0000367 [21749764, 23240712, 16236553, 20643854, 20807... 2167
MONDO:0003043 [27588097, 28258179, 8774664, 24716041, 106269... 58
MONDO:0016680 [28120069, 20975976, 30604394, 29760590, 27578... 8
MONDO:0016595 [29263873, 26903555, 25454087, 13319688, 22291... 200
MONDO:0022513 [24636648, 16476392, 28389037, 30847185, 81134... 6
MONDO:0043218 [27465216, 31438759, 30323116, 30797209, 31060... 5
MONDO:0018755 [29799424, 27912864, 28655685, 25984198, 29657... 11
MONDO:0007861 [7888390, 9678474, 9014285, 11506318, 19116567... 25
MONDO:0018892 [2598528, 30451208, 2191247, 17182351, 1157569... 37
MONDO:0017790 [28275686, 29968043, 31594255, 29103540, 26363... 6
MONDO:0006271 [8272897, 27990273, 9083523, 26046099, 2704437... 34
MONDO:0006607 [14237705, 13457422, 13457423, 13076496, 10586... 377
MONDO:0019498 [12340744, 29264904, 28293130, 19831312, 93646... 148
MONDO:0044354 [26622464, 23818241, 30034819, 23955459, 27142... 82
MONDO:0006922 [8421890, 15297539, 21964802, 25416710, 447898... 315
MONDO:0043988 [15334402, 14719363, 23342212, 10980741, 21467... 54
MONDO:0029001 [27366016, 27307137, 28578820, 23966726, 27528... 49
MONDO:0011324 [28988429, 20949527] 2
MONDO:0018212 [25500256] 1
MONDO:0021017 [31091456, 27918210, 30185603, 28024462, 26667... 43
MONDO:0021060 [25487361, 31825160, 23379592, 30693642, 30041... 69
MONDO:0003209 [28506304, 25861345, 26581569, 17721187, 28888... 19
MONDO:0009493 [30863896, 23954873, 6425460, 23954222] 4
MONDO:0018058 [28178944, 24113157, 10549768, 23901199, 22470... 85
MONDO:0005762 [18393603, 9893380, 9893381, 9893382, 9893383,... 198
MONDO:0001594 [10205697, 2351626, 10036748, 17409040, 192993... 164
MONDO:0006455 [28168064, 10779032, 18480395, 25979154, 18804... 7
MONDO:0003125 [31147264, 22525408, 24518791, 20850634, 24755... 10
MONDO:0002168 [13669698, 9762889, 16568146, 2323283, 2371225... 6
MONDO:0004088 [31393622, 20923443, 30695899, 24464879] 4
MONDO:0004491 [2389106, 16720931] 2
MONDO:0003928 [26866354] 1
MONDO:0003051 [9653909] 1
MONDO:0011596 [16912508] 1
MONDO:0000763 [19236704] 1
MONDO:0013446 [11283794] 1

And it seems like some of them have pretty good NGD results:

MONDO protein ngd_score
MONDO:0009105 UniProtKB:Q6PGP7 0.115045
MONDO:0009493 UniProtKB:P03901 0.166939
MONDO:0000763 UniProtKB:Q9GZX3 0.185350
MONDO:0009105 UniProtKB:Q9BYG5 0.190188
MONDO:0021060 UniProtKB:Q15814 0.197264
... ... ...
MONDO:0002433 UniProtKB:Q07812 0.867421
MONDO:0007426 UniProtKB:P01375 0.890680
MONDO:0002433 UniProtKB:P15692 0.894198
MONDO:0002433 UniProtKB:P05019 0.915273
MONDO:0002433 UniProtKB:P29474 0.925709
finnagin commented 3 years ago

Spot checking that first pair on the bottom table with an ngd of 0.115045 looks good:

https://www.pombase.org/term/MONDO:0009105 - tricho-hepato-enteric syndrome
https://www.uniprot.org/uniprot/Q6PGP7 - Tricho-hepatic-enteric syndrome protein

the protein name contains the name of the syndrome

edeutsch commented 3 years ago

Thanks @chunyuma @finnagin I think this is showing great promise. I think it would be useful to pick a few more such cases at random and document here whether they seem likely good or not. @finnagin 's example above is clearly very good. Would you spot check a few more?

For the one checked above, there are also 21 results found by ARAX that I think are from BTE: https://arax.ncats.io/?r=12764 so that seems verified. But not novel (maybe novel from KG2, but not novel from BTE).

It would be fun to find an example for which we cannot find any known edges in KG2 or BTE but yet we find one or more NGD associations that can be verified as being plausible by looking at the publications

Assuming that pans out well, then I think we have the evidence to go forward with finishing the implementation in Expand(). How would we implement it? I'm thinking it would be useful to store the top 50 hits between every MONDO:x and UniProtKB:x identifier in a SQLite database and then figure out how to query it as a data source.
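A minimal sketch of what such a store could look like, using Python's built-in sqlite3 module. The table name, column names, and function names are assumptions for illustration; nothing here reflects an existing ARAX schema.

```python
import sqlite3


def build_top_hits_db(conn, scores, k=50):
    """Store the k lowest-NGD objects for each subject curie."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ngd_top_hits (
               subject_curie TEXT NOT NULL,
               object_curie  TEXT NOT NULL,
               ngd           REAL NOT NULL,
               PRIMARY KEY (subject_curie, object_curie))"""
    )
    by_subject = {}
    for subject, obj, ngd in scores:
        by_subject.setdefault(subject, []).append((ngd, obj))
    for subject, hits in by_subject.items():
        rows = [(subject, obj, ngd) for ngd, obj in sorted(hits)[:k]]
        conn.executemany("INSERT OR REPLACE INTO ngd_top_hits VALUES (?, ?, ?)", rows)
    conn.commit()


def query_top_hits(conn, subject, max_ngd=None):
    """Return (object_curie, ngd) rows for one subject, best first."""
    sql = "SELECT object_curie, ngd FROM ngd_top_hits WHERE subject_curie = ?"
    params = [subject]
    if max_ngd is not None:
        sql += " AND ngd <= ?"
        params.append(max_ngd)
    return conn.execute(sql + " ORDER BY ngd", params).fetchall()


# Two pairs reported earlier in this thread, loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
build_top_hits_db(conn, [
    ("MONDO:0009105", "UniProtKB:Q6PGP7", 0.115045),
    ("MONDO:0009105", "UniProtKB:Q9BYG5", 0.190188),
])
print(query_top_hits(conn, "MONDO:0009105", max_ngd=0.15))
```

The primary key on (subject_curie, object_curie) gives an index for lookups by subject; a lookup in the other direction (protein to MONDO) would need a second index or a mirrored set of rows.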

Would we always call it as another data source? Or would we only call it if we came up dry after checking "real" data sources? I don't know, but let's try stuff!

While we're doing that, it would be great to try to expand the set to every MONDO:x to DRUGBANK:x identifier. And maybe DRUGBANK:x to UniProtKB:x. And then again find a few examples of node pairs that we could not previously connect at all that come up with apparently good hits and validate them.

thanks!