Open edeutsch opened 3 years ago
@chunyuma is this anything you would be interested in? It’s similar in spirit to your DTD database/expander
Thanks @dkoslicki and @edeutsch, I'm quite interested in this, and I think it would also help the explainable DTD model by reducing the memory load when the size of KG2c is reduced.
One concern I have: would this affect other databases (e.g. DTD), given that we only consider the most important nodes? Some drugs or diseases are rare and might not be in the list of most important nodes.
I don’t think this would affect DTD: the idea of “important nodes” is, I think, just to reduce the total number of pairs that need to be computed. @edeutsch can clarify if he was thinking otherwise, but basically we would not be removing nodes from KG2/KG2C; rather, we would compute NGD on a subset of KG2/KG2C, store the values in a database, and use that database intelligently as an expander.
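For readers unfamiliar with the metric, here is a minimal sketch of how an NGD (Normalized Google Distance) value could be computed for one pair of curies from their PMID sets. It uses the standard NGD formula; the function name `ngd`, the `total_docs` corpus size, and the choice to return `None` for pairs that never co-occur are assumptions for illustration, not necessarily what ARAX's fastNGD code does.

```python
import math

def ngd(pmids_a, pmids_b, total_docs):
    """Normalized Google Distance between two concepts, given their PMID sets.

    total_docs is the assumed number of documents in the corpus (hypothetical).
    Returns None when a concept has no PMIDs or the pair never co-occurs,
    since the distance is undefined (infinite) in those cases.
    """
    n_a, n_b = len(pmids_a), len(pmids_b)
    n_ab = len(pmids_a & pmids_b)  # abstracts mentioning both concepts
    if n_a == 0 or n_b == 0 or n_ab == 0:
        return None
    log_a, log_b = math.log(n_a), math.log(n_b)
    denominator = math.log(total_docs) - min(log_a, log_b)
    if denominator <= 0:
        return None
    return (max(log_a, log_b) - math.log(n_ab)) / denominator

# Toy PMID sets, invented for illustration only.
print(ngd({1, 2, 3, 4}, {3, 4}, total_docs=100))
```

Precomputing this for every pair in a chosen subset and storing only the pairs with a valid score is the kind of database the comment above describes.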
Ah, I see! Thanks @dkoslicki
Yes, that's correct. I think in our 6 million nodes the vast majority will never appear in a query and I think we can ignore for a first attempt at this. For example, do a search for ibuprofen and RXNORM in our KG and you get:
RXNORM:368840 Ibuprofen Oral Tablet [Genpril] biolink:Drug
RXNORM:368823 Ibuprofen Oral Tablet [Ibu] biolink:Drug
RXNORM:637192 Ibuprofen 10 MG/ML biolink:Drug
RXNORM:637195 Ibuprofen 10 MG/ML [Neoprofen] biolink:Drug
RXNORM:1722333 Ibuprofen 200 MG / Phenylephrine Hydrochloride 10 MG Oral Tablet [Advil Sinus Congestion and Pain] biolink:Drug
RXNORM:1722329 Ibuprofen 200 MG / Phenylephrine Hydrochloride 10 MG [Advil Sinus Congestion and Pain] biolink:Drug
RXNORM:1722330 Ibuprofen / Phenylephrine Oral Tablet [Advil Sinus Congestion and Pain] biolink:Drug
RXNORM:373693 Ibuprofen / Pseudoephedrine Oral Capsule biolink:Drug
RXNORM:314047 Ibuprofen 50 MG Chewable Tablet biolink:Drug
RXNORM:577191 Chlorpheniramine / Ibuprofen / Pseudoephedrine Oral Suspension biolink:Drug
RXNORM:1300267 Ibuprofen 200 MG Oral Tablet [Proprinal] biolink:Drug
RXNORM:142102 Ibuprofen 50 MG/ML Topical Spray biolink:Drug
RXNORM:372455 Codeine / Ibuprofen Oral Tablet biolink:Drug
RXNORM:372456 Codeine / Ibuprofen Extended Release Oral Tablet biolink:Drug
RXNORM:372449 Ibuprofen Extended Release Oral Tablet biolink:Drug
RXNORM:201126 Ibuprofen 200 MG Oral Tablet [Motrin] biolink:Drug
RXNORM:36761 ibuprofen lysine biolink:Drug
RXNORM:484259 Ibuprofen / Oxycodone biolink:Drug
RXNORM:1300263 Ibuprofen 200 MG [Proprinal] biolink:Drug
RXNORM:1300264 Ibuprofen Oral Tablet [Proprinal] biolink:Drug
RXNORM:392668 Ibuprofen 0.05 MG/MG / LEVOMENTHOL 0.03 MG/MG Topical Gel biolink:Drug
RXNORM:392617 Ibuprofen / Menthol biolink:Drug
RXNORM:1297369 Chlorpheniramine Maleate 0.2 MG/ML / Ibuprofen 20 MG/ML / Pseudoephedrine Hydrochloride 3 MG/ML Oral Suspension biolink:Drug
RXNORM:380819 Ibuprofen Topical Foam biolink:Drug
RXNORM:1297390 Chlorpheniramine Maleate 2 MG / Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG Oral Tablet biolink:Drug
RXNORM:367939 Ibuprofen / Pseudoephedrine Oral Tablet [Advil Cold and Sinus] biolink:Drug
RXNORM:333683 Ibuprofen 40 MG/ML biolink:Drug
RXNORM:1295502 Ibuprofen Chewable Product biolink:Drug
RXNORM:1090449 Ibuprofen / Pseudoephedrine Oral Tablet [Wal-Profen Cold and Sinus] biolink:Drug
RXNORM:1158493 Famotidine / Ibuprofen Oral Product biolink:Drug
RXNORM:5640 Ibuprofen biolink:Drug
RXNORM:202098 Ibuprofen 800 MG Oral Tablet [Motrin] biolink:Drug
RXNORM:643059 Diphenhydramine / Ibuprofen Oral Tablet biolink:Drug
RXNORM:637197 2 ML Ibuprofen 10 MG/ML Injection [Neoprofen] biolink:Drug
RXNORM:544393 Ibuprofen 20 MG/ML Oral Suspension [Motrin] biolink:Drug
RXNORM:544391 Ibuprofen 20 MG/ML [Motrin] biolink:Drug
RXNORM:544392 Ibuprofen Oral Suspension [Motrin] biolink:Drug
RXNORM:1007410 Carisoprodol / Ibuprofen biolink:Drug
RXNORM:1007329 Ibuprofen / Phenylephrine biolink:Drug
RXNORM:1007373 Ibuprofen / Vitamin B 12 biolink:Drug
RXNORM:1007917 Hydroxocobalamin / Ibuprofen biolink:Drug
RXNORM:1007482 Ibuprofen / Lidocaine biolink:Drug
RXNORM:1369775 Ibuprofen 200 MG / Phenylephrine Hydrochloride 10 MG Oral Tablet biolink:Drug
RXNORM:1007823 cyclonium / Ibuprofen biolink:Drug
RXNORM:2045474 Ibuprofen Oral Tablet [Dragon Tabs] biolink:Drug
RXNORM:2045473 Ibuprofen 200 MG [Dragon Tabs] biolink:Drug
RXNORM:2045477 Ibuprofen 200 MG Oral Tablet [Dragon Tabs] biolink:Drug
RXNORM:567707 Ibuprofen 400 MG [Ibu] biolink:Drug
RXNORM:567715 Ibuprofen 600 MG [Ibu] biolink:Drug
RXNORM:567719 Ibuprofen 800 MG [Ibu] biolink:Drug
RXNORM:1299021 Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG Oral Tablet biolink:Drug
RXNORM:1299022 Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG Oral Tablet [Advil Cold and Sinus] biolink:Drug
RXNORM:1299020 Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG Oral Capsule [Advil Cold and Sinus] biolink:Drug
RXNORM:1299018 Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG Oral Capsule biolink:Drug
RXNORM:1299019 Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG [Advil Cold and Sinus] biolink:Drug
RXNORM:643063 Diphenhydramine / Ibuprofen Oral Tablet [Advil PM] biolink:Drug
RXNORM:814985 Ibuprofen / Tolperisone biolink:Drug
RXNORM:643100 Ibuprofen 200 MG [Wal-Profen] biolink:Drug
RXNORM:643101 Ibuprofen Oral Tablet [Wal-Profen] biolink:Drug
RXNORM:643102 Ibuprofen 200 MG Oral Tablet [Wal-Profen] biolink:Drug
RXNORM:393432 Ibuprofen 0.1 MG/MG biolink:Drug
RXNORM:393550 Ibuprofen / LEVOMENTHOL Topical Gel biolink:Drug
RXNORM:368308 Hydrocodone / Ibuprofen Oral Tablet [Vicoprofen] biolink:Drug
RXNORM:1159018 Famotidine / Ibuprofen Pill biolink:Drug
RXNORM:565689 Ibuprofen 200 MG [Motrin] biolink:Drug
RXNORM:854761 Ibuprofen 40 MG/ML [Motrin] biolink:Drug
RXNORM:854762 Ibuprofen 40 MG/ML Oral Suspension [Motrin] biolink:Drug
RXNORM:795911 Ibuprofen / Pseudoephedrine Oral Capsule [Advil Cold and Sinus] biolink:Drug
RXNORM:335000 Ibuprofen 50 MG/ML biolink:Drug
RXNORM:645634 Diphenhydramine / Ibuprofen Oral Capsule biolink:Drug
RXNORM:1429044 Ibuprofen, Sodium Salt biolink:Drug
RXNORM:565143 Ibuprofen 200 MG [Advil] biolink:Drug
RXNORM:854183 8 ML Ibuprofen 100 MG/ML Injection biolink:Drug
RXNORM:854182 Ibuprofen 100 MG/ML biolink:Drug
RXNORM:854185 Ibuprofen 100 MG/ML [Caldolor] biolink:Drug
RXNORM:854187 8 ML Ibuprofen 100 MG/ML Injection [Caldolor] biolink:Drug
RXNORM:197803 Ibuprofen 20 MG/ML Oral Suspension biolink:Drug
RXNORM:197806 Ibuprofen 600 MG Oral Tablet biolink:Drug
RXNORM:197805 Ibuprofen 400 MG Oral Tablet biolink:Drug
RXNORM:197807 Ibuprofen 800 MG Oral Tablet biolink:Drug
RXNORM:993798 Ibuprofen / Phenylephrine Oral Tablet biolink:Drug
RXNORM:566095 Ibuprofen 800 MG [Motrin] biolink:Drug
RXNORM:1008079 homatropine / Ibuprofen biolink:Drug
RXNORM:820465 Carisoprodol / Dexamethasone / Ibuprofen biolink:Drug
RXNORM:1008170 Ibuprofen / Niacin biolink:Drug
RXNORM:380845 Ibuprofen 0.05 MG/MG biolink:Drug
RXNORM:198405 Ibuprofen 100 MG Oral Tablet biolink:Drug
RXNORM:380813 Ibuprofen 300 MG Extended Release Oral Capsule biolink:Drug
RXNORM:380812 Ibuprofen Extended Release Oral Capsule biolink:Drug
RXNORM:821036 Chlorzoxazone / Ibuprofen biolink:Drug
RXNORM:1008502 Ibuprofen / pseudoisocytidine biolink:Drug
RXNORM:1165305 Ibuprofen / Oxycodone Oral Product biolink:Drug
RXNORM:1165307 Ibuprofen / Phenylephrine Oral Product biolink:Drug
RXNORM:1165306 Ibuprofen / Oxycodone Pill biolink:Drug
RXNORM:1165309 Ibuprofen / Pseudoephedrine Oral Liquid Product biolink:Drug
RXNORM:1165308 Ibuprofen / Phenylephrine Pill biolink:Drug
RXNORM:1165310 Ibuprofen / Pseudoephedrine Oral Product biolink:Drug
RXNORM:1165311 Ibuprofen / Pseudoephedrine Pill biolink:Drug
RXNORM:316074 Ibuprofen 200 MG biolink:Drug
RXNORM:316073 Ibuprofen 20 MG/ML biolink:Drug
RXNORM:316076 Ibuprofen 50 MG biolink:Drug
RXNORM:316075 Ibuprofen 300 MG biolink:Drug
RXNORM:316078 Ibuprofen 800 MG biolink:Drug
RXNORM:316077 Ibuprofen 600 MG biolink:Drug
RXNORM:316072 Ibuprofen 100 MG biolink:Drug
RXNORM:1008440 Ibuprofen / Scopolamine biolink:Drug
RXNORM:1165299 Ibuprofen / LEVOMENTHOL Topical Product biolink:Drug
RXNORM:1940584 Ibuprofen / Phenylephrine Oral Tablet [Wal-Profen Congestion Relief and Pain] biolink:Drug
RXNORM:1940583 Ibuprofen 200 MG / Phenylephrine Hydrochloride 10 MG [Wal-Profen Congestion Relief and Pain] biolink:Drug
RXNORM:1940587 Ibuprofen 200 MG / Phenylephrine Hydrochloride 10 MG Oral Tablet [Wal-Profen Congestion Relief and Pain] biolink:Drug
RXNORM:389244 Ibuprofen 0.1 MG/MG Topical Gel biolink:Drug
RXNORM:644895 Diphenhydramine / Ibuprofen biolink:Drug
RXNORM:900434 Ibuprofen 200 MG Oral Tablet [Addaprin] biolink:Drug
RXNORM:900433 Ibuprofen Oral Tablet [Addaprin] biolink:Drug
RXNORM:900432 Ibuprofen 200 MG [Addaprin] biolink:Drug
RXNORM:644386 Ibuprofen 200 MG Oral Capsule [Wal-Profen] biolink:Drug
RXNORM:644385 Ibuprofen Oral Capsule [Wal-Profen] biolink:Drug
RXNORM:93574 Ibuprofen Oral Tablet [Nuprin] biolink:Drug
RXNORM:1009128 Caffeine / Ergotamine / Ibuprofen biolink:Drug
RXNORM:1009037 Ibuprofen / Methocarbamol biolink:Drug
RXNORM:204442 Ibuprofen 40 MG/ML Oral Suspension biolink:Drug
RXNORM:1152222 Diphenhydramine / Ibuprofen Oral Product biolink:Drug
RXNORM:1152223 Diphenhydramine / Ibuprofen Pill biolink:Drug
RXNORM:606989 Ibuprofen Oral Capsule [Motrin] biolink:Drug
RXNORM:606990 Ibuprofen 200 MG Oral Capsule [Motrin] biolink:Drug
RXNORM:317388 Ibuprofen 400 MG biolink:Drug
RXNORM:724134 Hydrocodone / Ibuprofen Oral Tablet [Reprexain] biolink:Drug
RXNORM:206917 Ibuprofen 800 MG Oral Tablet [Ibu] biolink:Drug
RXNORM:206913 Ibuprofen 600 MG Oral Tablet [Ibu] biolink:Drug
RXNORM:206905 Ibuprofen 400 MG Oral Tablet [Ibu] biolink:Drug
RXNORM:2178275 200 ML Ibuprofen 4 MG/ML Injection [Caldolor] biolink:Drug
RXNORM:2178273 200 ML Ibuprofen 4 MG/ML Injection biolink:Drug
RXNORM:2178274 Ibuprofen 4 MG/ML [Caldolor] biolink:Drug
RXNORM:2178272 Ibuprofen 4 MG/ML biolink:Drug
RXNORM:377956 Ibuprofen Topical Gel biolink:Drug
RXNORM:758973 Hydrocodone / Ibuprofen Oral Tablet [Ibudone] biolink:Drug
RXNORM:901814 Diphenhydramine Hydrochloride 25 MG / Ibuprofen 200 MG Oral Capsule biolink:Drug
RXNORM:901817 Diphenhydramine Citrate 38 MG / Ibuprofen 200 MG [Advil PM] biolink:Drug
RXNORM:901818 Diphenhydramine Citrate 38 MG / Ibuprofen 200 MG Oral Tablet [Advil PM] biolink:Drug
RXNORM:817356 Acetaminophen / Codeine / Ibuprofen biolink:Drug
RXNORM:1049589 Ibuprofen 400 MG / Oxycodone Hydrochloride 5 MG Oral Tablet biolink:Drug
RXNORM:1299088 Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG [Wal-Profen Cold and Sinus] biolink:Drug
RXNORM:1299089 Ibuprofen 200 MG / Pseudoephedrine Hydrochloride 30 MG Oral Tablet [Wal-Profen Cold and Sinus] biolink:Drug
RXNORM:567695 Ibuprofen 200 MG [Nuprin] biolink:Drug
RXNORM:567680 Ibuprofen 20 MG/ML [Advil] biolink:Drug
RXNORM:567688 Ibuprofen 200 MG [Genpril] biolink:Drug
RXNORM:710303 Codeine / Ibuprofen biolink:Drug
RXNORM:401976 Ibuprofen 300 MG / Pseudoephedrine 45 MG Oral Capsule biolink:Drug
RXNORM:1310487 Ibuprofen 20 MG/ML / Pseudoephedrine Hydrochloride 3 MG/ML Oral Suspension biolink:Drug
RXNORM:1310499 Chlorpheniramine / Ibuprofen / Phenylephrine Oral Product biolink:Drug
RXNORM:895658 Diphenhydramine / Ibuprofen Oral Tablet [Motrin PM] biolink:Drug
RXNORM:895666 Diphenhydramine Citrate 38 MG / Ibuprofen 200 MG Oral Tablet [Motrin PM] biolink:Drug
RXNORM:895664 Diphenhydramine Citrate 38 MG / Ibuprofen 200 MG Oral Tablet biolink:Drug
RXNORM:895665 Diphenhydramine Citrate 38 MG / Ibuprofen 200 MG [Motrin PM] biolink:Drug
RXNORM:1310502 Chlorpheniramine / Ibuprofen / Phenylephrine biolink:Drug
RXNORM:1310503 Chlorpheniramine Maleate 4 MG / Ibuprofen 200 MG / Phenylephrine Hydrochloride 10 MG Oral Tablet biolink:Drug
RXNORM:1310500 Chlorpheniramine / Ibuprofen / Phenylephrine Pill biolink:Drug
RXNORM:1310501 Chlorpheniramine / Ibuprofen / Phenylephrine Oral Tablet biolink:Drug
RXNORM:377325 Ibuprofen Topical Spray biolink:Drug
RXNORM:250418 Ibuprofen 800 MG Extended Release Oral Tablet biolink:Drug
RXNORM:1100064 Famotidine / Ibuprofen Oral Tablet biolink:Drug
RXNORM:1100065 Famotidine / Ibuprofen biolink:Drug
RXNORM:1100068 Famotidine 26.6 MG / Ibuprofen 800 MG [Duexis] biolink:Drug
RXNORM:1100069 Famotidine / Ibuprofen Oral Tablet [Duexis] biolink:Drug
RXNORM:1100066 Famotidine 26.6 MG / Ibuprofen 800 MG Oral Tablet biolink:Drug
RXNORM:1100070 Famotidine 26.6 MG / Ibuprofen 800 MG Oral Tablet [Duexis] biolink:Drug
RXNORM:483322 Ibuprofen / Oxycodone Oral Tablet biolink:Drug
RXNORM:226617 Ibuprofen 50 MG/ML Topical Foam biolink:Drug
RXNORM:214652 Ibuprofen / Pseudoephedrine biolink:Drug
RXNORM:792241 Ibuprofen Chewable Tablet [Motrin] biolink:Drug
RXNORM:792240 Ibuprofen 100 MG [Motrin] biolink:Drug
RXNORM:792242 Ibuprofen 100 MG Chewable Tablet [Motrin] biolink:Drug
RXNORM:214627 Hydrocodone / Ibuprofen biolink:Drug
RXNORM:902632 Diphenhydramine / Ibuprofen Oral Capsule [Advil PM Liqui Gels] biolink:Drug
RXNORM:902633 Diphenhydramine Hydrochloride 25 MG / Ibuprofen 200 MG Oral Capsule [Advil PM Liqui Gels] biolink:Drug
RXNORM:902631 Diphenhydramine Hydrochloride 25 MG / Ibuprofen 200 MG [Advil PM Liqui Gels] biolink:Drug
RXNORM:153008 Ibuprofen 200 MG Oral Tablet [Advil] biolink:Drug
RXNORM:377732 Ibuprofen Topical Cream biolink:Drug
RXNORM:370674 Ibuprofen Oral Tablet biolink:Drug
RXNORM:370673 Ibuprofen Chewable Tablet biolink:Drug
RXNORM:370672 Ibuprofen Oral Suspension biolink:Drug
RXNORM:370678 Ibuprofen / Pseudoephedrine Oral Tablet biolink:Drug
RXNORM:370677 Ibuprofen / Pseudoephedrine Oral Suspension biolink:Drug
RXNORM:370676 Hydrocodone / Ibuprofen Oral Tablet biolink:Drug
RXNORM:370675 Ibuprofen Oral Capsule biolink:Drug
RXNORM:1359097 Ibuprofen 200 MG Oral Tablet [Ibutab] biolink:Drug
RXNORM:1359093 Ibuprofen 200 MG [Ibutab] biolink:Drug
RXNORM:1359094 Ibuprofen Oral Tablet [Ibutab] biolink:Drug
RXNORM:818102 Acetaminophen / Ibuprofen biolink:Drug
RXNORM:206878 Ibuprofen 20 MG/ML Oral Suspension [Advil] biolink:Drug
RXNORM:206886 Ibuprofen 200 MG Oral Tablet [Genpril] biolink:Drug
RXNORM:206893 Ibuprofen 200 MG Oral Tablet [Nuprin] biolink:Drug
RXNORM:404789 Chlorpheniramine / Ibuprofen / Pseudoephedrine biolink:Drug
RXNORM:1154775 Chlorpheniramine / Ibuprofen / Pseudoephedrine Oral Liquid Product biolink:Drug
RXNORM:1154776 Chlorpheniramine / Ibuprofen / Pseudoephedrine Oral Product biolink:Drug
RXNORM:1154777 Chlorpheniramine / Ibuprofen / Pseudoephedrine Pill biolink:Drug
RXNORM:1154818 Codeine / Ibuprofen Oral Product biolink:Drug
RXNORM:1154819 Codeine / Ibuprofen Pill biolink:Drug
RXNORM:1791362 Ibuprofen Injection [Caldolor] biolink:Drug
RXNORM:1791366 Ibuprofen Injection [Neoprofen] biolink:Drug
RXNORM:859331 Hydrocodone Bitartrate 10 MG / Ibuprofen 200 MG Oral Tablet [Reprexain] biolink:Drug
RXNORM:859330 Hydrocodone Bitartrate 10 MG / Ibuprofen 200 MG [Reprexain] biolink:Drug
RXNORM:859315 Hydrocodone Bitartrate 10 MG / Ibuprofen 200 MG Oral Tablet biolink:Drug
RXNORM:859317 Hydrocodone Bitartrate 10 MG / Ibuprofen 200 MG Oral Tablet [Ibudone] biolink:Drug
RXNORM:859316 Hydrocodone Bitartrate 10 MG / Ibuprofen 200 MG [Ibudone] biolink:Drug
RXNORM:310965 Ibuprofen 200 MG Oral Tablet biolink:Drug
RXNORM:310963 Ibuprofen 100 MG Chewable Tablet biolink:Drug
RXNORM:310964 Ibuprofen 200 MG Oral Capsule biolink:Drug
RXNORM:1101917 Ibuprofen 200 MG [Counteract IB] biolink:Drug
RXNORM:1101918 Ibuprofen Oral Tablet [Counteract IB] biolink:Drug
RXNORM:1101919 Ibuprofen 200 MG Oral Tablet [Counteract IB] biolink:Drug
RXNORM:731528 Ibuprofen Chewable Tablet [Advil] biolink:Drug
RXNORM:731529 Ibuprofen 50 MG Chewable Tablet [Advil] biolink:Drug
RXNORM:731527 Ibuprofen 50 MG [Advil] biolink:Drug
RXNORM:731535 Ibuprofen 100 MG Oral Tablet [Advil] biolink:Drug
RXNORM:731536 Ibuprofen 100 MG Chewable Tablet [Advil] biolink:Drug
RXNORM:731533 Ibuprofen 200 MG Oral Capsule [Advil] biolink:Drug
RXNORM:731534 Ibuprofen 100 MG [Advil] biolink:Drug
RXNORM:731531 Ibuprofen 40 MG/ML Oral Suspension [Advil] biolink:Drug
RXNORM:731532 Ibuprofen Oral Capsule [Advil] biolink:Drug
RXNORM:731530 Ibuprofen 40 MG/ML [Advil] biolink:Drug
RXNORM:227159 Ibuprofen 200 MG Extended Release Oral Capsule biolink:Drug
RXNORM:858798 Hydrocodone Bitartrate 7.5 MG / Ibuprofen 200 MG Oral Tablet biolink:Drug
RXNORM:858783 Hydrocodone Bitartrate 5 MG / Ibuprofen 200 MG [Reprexain] biolink:Drug
RXNORM:858780 Hydrocodone Bitartrate 5 MG / Ibuprofen 200 MG Oral Tablet [Ibudone] biolink:Drug
RXNORM:858784 Hydrocodone Bitartrate 5 MG / Ibuprofen 200 MG Oral Tablet [Reprexain] biolink:Drug
RXNORM:858772 Hydrocodone Bitartrate 2.5 MG / Ibuprofen 200 MG Oral Tablet [Reprexain] biolink:Drug
RXNORM:858771 Hydrocodone Bitartrate 2.5 MG / Ibuprofen 200 MG [Reprexain] biolink:Drug
RXNORM:858770 Hydrocodone Bitartrate 2.5 MG / Ibuprofen 200 MG Oral Tablet biolink:Drug
RXNORM:858779 Hydrocodone Bitartrate 5 MG / Ibuprofen 200 MG [Ibudone] biolink:Drug
RXNORM:858778 Hydrocodone Bitartrate 5 MG / Ibuprofen 200 MG Oral Tablet biolink:Drug
RXNORM:1292323 Diphenhydramine Citrate 38 MG / Ibuprofen 200 MG Oral Capsule biolink:Drug
RXNORM:541713 Ibuprofen 800 MG Oral Tablet [Samson 8] biolink:Drug
RXNORM:541712 Ibuprofen Oral Tablet [Samson 8] biolink:Drug
RXNORM:541711 Ibuprofen 800 MG [Samson 8] biolink:Drug
RXNORM:93358 Ibuprofen Oral Tablet [Motrin] biolink:Drug
RXNORM:1542984 Hydrocodone Bitartrate 10 MG / Ibuprofen 200 MG [Xylon] biolink:Drug
RXNORM:1542988 Hydrocodone Bitartrate 10 MG / Ibuprofen 200 MG Oral Tablet [Xylon] biolink:Drug
RXNORM:1542985 Hydrocodone / Ibuprofen Oral Tablet [Xylon] biolink:Drug
RXNORM:1747293 Ibuprofen Injection biolink:Drug
RXNORM:1747294 2 ML Ibuprofen 10 MG/ML Injection biolink:Drug
RXNORM:687386 Ibuprofen / LEVOMENTHOL biolink:Drug
RXNORM:858838 Hydrocodone Bitartrate 7.5 MG / Ibuprofen 200 MG Oral Tablet [Vicoprofen] biolink:Drug
RXNORM:858837 Hydrocodone Bitartrate 7.5 MG / Ibuprofen 200 MG [Vicoprofen] biolink:Drug
RXNORM:379847 Ibuprofen 3 MG/ML biolink:Drug
RXNORM:850424 Ibuprofen 200 MG Oral Tablet [Ibuprohm] biolink:Drug
RXNORM:850423 Ibuprofen Oral Tablet [Ibuprohm] biolink:Drug
RXNORM:850422 Ibuprofen 200 MG [Ibuprohm] biolink:Drug
RXNORM:2184152 Ibuprofen 200 MG / Phenylephrine Hydrochloride 5 MG Oral Tablet biolink:Drug
RXNORM:997280 Codeine Phosphate 20 MG / Ibuprofen 300 MG Extended Release Oral Tablet biolink:Drug
RXNORM:1156280 Ibuprofen Topical Product biolink:Drug
RXNORM:1156275 Ibuprofen Injectable Product biolink:Drug
RXNORM:1156278 Ibuprofen Pill biolink:Drug
RXNORM:1156277 Ibuprofen Oral Product biolink:Drug
RXNORM:1156276 Ibuprofen Oral Liquid Product biolink:Drug
RXNORM:997165 Codeine Phosphate 12.8 MG / Ibuprofen 200 MG Oral Tablet biolink:Drug
RXNORM:997164 Codeine Phosphate 12.5 MG / Ibuprofen 200 MG Oral Tablet biolink:Drug
RXNORM:365861 Ibuprofen Oral Suspension [Advil] biolink:Drug
RXNORM:806013 Ibuprofen 100 MG Oral Tablet [Motrin] biolink:Drug
RXNORM:1597118 Chondroitin Sulfates / Glucosamine / Ibuprofen biolink:Drug
RXNORM:91703 Ibuprofen Oral Tablet [Advil] biolink:Drug
RXNORM:141998 Ibuprofen 50 MG/ML Topical Cream biolink:Drug
RXNORM:141997 Ibuprofen 0.05 MG/MG Topical Gel biolink:Drug
RXNORM:141993 Ibuprofen 3 MG/ML Oral Suspension biolink:Drug
RXNORM:851211 60 (caffeine 65 MG / riboflavin 6.25 MG / thiamine 25 MG / vitamin B 12 0.125 MG / vitamin B6 25 MG Oral Capsule) / 60 (ibuprofen 800 MG Oral Tablet) Pack biolink:Drug
RXNORM:1162789 Hydrocodone / Ibuprofen Pill biolink:Drug
RXNORM:1162788 Hydrocodone / Ibuprofen Oral Product biolink:Drug
RXNORM:405928 Chlorpheniramine / Ibuprofen / Pseudoephedrine Oral Tablet biolink:Drug
I wonder if we can simplify this list further so we only compute on a handful of these rather than the huge list.
I'm not sure whether this method can remove some of the generic concepts in KG2c, but I want to point out the problem here. Some nodes in KG2c (see the list below) have a generic semantic meaning and might also never appear in a query (e.g. MONDO:0004992, which is cancer, and SO:0001217, which is protein_coding_gene). These nodes normally have an extremely high in-degree.
Please ignore the accuracy of the category column below: the table is summarized from my local version of KG2c, which excluded some node types (e.g. biolink:NamedThing, biolink:MolecularEntity) and caused NodeSynonymizer to assign some wrong categories.
curie_id | name | category | indegree | outdegree |
---|---|---|---|---|
SO:0001217 | protein_coding_gene | biolink:Gene | 97419 | 0 |
LOINC:LP208893-0 | Pt | biolink:Procedure | 83179 | 1 |
CHEMBL.COMPOUND:CHEMBL87852 | Hexadecanoic acid (S)-2-hexadecanoyloxy-1-hydr... | biolink:ChemicalSubstance | 59922 | 20571 |
UMLS:C0025255 | Membrane | biolink:GrossAnatomicalStructure | 59623 | 2243 |
CHEMBL.COMPOUND:CHEMBL307679 | Phosphoric acid mono-[5-(4-amino-2-oxo-2H-pyri... | biolink:ChemicalSubstance | 57431 | 35842 |
CHEMBL.COMPOUND:CHEMBL1623949 | | biolink:ChemicalSubstance | 54698 | 51477 |
CHEMBL.COMPOUND:CHEMBL2286758 | 1-palmitoyl-2-(3-trans)-hexadecenoyl-sn-glycer... | biolink:ChemicalSubstance | 50788 | 31647 |
KEGG:C00269 | CDP-diacylglycerol | biolink:Metabolite | 42988 | 30127 |
LOINC:LP7753-9 | Qn | biolink:Procedure | 41873 | 0 |
CHEMBL.COMPOUND:CHEMBL3343985 | Trilinolein | biolink:ChemicalSubstance | 39900 | 11874 |
DRUGBANK:DB03429 | Tetrastearoyl cardiolipin | biolink:ChemicalSubstance | 38460 | 20150 |
MONDO:0000001 | disease or disorder | biolink:Disease | 26125 | 9246 |
UMLS:C0007634 | Cell | biolink:Cell | 25666 | 8887 |
LOINC:LP7751-3 | Ord | biolink:Procedure | 24643 | 0 |
LOINC:LP7567-3 | Ser | biolink:Procedure | 21673 | 0 |
MONDO:0004992 | cancer | biolink:Disease | 21311 | 10623 |
CHEBI:15378 | hydron | biolink:ChemicalSubstance | 21017 | 54538 |
CHEBI:36080 | protein | biolink:Protein | 20927 | 1032 |
CHEMBL.COMPOUND:CHEMBL1098659 | WATER | biolink:ChemicalSubstance | 19740 | 60653 |
PR:000029067 | Homo sapiens protein | biolink:Protein | 19108 | 1 |
PR:000029032 | Mus musculus protein | biolink:Protein | 17115 | 1 |
LOINC:LA4634-7 | Patient | biolink:Procedure | 16877 | 0 |
UMLS:C0040300 | Portion of tissue | biolink:GrossAnatomicalStructure | 16106 | 2477 |
PR:000029045 | Arabidopsis thaliana protein | biolink:Protein | 15834 | 1 |
CHEMBL.COMPOUND:CHEMBL1488784 | SID11113658 | biolink:ChemicalSubstance | 15825 | 16530 |
OMIM:MTHU000046 | Growth | biolink:PhenotypicFeature | 15342 | 2644 |
CHEMBL.COMPOUND:CHEMBL3321993 | TF | biolink:ChemicalSubstance | 14334 | 12663 |
LOINC:LP20667-9 | Ab | biolink:Procedure | 14307 | 0 |
UMLS:C0006104 | Brain | biolink:GrossAnatomicalStructure | 13475 | 686 |
VANDF:4017451 | Liver | biolink:ChemicalSubstance | 13091 | 833 |
LOINC:MTHU000096 | Microbiology | biolink:Procedure | 12785 | 1 |
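
A table like the one above could be produced by counting in- and out-degree over the edge list. This is a minimal sketch assuming edges are available as (subject, object) curie pairs; the function names and the toy edges are hypothetical and not taken from the KG2c tooling.

```python
from collections import Counter

def degree_table(edges):
    """Return (indegree, outdegree) Counters from (subject, object) pairs."""
    indegree, outdegree = Counter(), Counter()
    for subject, obj in edges:
        outdegree[subject] += 1
        indegree[obj] += 1
    return indegree, outdegree

def top_hubs(edges, k=3):
    """List the k nodes with the highest in-degree as (node, indegree, outdegree)."""
    indegree, outdegree = degree_table(edges)
    return [(node, count, outdegree[node]) for node, count in indegree.most_common(k)]

# Toy edges, invented for illustration only.
toy_edges = [("GENE:1", "SO:0001217"), ("GENE:2", "SO:0001217"), ("SO:0001217", "SO:X")]
print(top_hubs(toy_edges, k=1))
```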
Here’s an oddball idea: if a bioentity never shows up in any pubmed abstract, it’s probably not “too important.” This wouldn’t get rid of terms like “Microbiology” and “brain”, but it would remove things like “1-palmitoyl-2-(3-trans)-hexadecenoyl-sn-glycer...” And just a side note: I think some care will be needed for the generic terms. I have seen SME queries that ask things like “which genes are expressed in the liver?” So we would want to keep that generic term.
That is an interesting question for the FastNGDers (@finnagin @amykglen ?): of the 6.1 million nodes in KG2.5.2C, how many have at least one PMID associated with them in our database? That alone may chop the list down substantially, although probably not enough. One thing doesn't make sense to me. KG2.5.2 has 10 million nodes, while KG2.5.2C has 6 million nodes. Not a big drop. Yet nearly every concept in KG2C that I've cared about has had at least a dozen nodes in its cluster. This suggests that there are millions of nodes that probably have no friends, and I wonder if they're useful.
As an example, I notice that we have 1.78 million nodes that are just NCBITaxons. I wonder if this is really useful, and whether we could remove 1.77 million NCBITaxon nodes without sacrificing any practical query capability.
yeah, I believe only 1.6 million KG2c nodes have one or more PMIDs in the fast NGD database. helps quite a bit for sure, though 1.6m * 1.6m is probably still too much. :)
(and indeed I think the majority of nodes in KG2c are almost never returned in ARAX queries. for example, it's overwhelmingly the nodes with PMIDs that get returned in ARAX queries; that's why the fastNGD 'hit rate' is in the 99% range, even though only about a quarter of the KG2c nodes have any PMIDs in the fastNGD database.)
I have some fanatical programming friends who insist that the smallest possible program that can still do the job is the best one. I wonder if some element of this ethos can be applied to KG2C? What is the smallest possible number of nodes we can have without sacrificing much at all?
good question. :) I believe @timsyoon found that there are about 1.9 million isolated nodes in KG2c. those will of course never be returned in Translator queries, since they're not connected to anything. that's a good chunk right there we could probably get rid of with zero impact!
of course on the flip side, one could argue that those are exactly the kind of nodes that we want to look for edges for. So that they become connected!
Just not the ones that are "ibuprofen 21 mg", "ibuprofen 37 mg" etc.
1.6M^2 = 2.56 trillion shouldn't be too much ;) if we start with those at least and keep track of the hit rate, I think that could work. Just need more silicon to throw at the problem
> 1.6M^2 = 2.56 trillion shouldn't be too much
@dkoslicki, based on my investigation, computing 2.56 trillion pairs in parallel on our server would probably take 139 days, and even longer if we want to avoid affecting other users' jobs. We might need more computational resources.
NCATS has provided us funds to do such large scale computations (and thankfully, as opposed to DTD, this database will rarely if ever need to be updated, and even then, the whole thing will not need to be updated, just new entries).
Let me know approximately how many core hours this would take, and I can see what ACI can do for us.
@dkoslicki, I basically used the same approach as I did for building the DTD probability database. For each of the 1.6M nodes, I submitted a job that calculates the ngd score between this node and all the other 1.6M nodes. Each job uses only one process via the map function in python (for some reason, I found that using the map function runs even faster than multiprocessing). Each job consumes:
User time (seconds): 99.06
System time (seconds): 17.46
Percent of CPU this job got: 107%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:48.18
Maximum resident set size (kbytes): 17342956 (~17GB)
So I can only run around 25 jobs at a time, which consumes around 400 - 500 GB of RAM. Theoretically, each job uses only one core and around 17GB, but multiple jobs running at the same time on the same server might affect each other. So I think it would be better to get some computational resources from ACI that can automatically assign different jobs to different cores, each with access to ~17GB of RAM.
@dkoslicki, if you remember, we previously purchased a virtual cluster from ACI, but so far they haven't been able to help us resolve the job allocation problem, which means that we can't submit too many jobs at the same time. If we submit, say, 1000 jobs at once, it might cause problems with job allocation for the vcores.
Assuming the jobs don't affect each other, each job takes around 2 minutes to calculate the ngd scores between one curie and the other 1.6M nodes. Since we have 1,672,684 nodes in total, we can finish all computations in about a week (1672684 / (30 jobs per hour x 24 hours per day x 300 concurrent jobs) = 7.74 days) if we can run 300 jobs at the same time.
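The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in this comment (1,672,684 nodes, about 2 minutes per job, 300 concurrent jobs); the function names are hypothetical, and the core-hour figure assumes one core per job with no contention between jobs.

```python
def estimate_days(n_nodes, minutes_per_job, concurrent_jobs):
    """Wall-clock days if jobs run in fixed-size concurrent batches with no slowdown."""
    jobs_per_day_per_slot = 24 * 60 / minutes_per_job
    return n_nodes / (jobs_per_day_per_slot * concurrent_jobs)

def core_hours(n_nodes, minutes_per_job):
    """Total core-hours, assuming each job occupies exactly one core."""
    return n_nodes * minutes_per_job / 60

print(round(estimate_days(1_672_684, 2, 300), 2))  # 7.74
print(round(core_hours(1_672_684, 2)))             # 55756
```

Under these assumptions the whole run is on the order of 56 thousand core-hours, with each concurrent job needing roughly 17GB of RAM.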
I'm thinking it would be sensible to try a small-scale experiment to see if the approach yields useful results before starting thousands of hours of computation. Is there already a pilot? Perhaps run just the MONDOs against all 20k Swiss-Prot reviewed proteins? Can we reproduce some known connections? Can we generate some plausible new ones that we would want to report?
> Perhaps run just the MONDOs against all 20k Swiss-Prot reviewed proteins?
Great idea, @edeutsch. I can have a try.
> Can we reproduce some known connections? Can we generate some plausible new ones that we would want to report?
Hi @edeutsch, I have already computed the ngd scores of all MONDOs against all 20k Swiss-Prot reviewed proteins. How can we know if it can reproduce some known connections? Or generate some plausible new ones? Is there a threshold to filter them for checking some known connections?
Can you post a plot and some summary statistics of the NGDs that you calculated @chunyuma? That will help in determining what constitutes a meaningfully "small" NGD score
Agreed, and I also think that looking at a few examples would be useful.
Example 1: MONDO:0013989
{
"edges": {
"e00": {
"subject": "n00",
"object": "n01"
}
},
"nodes": {
"n00": {
"ids": ["MONDO:0013989"]
},
"n01": {
"categories": ["biolink:Protein"]
}
}
}
Current ARAX is returning 69 results, but only the top 5 have NGDs. The rest, no NGDs. What are the top 50 proteins for MONDO:0013989 based on your calculation? Do they overlap with the current answer?
2) Example of a current case where we have nothing: MONDO:0014001:
{
"edges": {
"e00": {
"subject": "n00",
"object": "n01"
}
},
"nodes": {
"n00": {
"ids": ["MONDO:0014001"]
},
"n01": {
"categories": ["biolink:Protein"]
}
}
}
This returns nothing. What are the top 50 NGD links from your computation? Are there any?
Thanks @dkoslicki and @edeutsch. Based on the curie_to_pmids_v1.0_KG2.6.3.sqlite database, there are a total of 22,464 UniProtKB proteins and 13,689 MONDO curies.
Here are the statistics of the NGD calculation:
Only 11,743 MONDO curies have at least one valid ngd score.
Only 20,467 UniProtKB curies have at least one valid ngd score.
count 4.156227e+07
mean 3.728171e-01
std 1.346087e-01
min 2.382131e-03
25% 2.755543e-01
50% 3.507656e-01
75% 4.456183e-01
max 1.204312e+00
Here is the distribution of all NGD scores for all MONDOs against all 20k Swiss-Prot reviewed proteins
For example 1: MONDO:0013989, here are the top 50 proteins:
MONDO | protein | ngd_score |
---|---|---|
MONDO:0013989 | UniProtKB:Q6UVM3 | 0.135096 |
MONDO:0013989 | UniProtKB:Q15822 | 0.139354 |
MONDO:0013989 | UniProtKB:Q9H936 | 0.156994 |
MONDO:0013989 | UniProtKB:P17787 | 0.161801 |
MONDO:0013989 | UniProtKB:Q9P2E7 | 0.175193 |
MONDO:0013989 | UniProtKB:O43526 | 0.180483 |
MONDO:0013989 | UniProtKB:O76039 | 0.184613 |
MONDO:0013989 | UniProtKB:O43307 | 0.203182 |
MONDO:0013989 | UniProtKB:P61764 | 0.203552 |
MONDO:0013989 | UniProtKB:Q86Y07 | 0.207103 |
MONDO:0013989 | UniProtKB:Q07699 | 0.215191 |
MONDO:0013989 | UniProtKB:Q8N7X2 | 0.215684 |
MONDO:0013989 | UniProtKB:Q96MP8 | 0.216447 |
MONDO:0013989 | UniProtKB:Q5RIA9 | 0.218315 |
MONDO:0013989 | UniProtKB:Q9H1X3 | 0.218315 |
MONDO:0013989 | UniProtKB:Q9H2S1 | 0.218374 |
MONDO:0013989 | UniProtKB:Q9P2G4 | 0.220513 |
MONDO:0013989 | UniProtKB:Q96H35 | 0.220513 |
MONDO:0013989 | UniProtKB:Q96MA6 | 0.222343 |
MONDO:0013989 | UniProtKB:Q13303 | 0.223599 |
MONDO:0013989 | UniProtKB:Q9NX38 | 0.223663 |
MONDO:0013989 | UniProtKB:Q9BS92 | 0.224072 |
MONDO:0013989 | UniProtKB:Q5VVW2 | 0.224072 |
MONDO:0013989 | UniProtKB:Q3KQV9 | 0.224072 |
MONDO:0013989 | UniProtKB:Q5JVG2 | 0.224072 |
MONDO:0013989 | UniProtKB:Q96LW7 | 0.224072 |
MONDO:0013989 | UniProtKB:Q86W47 | 0.225216 |
MONDO:0013989 | UniProtKB:Q8NBV4 | 0.225563 |
MONDO:0013989 | UniProtKB:Q5THR3 | 0.225563 |
MONDO:0013989 | UniProtKB:O75121 | 0.225563 |
MONDO:0013989 | UniProtKB:Q5VXU9 | 0.225563 |
MONDO:0013989 | UniProtKB:Q6ZW05 | 0.226913 |
MONDO:0013989 | UniProtKB:Q14929 | 0.226913 |
MONDO:0013989 | UniProtKB:Q8N228 | 0.226913 |
MONDO:0013989 | UniProtKB:Q96K62 | 0.226913 |
MONDO:0013989 | UniProtKB:A2A3K4 | 0.226913 |
MONDO:0013989 | UniProtKB:Q9P2F6 | 0.226913 |
MONDO:0013989 | UniProtKB:Q6NUM6 | 0.226913 |
MONDO:0013989 | UniProtKB:Q56UQ5 | 0.227683 |
MONDO:0013989 | UniProtKB:Q8N0Z9 | 0.227683 |
MONDO:0013989 | UniProtKB:Q6ZMW2 | 0.227683 |
MONDO:0013989 | UniProtKB:Q96NJ1 | 0.227683 |
MONDO:0013989 | UniProtKB:Q8NFD4 | 0.227683 |
MONDO:0013989 | UniProtKB:Q6P2C0 | 0.227683 |
MONDO:0013989 | UniProtKB:Q5T011 | 0.227914 |
MONDO:0013989 | UniProtKB:Q5VTE6 | 0.228149 |
MONDO:0013989 | UniProtKB:Q6PF06 | 0.228149 |
MONDO:0013989 | UniProtKB:Q8N4T4 | 0.228149 |
MONDO:0013989 | UniProtKB:Q9Y2H8 | 0.228149 |
MONDO:0013989 | UniProtKB:Q6ZSA7 | 0.228149 |
Of the top 5 proteins with NGD returned by ARAX, only UniProtKB:Q6UVM3 is matched. For some reason, UniProtKB:P78508 and UniProtKB:Q9NS40 are not in the curie_to_pmids_v1.0_KG2.6.3.sqlite database. I guess ARAX is probably still using an older version of KG2 rather than 2.6.3.
For example 2: MONDO:0014001, it also doesn't have any NGD scores with any proteins. ARAX also reports the error "No paths were found in {'BTE', 'RTX-KG2'} satisfying qedge e00" when I ran:
{
"edges": {
"e00": {
"subject": "n00",
"object": "n01"
}
},
"nodes": {
"n00": {
"ids": ["MONDO:0014001"]
},
"n01": {
"categories": ["biolink:Protein"]
}
}
}
@chunyuma would you generate the histogram with 0.01 NGD score resolution?
@edeutsch, here is the histogram with 0.01 resolution:
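A histogram at this resolution could be binned along the following lines. This is only a sketch with a hypothetical helper name (`histogram_counts`); the example scores are a few values copied from the tables in this thread, not the full dataset.

```python
import math
from collections import Counter


def histogram_counts(scores, bin_width=0.01):
    """Count scores per bin; each key is the lower edge of its bin."""
    counts = Counter()
    for score in scores:
        # The small epsilon keeps values such as 0.29 from falling into the
        # bin below because of floating-point division (0.29 / 0.01 < 29).
        index = math.floor(score / bin_width + 1e-9)
        counts[round(index * bin_width, 10)] += 1
    return dict(sorted(counts.items()))


example_bins = histogram_counts([0.131265, 0.135096, 0.139354, 0.161792, 0.29])
```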
For some reason, UniProtKB:P78508 and UniProtKB:Q9NS40 are not in the curie_to_pmids_v1.0_KG2.6.3.sqlite database. I guess ARAX is probably still using an older version of KG2 rather than 2.6.3.
ARAX is still using 2.5.2, since I think there are still too many issues with the 2.6.x series to deploy.
But I'm concerned about P78508. Are you saying that P78508 is not in KG2.6.3? Or there are no PMIDs associated with it?
Either way, this seems concerning and something we should follow up on. P78508 is a classic reviewed UniProtKB/Swiss-Prot protein, available since 1997 with many publications associated with it in UniProtKB. If we lost it, we should figure out why.
Are you saying that P78508 is not in KG2.6.3? Or there are no PMIDs associated with it?
I think the v2.6.3 Nodesynonymizer clustered UniProtKB:P78508 with MONDO:0010134. And it seems like MONDO:0010134 also doesn't have PMIDs.
n.id | n.category | n.equivalent_curies | n.publications |
---|---|---|---|
"MONDO:0010134" | "biolink:Disease" | ["CHEMBL.TARGET:CHEMBL2146348", "DOID:0060744", "ENSEMBL:ENSG00000091137", "ENSEMBL:ENSG00000168269", "ENSEMBL:ENSG00000177807", "HGNC:3815", "HGNC:6256", "HGNC:8818", "LOINC:LP35578-1", "MEDDRA:10080398", "MESH:C536648", "MONDO:0010134", "NCBIGene:2299", "NCBIGene:3766", "NCBIGene:5172", "NCIT:C121745", "OMIM:274600", "OMIM:601093", "OMIM:602208", "OMIM:605646", "ORPHANET:231422", "ORPHANET:705", "PR:000001979", "PR:000007625", "PR:P78508", "PR:Q12951", "REACT:R-HSA-425403", "REACT:R-HSA-5627850", "REACT:R-HSA-5627857", "REACT:R-HSA-5627860", "REACT:R-HSA-5627865", "REACT:R-HSA-5627873", "REACT:R-HSA-975290", "SNOMED:70348004", "UMLS:C0271829", "UMLS:C1414682", "UMLS:C1416577", "UMLS:C1418445", "UMLS:C3551785", "UniProtKB:O43511", "UniProtKB:P78508", "UniProtKB:Q12951"] | ["2-r", "DOI:10.1001/jamaoto.2013.4185", "DOI:10.1002/(sici)1096-8628(20000103)90:1<38::aid-ajmg8>3.0.co", "DOI:10.1002/ajmg.a.20272", "DOI:10.1002/humu.1116", "DOI:10.1002/humu.1238", "DOI:10.1002/humu.20884", "DOI:10.1002/humu.23335", "DOI:10.1002/humu.9043", "DOI:10.1002/j.1460-2075.1994.tb06827.x"] |
hmm, I suggest doing your experiment with KG2.5.2 because otherwise we will keep bumping into these KG2.6.x problems when we try to poke a little deeper. and it will be hard to compare what ARAX can currently produce to understand if we're getting an improvement.
ok, I can do it and should have results tomorrow or the day after tomorrow.
Based on the KG2.5.2 NGD database, there are a total of 24,424 UniProtKB proteins and 11,732 MONDO curies.
Here are the statistics of the NGD calculation: Only 9,375 MONDO curies have at least one valid ngd score. Only 22,136 UniProtKB curies have at least one valid ngd score.
statistic | value |
---|---|
count | 1.936320e+07 |
mean | 3.185257e-01 |
std | 1.480961e-01 |
min | 2.114451e-03 |
25% | 2.111894e-01 |
50% | 2.895769e-01 |
75% | 3.901364e-01 |
max | 1.222913e+00 |
Here is the distribution of all NGD scores for all MONDOs against all 20k Swiss-Prot reviewed proteins
For example 1: MONDO:0013989, here are the top 50 proteins:
MONDO | protein | ngd_score |
---|---|---|
MONDO:0013989 | UniProtKB:Q6UVM3 | 0.131265 |
MONDO:0013989 | UniProtKB:Q96H35 | 0.161792 |
MONDO:0013989 | UniProtKB:Q8N7X2 | 0.161792 |
MONDO:0013989 | UniProtKB:Q5RIA9 | 0.161792 |
MONDO:0013989 | UniProtKB:Q9H1X3 | 0.163765 |
MONDO:0013989 | UniProtKB:Q8N9H8 | 0.165414 |
MONDO:0013989 | UniProtKB:Q9P2G4 | 0.165414 |
MONDO:0013989 | UniProtKB:Q96GE9 | 0.165414 |
MONDO:0013989 | UniProtKB:Q8IYX7 | 0.165414 |
MONDO:0013989 | UniProtKB:Q14929 | 0.165414 |
MONDO:0013989 | UniProtKB:Q96J77 | 0.165414 |
MONDO:0013989 | UniProtKB:Q9Y2H8 | 0.165414 |
MONDO:0013989 | UniProtKB:Q86YN1 | 0.166834 |
MONDO:0013989 | UniProtKB:Q8NE28 | 0.166834 |
MONDO:0013989 | UniProtKB:Q5W0U4 | 0.166834 |
MONDO:0013989 | UniProtKB:Q9UGQ2 | 0.166834 |
MONDO:0013989 | UniProtKB:Q5VVW2 | 0.166834 |
MONDO:0013989 | UniProtKB:Q9BS92 | 0.166834 |
MONDO:0013989 | UniProtKB:Q9P2J8 | 0.166834 |
MONDO:0013989 | UniProtKB:Q96E40 | 0.166834 |
MONDO:0013989 | UniProtKB:Q5JVG2 | 0.166834 |
MONDO:0013989 | UniProtKB:Q5VXU9 | 0.166834 |
MONDO:0013989 | UniProtKB:A2A3K4 | 0.166834 |
MONDO:0013989 | UniProtKB:Q9P2F6 | 0.166834 |
MONDO:0013989 | UniProtKB:Q6PF06 | 0.166834 |
MONDO:0013989 | UniProtKB:Q3KQV9 | 0.166834 |
MONDO:0013989 | UniProtKB:Q8NBV4 | 0.166834 |
MONDO:0013989 | UniProtKB:Q5T6V5 | 0.168084 |
MONDO:0013989 | UniProtKB:Q86XA9 | 0.168084 |
MONDO:0013989 | UniProtKB:Q8TF39 | 0.168084 |
MONDO:0013989 | UniProtKB:Q9P2P1 | 0.168084 |
MONDO:0013989 | UniProtKB:Q5TYW1 | 0.169202 |
MONDO:0013989 | UniProtKB:Q96LW7 | 0.169202 |
MONDO:0013989 | UniProtKB:Q9Y6Q3 | 0.169202 |
MONDO:0013989 | UniProtKB:Q6IPU0 | 0.169202 |
MONDO:0013989 | UniProtKB:Q8N4T4 | 0.169202 |
MONDO:0013989 | UniProtKB:Q9NVG8 | 0.169202 |
MONDO:0013989 | UniProtKB:Q5VST6 | 0.169202 |
MONDO:0013989 | UniProtKB:Q8N5N7 | 0.169202 |
MONDO:0013989 | UniProtKB:O94769 | 0.170215 |
MONDO:0013989 | UniProtKB:Q96GR4 | 0.170215 |
MONDO:0013989 | UniProtKB:Q9P2D6 | 0.170215 |
MONDO:0013989 | UniProtKB:Q9P2N2 | 0.170215 |
MONDO:0013989 | UniProtKB:Q8NCR6 | 0.170215 |
MONDO:0013989 | UniProtKB:Q9NVS9 | 0.170739 |
MONDO:0013989 | UniProtKB:Q4ADV7 | 0.171006 |
MONDO:0013989 | UniProtKB:Q712K3 | 0.171142 |
MONDO:0013989 | UniProtKB:Q8N539 | 0.171142 |
MONDO:0013989 | UniProtKB:Q9Y614 | 0.171142 |
MONDO:0013989 | UniProtKB:Q6ZV29 | 0.171142 |
This matches the top 3 of the top 5 proteins with NGDs returned by ARAX.
For example 2: MONDO:0014001, it also doesn't have any ngd scores with any proteins.
Only 9,375 MONDO curies have at least one valid ngd score.
1) So of these 9375, can you determine how many of these have 0, 1, 2, 3+ known KG2C edges to proteins?
2) Of the 9375 MONDO curies that have 1 or 2 known KG2C edges to proteins, for how many (and which ones) does the NGD method reproduce them?
3) How many (and which) of the 9375 have 0 known KG2C edges to proteins?
4) Can you point to an example where this method finds a MONDO to UniProtKB NGD association that does not exist in KG2C, but that can be verified as reasonable by reading one of the implicated papers or by some other means? i.e., can you find an example that demonstrates that this approach really finds something valuable?
thanks!
- So of these 9375, can you determine how many of these have 0, 1, 2, 3+ known KG2C edges to proteins?
None of these 9375 has 0 known KG2C edges to proteins.
765 out of 9375 have 1.
667 out of 9375 have 2.
5686 have 3+.
- Of the 9375 MONDO curies that have 1 or 2 known KG2C edges to proteins, for how many (and which ones) does the NGD method reproduce them?
Of the 9375 MONDO curies that have 1 or 2 known KG2C edges to proteins, there are 1217 MONDO curies which have at least one MONDO-protein pair that is in KG2c and can be reproduced by the NGD method. For these MONDO curies, there are a total of 1428 MONDO-protein pairs. Since there are too many, I'm not listing them here.
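The reproduction check described here can be sketched roughly as follows. The function name, the two dictionary inputs, and the placeholder identifiers are all hypothetical and only illustrate the comparison; they are not the code or data actually used for the numbers above.

```python
def reproduced_pairs(known_edges, ngd_top_hits):
    """Return (mondo, protein) pairs known in KG2c that also appear in the NGD hits.

    known_edges:  dict mapping a MONDO curie to the set of proteins it is
                  connected to in KG2c.
    ngd_top_hits: dict mapping a MONDO curie to its list of top NGD proteins.
    """
    reproduced = []
    for mondo, known_proteins in known_edges.items():
        ngd_proteins = set(ngd_top_hits.get(mondo, []))
        for protein in sorted(known_proteins):
            if protein in ngd_proteins:
                reproduced.append((mondo, protein))
    return reproduced


# Placeholder identifiers, deliberately not real curies.
example_known = {"MONDO:x1": {"UniProtKB:p1", "UniProtKB:p2"}, "MONDO:x2": {"UniProtKB:p3"}}
example_ngd = {"MONDO:x1": ["UniProtKB:p2", "UniProtKB:p9"]}
example_pairs = reproduced_pairs(example_known, example_ngd)
```

Counting the distinct MONDO curies in the returned pairs gives the "how many" part of the question, and the pairs themselves give the "which ones" part.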
- How many (and which) of the 9375 have 0 known KG2C edges to proteins?
I guess you're asking how many of these 9375 have 0 known KG2C edges to proteins that the NGD method reproduces? Otherwise, it should be the same as question 1. There are 346 out of 9375 which have 0 known KG2C edges to proteins that the NGD method reproduces.
Here is the list of them:
['MONDO:0008824',
'MONDO:0013493',
'MONDO:0017449',
'MONDO:0002411',
'MONDO:0008117',
'MONDO:0016032',
'MONDO:0018543',
'MONDO:0007361',
'MONDO:0009309',
'MONDO:0006688',
'MONDO:0003197',
'MONDO:0015675',
'MONDO:0010657',
'MONDO:0011818',
'MONDO:0019214',
'MONDO:0002027',
'MONDO:0008537',
'MONDO:0017426',
'MONDO:0002158',
'MONDO:0016368',
'MONDO:0016242',
'MONDO:0011842',
'MONDO:0012237',
'MONDO:0012081',
'MONDO:0014226',
'MONDO:0011224',
'MONDO:0008269',
'MONDO:0018448',
'MONDO:0013714',
'MONDO:0008482',
'MONDO:0002967',
'MONDO:0003147',
'MONDO:0024519',
'MONDO:0001053',
'MONDO:0013385',
'MONDO:0016567',
'MONDO:0016707',
'MONDO:0004848',
'MONDO:0000115',
'MONDO:0020124',
'MONDO:0011313',
'MONDO:0054698',
'MONDO:0013573',
'MONDO:0017593',
'MONDO:0019807',
'MONDO:0008148',
'MONDO:0000754',
'MONDO:0024463',
'MONDO:0024456',
'MONDO:0006696',
'MONDO:0008990',
'MONDO:0010020',
'MONDO:0009433',
'MONDO:0006821',
'MONDO:0003633',
'MONDO:0036591',
'MONDO:0001235',
'MONDO:0008679',
'MONDO:0006008',
'MONDO:0010367',
'MONDO:0006771',
'MONDO:0006850',
'MONDO:0016991',
'MONDO:0002523',
'MONDO:0009368',
'MONDO:0019725',
'MONDO:0009970',
'MONDO:0007001',
'MONDO:0007636',
'MONDO:0020204',
'MONDO:0005743',
'MONDO:0010780',
'MONDO:0019371',
'MONDO:0002839',
'MONDO:0021804',
'MONDO:0014711',
'MONDO:0004112',
'MONDO:0011870',
'MONDO:0015009',
'MONDO:0004666',
'MONDO:0011342',
'MONDO:0056795',
'MONDO:0004845',
'MONDO:0005640',
'MONDO:0010997',
'MONDO:0006616',
'MONDO:0006996',
'MONDO:0017304',
'MONDO:0020542',
'MONDO:0007122',
'MONDO:0004633',
'MONDO:0004866',
'MONDO:0008939',
'MONDO:0009588',
'MONDO:0011018',
'MONDO:0013343',
'MONDO:0020381',
'MONDO:0004672',
'MONDO:0007723',
'MONDO:0005731',
'MONDO:0002920',
'MONDO:0011162',
'MONDO:0005624',
'MONDO:0021169',
'MONDO:0001074',
'MONDO:0002688',
'MONDO:0019078',
'MONDO:0001404',
'MONDO:0015304',
'MONDO:0016979',
'MONDO:0021020',
'MONDO:0012173',
'MONDO:0018456',
'MONDO:0015053',
'MONDO:0009054',
'MONDO:0003741',
'MONDO:0007781',
'MONDO:0021366',
'MONDO:0015522',
'MONDO:0011891',
'MONDO:0013099',
'MONDO:0010302',
'MONDO:0003182',
'MONDO:0016426',
'MONDO:0007105',
'MONDO:0007543',
'MONDO:0007662',
'MONDO:0000741',
'MONDO:0018631',
'MONDO:0032644',
'MONDO:0001797',
'MONDO:0018466',
'MONDO:0008547',
'MONDO:0005909',
'MONDO:0019497',
'MONDO:0017160',
'MONDO:0018170',
'MONDO:0006534',
'MONDO:0008263',
'MONDO:0005460',
'MONDO:0011462',
'MONDO:0010490',
'MONDO:0012731',
'MONDO:0020298',
'MONDO:0013400',
'MONDO:0020300',
'MONDO:0003701',
'MONDO:0011806',
'MONDO:0006481',
'MONDO:0010571',
'MONDO:0020944',
'MONDO:0001854',
'MONDO:0000750',
'MONDO:0008292',
'MONDO:0015048',
'MONDO:0009537',
'MONDO:0020507',
'MONDO:0020713',
'MONDO:0012277',
'MONDO:0011907',
'MONDO:0014084',
'MONDO:0013843',
'MONDO:0014070',
'MONDO:0006629',
'MONDO:0007878',
'MONDO:0014937',
'MONDO:0009595',
'MONDO:0006891',
'MONDO:0000426',
'MONDO:0012651',
'MONDO:0019448',
'MONDO:0001935',
'MONDO:0014684',
'MONDO:0019967',
'MONDO:0019780',
'MONDO:0008954',
'MONDO:0007709',
'MONDO:0007798',
'MONDO:0018214',
'MONDO:0005969',
'MONDO:0008953',
'MONDO:0011452',
'MONDO:0007796',
'MONDO:0019951',
'MONDO:0019642',
'MONDO:0011921',
'MONDO:0013150',
'MONDO:0005667',
'MONDO:0012157',
'MONDO:0009415',
'MONDO:0004139',
'MONDO:0018690',
'MONDO:0024610',
'MONDO:0001431',
'MONDO:0045019',
'MONDO:0009624',
'MONDO:0021839',
'MONDO:0007791',
'MONDO:0006605',
'MONDO:0008593',
'MONDO:0005945',
'MONDO:0020366',
'MONDO:0007415',
'MONDO:0004349',
'MONDO:0013577',
'MONDO:0060690',
'MONDO:0001834',
'MONDO:0007722',
'MONDO:0001600',
'MONDO:0011413',
'MONDO:0004638',
'MONDO:0019374',
'MONDO:0005910',
'MONDO:0011546',
'MONDO:0014219',
'MONDO:0012497',
'MONDO:0007946',
'MONDO:0017825',
'MONDO:0008693',
'MONDO:0015273',
'MONDO:0007454',
'MONDO:0002354',
'MONDO:0011866',
'MONDO:0001915',
'MONDO:0008666',
'MONDO:0011139',
'MONDO:0011374',
'MONDO:0008637',
'MONDO:0000859',
'MONDO:0002102',
'MONDO:0011932',
'MONDO:0006995',
'MONDO:0018045',
'MONDO:0013288',
'MONDO:0020352',
'MONDO:0006711',
'MONDO:0010142',
'MONDO:0012611',
'MONDO:0014255',
'MONDO:0005753',
'MONDO:0000966',
'MONDO:0018198',
'MONDO:0008334',
'MONDO:0015748',
'MONDO:0019804',
'MONDO:0016418',
'MONDO:0009870',
'MONDO:0019677',
'MONDO:0001479',
'MONDO:0009728',
'MONDO:0008332',
'MONDO:0008722',
'MONDO:0007990',
'MONDO:0044768',
'MONDO:0001801',
'MONDO:0020356',
'MONDO:0009424',
'MONDO:0006447',
'MONDO:0008102',
'MONDO:0060593',
'MONDO:0000158',
'MONDO:0008105',
'MONDO:0001830',
'MONDO:0014178',
'MONDO:0007867',
'MONDO:0014592',
'MONDO:0016256',
'MONDO:0007377',
'MONDO:0018604',
'MONDO:0020843',
'MONDO:0010704',
'MONDO:0021941',
'MONDO:0016489',
'MONDO:0017086',
'MONDO:0005190',
'MONDO:0009953',
'MONDO:0008230',
'MONDO:0013049',
'MONDO:0019758',
'MONDO:0013781',
'MONDO:0009428',
'MONDO:0008371',
'MONDO:0014984',
'MONDO:0014945',
'MONDO:0004838',
'MONDO:0007007',
'MONDO:0008004',
'MONDO:0016225',
'MONDO:0022963',
'MONDO:0010779',
'MONDO:0005829',
'MONDO:0010149',
'MONDO:0016557',
'MONDO:0015275',
'MONDO:0006986',
'MONDO:0002962',
'MONDO:0015128',
'MONDO:0016001',
'MONDO:0002332',
'MONDO:0018597',
'MONDO:0022236',
'MONDO:0013930',
'MONDO:0013824',
'MONDO:0018521',
'MONDO:0024337',
'MONDO:0007677',
'MONDO:0005912',
'MONDO:0018784',
'MONDO:0007741',
'MONDO:0005787',
'MONDO:0008635',
'MONDO:0017413',
'MONDO:0001301',
'MONDO:0043310',
'MONDO:0024546',
'MONDO:0010137',
'MONDO:0009628',
'MONDO:0009733',
'MONDO:0012839',
'MONDO:0003127',
'MONDO:0006569',
'MONDO:0005774',
'MONDO:0012399',
'MONDO:0001297',
'MONDO:0008705',
'MONDO:0008736',
'MONDO:0014160',
'MONDO:0010880',
'MONDO:0006638',
'MONDO:0010884',
'MONDO:0012723',
'MONDO:0006766',
'MONDO:0021334',
'MONDO:0013564',
'MONDO:0020722',
'MONDO:0005757',
'MONDO:0002516',
'MONDO:0044740',
'MONDO:0015534',
'MONDO:0002968',
'MONDO:0017560',
'MONDO:0004348',
'MONDO:0007338',
'MONDO:0008180',
'MONDO:0021140',
'MONDO:0015016']
- Can you point to an example where this method finds a MONDO to UniProtKB NGD association that does not exist in KG2C, but that can be verified as reasonable by reading one of the implicated papers or by some other means? i.e., can you find an example that demonstrates that this approach really finds something valuable?
This might need more time to investigate.
great, thanks, this looks promising!
So of these 9375, can you determine how many of these have 0, 1, 2, 3+ known KG2C edges to proteins?
None of these 9375 has 0 known KG2C edges to proteins.
765 out of 9375 have 1.
667 out of 9375 have 2.
5686 have 3+.
hmm, but 765 + 667 + 5686 = 7118. Where are the other 9375 - 7118 = 2257?
@edeutsch, sorry, my mistake. The Cypher query didn't return the MONDO curies with 0 known connected proteins in KG2c. So the remaining 2257 don't have any known KG2C edges to proteins.
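One way to avoid silently dropping the zero-edge group is to bucket over the full list of curies rather than over the query result. The sketch below assumes a hypothetical `edge_counts` mapping built from whatever the graph query returned; curies missing from it are counted as having 0 edges.

```python
from collections import Counter


def bucket_by_edge_count(all_curies, edge_counts):
    """Bucket curies into 0, 1, 2, and 3+ known protein edges."""
    buckets = Counter({"0": 0, "1": 0, "2": 0, "3+": 0})
    for curie in all_curies:
        # A curie absent from the query result has no known edges.
        n = edge_counts.get(curie, 0)
        buckets["3+" if n >= 3 else str(n)] += 1
    return dict(buckets)


# Arithmetic check using the counts reported in this thread.
total, one, two, three_plus = 9375, 765, 667, 5686
zero = total - (one + two + three_plus)
print(zero)  # 2257
```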
ah, yes, that seems closer to what I expected. 0 would have been (was) very surprising.
So then question 3 is still relevant. The first part of the answer is 2257. The second part is which ones (I suppose that's a very long list). So more importantly, can you find examples in the 2257 that are demonstrably good? Or demonstrably bad? I suppose it would be useful to pick ~5 at random, and examine them carefully by looking at the returned PMIDs. Are the NGD results A: good, B: apparently bad, C: can't tell?
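A reproducible way to pick a handful for manual review might look like this. The helper name and the fixed seed are assumptions for illustration; the candidate curies are a few copied from this thread.

```python
import random


def pick_for_review(curies, k=5, seed=0):
    """Pick k curies at random; a fixed seed makes the pick repeatable."""
    rng = random.Random(seed)
    pool = sorted(set(curies))  # sort so the result does not depend on input order
    return rng.sample(pool, min(k, len(pool)))


candidates = [
    "MONDO:0024503", "MONDO:0002433", "MONDO:0017385", "MONDO:0016668",
    "MONDO:0019079", "MONDO:0016736", "MONDO:0007426",
]
review_set = pick_for_review(candidates)
```

Each picked curie can then be labeled A, B, or C after reading the PMIDs behind its top NGD hits.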
@edeutsch, I think some of them still have many PMIDs. I randomly picked 50 and here is a table summarizing their PMIDs:
curie | pmids | num_pmids |
---|---|---|
MONDO:0024503 | [28490723, 29970437, 30373866, 30595757, 23091... | 7 |
MONDO:0002433 | [23240704, 13795331, 9232390, 7700489, 9773066... | 3955 |
MONDO:0017385 | [30525185, 31054119, 25524840, 29037447, 26784... | 12 |
MONDO:0016668 | [27423233, 3750401, 871430, 23218697, 7573002,... | 195 |
MONDO:0019079 | [31516794, 25497877, 30061431] | 3 |
MONDO:0016736 | [25432191, 30783393, 28904578, 31250151, 23163... | 16 |
MONDO:0007426 | [9164800, 7632899, 23681028, 8534023, 19976200... | 927 |
MONDO:0009105 | [31132033, 29383842, 26526116, 27050310, 28944... | 9 |
MONDO:0015986 | [23573507, 7198213, 27123211, 3920396, 7304721... | 147 |
MONDO:0024674 | [4414979, 14290436, 7303684, 28913160, 2963457... | 154 |
MONDO:0008641 | [29322432, 30627749, 29386495] | 3 |
MONDO:0005949 | [18821121, 21450753, 23955460, 16918534, 17147... | 1090 |
MONDO:0002518 | [25351203, 20446223, 24106418, 9244117, 294947... | 6 |
MONDO:0016362 | [29169633, 22933892, 27753051, 23891684] | 4 |
MONDO:0000367 | [21749764, 23240712, 16236553, 20643854, 20807... | 2167 |
MONDO:0003043 | [27588097, 28258179, 8774664, 24716041, 106269... | 58 |
MONDO:0016680 | [28120069, 20975976, 30604394, 29760590, 27578... | 8 |
MONDO:0016595 | [29263873, 26903555, 25454087, 13319688, 22291... | 200 |
MONDO:0022513 | [24636648, 16476392, 28389037, 30847185, 81134... | 6 |
MONDO:0043218 | [27465216, 31438759, 30323116, 30797209, 31060... | 5 |
MONDO:0018755 | [29799424, 27912864, 28655685, 25984198, 29657... | 11 |
MONDO:0007861 | [7888390, 9678474, 9014285, 11506318, 19116567... | 25 |
MONDO:0018892 | [2598528, 30451208, 2191247, 17182351, 1157569... | 37 |
MONDO:0017790 | [28275686, 29968043, 31594255, 29103540, 26363... | 6 |
MONDO:0006271 | [8272897, 27990273, 9083523, 26046099, 2704437... | 34 |
MONDO:0006607 | [14237705, 13457422, 13457423, 13076496, 10586... | 377 |
MONDO:0019498 | [12340744, 29264904, 28293130, 19831312, 93646... | 148 |
MONDO:0044354 | [26622464, 23818241, 30034819, 23955459, 27142... | 82 |
MONDO:0006922 | [8421890, 15297539, 21964802, 25416710, 447898... | 315 |
MONDO:0043988 | [15334402, 14719363, 23342212, 10980741, 21467... | 54 |
MONDO:0029001 | [27366016, 27307137, 28578820, 23966726, 27528... | 49 |
MONDO:0011324 | [28988429, 20949527] | 2 |
MONDO:0018212 | [25500256] | 1 |
MONDO:0021017 | [31091456, 27918210, 30185603, 28024462, 26667... | 43 |
MONDO:0021060 | [25487361, 31825160, 23379592, 30693642, 30041... | 69 |
MONDO:0003209 | [28506304, 25861345, 26581569, 17721187, 28888... | 19 |
MONDO:0009493 | [30863896, 23954873, 6425460, 23954222] | 4 |
MONDO:0018058 | [28178944, 24113157, 10549768, 23901199, 22470... | 85 |
MONDO:0005762 | [18393603, 9893380, 9893381, 9893382, 9893383,... | 198 |
MONDO:0001594 | [10205697, 2351626, 10036748, 17409040, 192993... | 164 |
MONDO:0006455 | [28168064, 10779032, 18480395, 25979154, 18804... | 7 |
MONDO:0003125 | [31147264, 22525408, 24518791, 20850634, 24755... | 10 |
MONDO:0002168 | [13669698, 9762889, 16568146, 2323283, 2371225... | 6 |
MONDO:0004088 | [31393622, 20923443, 30695899, 24464879] | 4 |
MONDO:0004491 | [2389106, 16720931] | 2 |
MONDO:0003928 | [26866354] | 1 |
MONDO:0003051 | [9653909] | 1 |
MONDO:0011596 | [16912508] | 1 |
MONDO:0000763 | [19236704] | 1 |
MONDO:0013446 | [11283794] | 1 |
And it seems like some of them have pretty good NGD results:
MONDO | protein | ngd_score |
---|---|---|
MONDO:0009105 | UniProtKB:Q6PGP7 | 0.115045 |
MONDO:0009493 | UniProtKB:P03901 | 0.166939 |
MONDO:0000763 | UniProtKB:Q9GZX3 | 0.185350 |
MONDO:0009105 | UniProtKB:Q9BYG5 | 0.190188 |
MONDO:0021060 | UniProtKB:Q15814 | 0.197264 |
... | ... | ... |
MONDO:0002433 | UniProtKB:Q07812 | 0.867421 |
MONDO:0007426 | UniProtKB:P01375 | 0.890680 |
MONDO:0002433 | UniProtKB:P15692 | 0.894198 |
MONDO:0002433 | UniProtKB:P05019 | 0.915273 |
MONDO:0002433 | UniProtKB:P29474 | 0.925709 |
Spot checking that first pair on the bottom table with an ngd of 0.115045 looks good:
https://www.pombase.org/term/MONDO:0009105 - tricho-hepato-enteric syndrome
https://www.uniprot.org/uniprot/Q6PGP7 - Tricho-hepatic-enteric syndrome protein
the protein name contains the name of the syndrome
Thanks @chunyuma @finnagin, I think this is showing great promise. I think it would be useful to pick a few more such cases at random and document here whether they seem likely good or not. @finnagin's example above is clearly very good. Would you spot check a few more?
For this checked one above, there are also 21 results found by ARAX that are from BTE, I think: https://arax.ncats.io/?r=12764 so that seems verified. But not novel (maybe novel from KG2, but not novel from BTE).
It would be fun to find an example for which we cannot find any known edges in KG2 or BTE but yet we find one or more NGD associations that can be verified as being plausible by looking at the publications
Assuming that pans out well, then I think we have the evidence to go forward with finishing implementation in Expand(). How would we implement it? I'm thinking it would be useful to store the top 50 hits between every MONDO:x and UniProtKB:x identifiers in a SQLite database and then figure out how to query it as a data source.
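As a rough sketch of the storage idea, the top hits per subject could be kept in a single SQLite table and read back in ascending NGD order. The table name, column names, and helper functions below are hypothetical; the three example rows reuse scores reported earlier in this thread.

```python
import sqlite3


def build_ngd_db(conn, scored_pairs, top_n=50):
    """Store only the top_n lowest-NGD objects for each subject."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ngd_top_hits ("
        "subject TEXT, object TEXT, ngd_score REAL, "
        "PRIMARY KEY (subject, object))"
    )
    by_subject = {}
    for subject, obj, score in scored_pairs:
        by_subject.setdefault(subject, []).append((score, obj))
    for subject, hits in by_subject.items():
        for score, obj in sorted(hits)[:top_n]:
            conn.execute(
                "INSERT OR REPLACE INTO ngd_top_hits VALUES (?, ?, ?)",
                (subject, obj, score),
            )
    conn.commit()


def top_hits(conn, subject, limit=50):
    """Return (object, ngd_score) rows for a subject, best score first."""
    rows = conn.execute(
        "SELECT object, ngd_score FROM ngd_top_hits "
        "WHERE subject = ? ORDER BY ngd_score ASC LIMIT ?",
        (subject, limit),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
build_ngd_db(conn, [
    ("MONDO:0013989", "UniProtKB:Q96H35", 0.161792),
    ("MONDO:0013989", "UniProtKB:Q6UVM3", 0.131265),
    ("MONDO:0009105", "UniProtKB:Q6PGP7", 0.115045),
])
best = top_hits(conn, "MONDO:0013989", limit=1)
```

The composite primary key on (subject, object) also serves as the index for the per-subject lookup, so no separate index is needed for this access pattern.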
Would we always call it as another data source? Or would we only call it if we came up dry after checking "real" data sources? I don't know, but let's try stuff!
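The second option (use NGD only when the curated sources return nothing) could be sketched like this. The function name and the callable-based interface are assumptions for illustration and do not reflect how Expand() is actually structured.

```python
def expand_with_fallback(query, real_sources, ngd_source):
    """Query curated sources first; use the NGD source only if they find nothing.

    real_sources is a list of callables and ngd_source is a single callable;
    each takes the query and returns a list of results.
    """
    results = []
    for source in real_sources:
        results.extend(source(query))
    if results:
        return results, "curated"
    return ngd_source(query), "ngd_fallback"


# Stand-in sources with placeholder identifiers.
def empty_source(query):
    return []


def ngd_stub(query):
    return [("MONDO:x1", "UniProtKB:p7")]


fallback_result = expand_with_fallback("example query", [empty_source], ngd_stub)
```

Returning the provenance label alongside the results would let downstream ranking treat literature co-occurrence hits differently from curated edges.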
While we're doing that, it would be great to try to expand the set to every MONDO:x to DRUGBANK:x identifier. And maybe DRUGBANK:x to UniProtKB:x. And then again find a few examples of node pairs that we could not previously connect at all that come up with apparently good hits and validate them.
thanks!
Breaking out from #1345
I'm thinking a fun project for someone in the future:
?