dkoslicki closed this issue 4 months ago.
Summary of mini hackathon:
Assume the QG from the ARS will simply look like:

```json
{
  "edges": [
    {
      "id": "qg2",
      "source_id": "qg1",
      "target_id": "qg0"
    }
  ],
  "nodes": [
    {
      "id": "qg0",
      "curie": "DOID:1234"
    },
    {
      "id": "qg1",
      "type": "chemical_substance"
    }
  ]
}
```
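For illustration, a minimal sketch (hypothetical helper, not actual ARAX code) of how Expand could split such a QG into its pinned node (has a `curie`) and unpinned node (only a `type`):

```python
# Hypothetical sketch: identify the pinned vs. unpinned node in a 1-hop QG.
qg = {
    "edges": [{"id": "qg2", "source_id": "qg1", "target_id": "qg0"}],
    "nodes": [
        {"id": "qg0", "curie": "DOID:1234"},
        {"id": "qg1", "type": "chemical_substance"},
    ],
}

def split_nodes(query_graph):
    """Return (pinned, unpinned) node lists for a simple query graph."""
    pinned = [n for n in query_graph["nodes"] if "curie" in n]
    unpinned = [n for n in query_graph["nodes"] if "curie" not in n]
    return pinned, unpinned

pinned, unpinned = split_nodes(qg)
```

Here `pinned` holds the `DOID:1234` node and `unpinned` the `chemical_substance` node to be filled in by Expand.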
Approach A) (classic current 1-hop lookup)
- QG: `DOID:1234 ---- chemical_substance`
- Expand using KG2: gives known indicated/contraindicated drugs
- Can use the drug-treats-disease model to rank/score these

Approach B) (????)
- QG: `DOID:1234 ---- chemical_substance`
- Expand using the predict-drug-treats-disease model database to find the top N chemical_substances
- Could use NGD to rank/score

Approach C) (needs qedge_qualifier=uncommon)
- QG: `DOID:1234 ---- chemical_substance` (could have qedge_qualifier=uncommon/speculative)
- Delete `treats`/`indicated_for` connections from KG2 Expand (aggressively, meaning delete even if there are other edges)
- For `contraindicated_for`: hit up COHD to see if doctors are actually using these to treat patients; if not, remove these

Approach D) (classic 1-hop, but "exhaustive" query option)
- QG: `DOID:1234 ---- chemical_substance`
- For `contraindicated_for`: hit up COHD to see if doctors are actually using these to treat patients

Approach F)
```
DOID:1234 ---- protein ----- chemical_substance
     \                              /
      \---------- delete ----------/
```
Action items: create issues for:
I'm not sure if the predict-drug-treats-disease model database is feasible, because it will be so big. Based on my rough calculation (only 539,840,312 drug-disease pairs occupy 28 GB), the final database might be around 5 TB, covering 388,966 drugs and 322,871 diseases (125,585,841,386 pairs). If we're OK with this size, do we have any servers to store this file?
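As a sanity check, a naive linear extrapolation from the measured sample (539,840,312 pairs at 28 GB) can be done in a few lines; it actually lands a bit above 5 TB (around 6.5 TB), so the estimate above may be conservative:

```python
# Back-of-envelope storage extrapolation (linear scaling assumed).
sample_pairs = 539_840_312   # measured: these pairs occupied 28 GB
sample_gb = 28

n_drugs = 388_966
n_diseases = 322_871
total_pairs = n_drugs * n_diseases            # 125,585,841,386 pairs

est_gb = total_pairs / sample_pairs * sample_gb  # roughly 6,500 GB (~6.5 TB)
```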
@chunyuma I see a couple ways to proceed:
In both cases, you will still need to compute all 125B probabilities, but after doing 1 above, we should only need to save a fraction of those (loop over all pairs, discard those below the threshold, and only store the ones above it = never need to have a 5 TB intermediate data structure).
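The streaming idea can be sketched as follows (function and model names are hypothetical): score each pair as you go and keep only survivors, so the full table never exists at once:

```python
THRESHOLD = 0.8  # example cutoff; actual value TBD

def filter_predictions(pair_iter, predict_fn, threshold=THRESHOLD):
    """Yield (drug, disease, prob) only for pairs scoring >= threshold.

    pair_iter:  iterable of (drug_curie, disease_curie) tuples
    predict_fn: stand-in for the drug-treats-disease model call,
                returning a probability in [0, 1]
    """
    for drug, disease in pair_iter:
        prob = predict_fn(drug, disease)
        if prob >= threshold:
            yield drug, disease, prob

# Toy usage with a fake model (real model would be the PDTD predictor).
pairs = [("CHEMBL.COMPOUND:1", "DOID:1234"), ("CHEMBL.COMPOUND:2", "DOID:1234")]
fake_model = lambda drug, disease: 0.9 if drug.endswith(":1") else 0.1
kept = list(filter_predictions(pairs, fake_model))
```

Survivors would be appended directly to the database rather than collected in memory.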
Also, note: we will want to be able to do expands via PDTD (`probability_drug_treats_disease`) in both directions: (specific drug) —> (arbitrary disease) and (specific disease) —> (arbitrary drug). So you may need to create two versions of the data (or two relational databases: each, when given a specific drug or disease, returns all the diseases or drugs respectively above the threshold, along with the actual `probability_treats`).
@dkoslicki, I think one relational database is sufficient for both directions, for two reasons:

1. The structure of the database will be like: `drug`, `disease`, `probability`. So if we set 0.8 as the threshold for `probability`, it will filter out all pairs below 0.8 for both directions.
2. Since we used the Hadamard product, it doesn't matter which node is the `source` and which is the `target`.
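A minimal sketch of that single-table design (SQLite used purely as an example; the actual backend is up for grabs). With an index on each key column, lookups are fast in either direction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # illustrative; real build would use a file
conn.execute("CREATE TABLE pdtd (drug TEXT, disease TEXT, probability REAL)")
# One index per key column supports fast lookups in both directions.
conn.execute("CREATE INDEX idx_drug ON pdtd (drug)")
conn.execute("CREATE INDEX idx_disease ON pdtd (disease)")

conn.executemany(
    "INSERT INTO pdtd VALUES (?, ?, ?)",
    [("CHEMBL.COMPOUND:25", "DOID:1234", 0.93),
     ("CHEMBL.COMPOUND:77", "DOID:1234", 0.85)],
)

# (specific disease) -> (arbitrary drug)
drugs = conn.execute(
    "SELECT drug, probability FROM pdtd WHERE disease = ? AND probability >= 0.8",
    ("DOID:1234",),
).fetchall()

# (specific drug) -> (arbitrary disease)
diseases = conn.execute(
    "SELECT disease, probability FROM pdtd WHERE drug = ?",
    ("CHEMBL.COMPOUND:25",),
).fetchall()
```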
@chunyuma Sounds good! As long as we can do fast lookups in both directions, whatever database you use is fine by me!
Update: Regarding the build of the predict-drug-treats-disease model database, we might need around 25 days to calculate the probabilities of all drug-disease pairs, since we have 388,966 drugs and 322,871 diseases (around 125B pairs).

I separated the 322,871 diseases (each paired against all drugs) into 1,468 batches of 220 diseases each. Calculating the probabilities for one batch takes around 1515.644 seconds (each drug-disease pair needs (1515.644 * 1000) / (388,966 * 220) = 0.01771 milliseconds). Therefore, to finish all drug-disease pairs, we need 1468 / ((3600 / 1515.644) * 24) = 25.75 days.
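The runtime arithmetic above checks out; here it is spelled out with the quoted numbers:

```python
# Verify the per-pair cost and total runtime estimate.
n_drugs = 388_966
batch_diseases = 220          # diseases per batch
n_batches = 1_468
secs_per_batch = 1515.644     # measured

ms_per_pair = secs_per_batch * 1000 / (n_drugs * batch_diseases)  # ~0.01771 ms
total_days = n_batches * secs_per_batch / 86_400                  # ~25.75 days
```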
Idea: only make predictions for drugs that have a synonym with a curie prefix from a specific list TBD (e.g., ignore CUI, UMLS, etc., but use CHEMBL and/or CHEBI).

...or, in the canonicalized KG2, only use nodes where the curie itself is CHEMBL, CHEBI, or DRUGBANK or something. Maybe we could look at how many drugs/chemical_substances there are in KG2canon for each CURIE prefix? e.g.: CHEMBL.COMPOUND: 15022, CHEBI: 345, DRUGBANK: 923, CUI: 2194287. That might help influence the decision...
here are some counts from the trial full canonicalized build (includes drugs and chemical_substances):

```
[('CHEMBL.COMPOUND', 1820253),
 ('UMLS', 193599),
 ('CHEBI', 117713),
 ('PathWhiz.Compound', 75237),
 ('RXNORM', 11317),
 ('SNOMED', 8668),
 ('DRUGBANK', 3386),
 ('GTPI', 1928),
 ('NCIT', 667),
 ('PathWhiz.Bound', 416),
 ('ttd.target', 331),
 ('PathWhiz.ElementCollection', 254),
 ('PathWhiz.NucleicAcid', 198),
 ('EFO', 40),
 ('FOODON', 13),
 ('GENEPIO', 5),
 ('CHEMBL.TARGET', 5)]
```
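Counts like these can be regenerated from a list of preferred curies with a simple prefix tally; here is a sketch (the sample curies below are hypothetical stand-ins for the canonicalized KG2 node list):

```python
from collections import Counter

# Hypothetical sample of preferred curies for drug/chemical_substance
# synonym groups in the canonicalized KG2.
preferred_curies = [
    "CHEMBL.COMPOUND:25", "CHEMBL.COMPOUND:112", "UMLS:C0004057",
    "CHEBI:15365", "DRUGBANK:DB00945",
]

# Tally the prefix (everything before the first colon) of each curie.
prefix_counts = Counter(curie.split(":")[0] for curie in preferred_curies)
```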
so, interestingly, 82% of the drug/chemical_substance synonym groups have a `CHEMBL.COMPOUND` as their 'preferred' curie...
also, it looks like there are 111,424 nodes with a preferred type of 'disease' in the canonicalized KG2, vs. 311,962 'disease' nodes in the regular KG2 (so that's about 36%).
great, thanks! That's surprising to me! I (clearly from my last post) expected the CHEMBL.COMPOUND and UMLS numbers to be swapped. But, I suppose even discarding 193,000 nodes will help the prediction computation.
@dkoslicki @chunyuma @amykglen is this ok to close or is it still relevant?
From this document, question 3. The purpose of this issue is to: