RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

What drugs could be repurposed to treat disease X? #942

Closed: dkoslicki closed this issue 4 months ago

dkoslicki commented 4 years ago

From this document, question 3. The purpose of this issue is to:

dkoslicki commented 4 years ago

Summary of mini hackathon:

Assume the QG from the ARS will simply look like:

{
   "edges": [
      {
         "id": "qg2",
         "source_id": "qg1",
         "target_id": "qg0"
      }
   ],
   "nodes": [
      {
         "id": "qg0",
         "curie": "DOID:1234"
      },
      {
         "id": "qg1",
         "type": "chemical_substance"
      }
   ]
}

Approach A) (classic current 1-hop lookup)
QG: DOID:1234 ---- chemical_substance
Expand using KG2: gives known indicated/contraindicated drugs.
Can use drug-treats-disease to rank/score these.

Approach B) (????)
QG: DOID:1234 ---- chemical_substance
Expand using the predict-drug-treats-disease model database to find the top N chemical_substances.
Could use NGD to rank/score.

Approach C) (needs qedge_qualifier=uncommon)
QG: DOID:1234 ---- chemical_substance (could have qedge_qualifier=uncommon/speculative)

Approach D) (classic 1-hop, but “exhaustive” query option)
QG: DOID:1234 ---- chemical_substance

Approach F)

DOID:1234 ---- protein ----- chemical_substance
   \                                  /
     \---------- delete --------------/

dkoslicki commented 4 years ago

Action items: create issues for:

chunyuma commented 4 years ago

I'm not sure the predict-drug-treats-disease model database is feasible because it will be so big. Based on my rough calculation (only 539,840,312 drug-disease pairs already occupy 28GB), the final database might be around 5TB, since it involves 388,966 drugs and 322,871 diseases (125,585,841,386 pairs). If we're OK with this size, do we have any servers to store this file?

dkoslicki commented 4 years ago

@chunyuma I see a couple ways to proceed:

  1. Only store those pairs that are above the threshold (0.8 or something like that, whatever the cutoff plot/random-pairs plot indicates). We can assume that anything below this threshold is not predicted to treat, so we can ignore it.
  2. Step 1 above should massively reduce the size of the resulting database (since most drugs don’t treat most diseases), but if not, try looking into HDF5 and its h5py implementation, which is specifically designed to hold insanely large numerical data like this (and we can talk about the correct compression techniques, data types, etc. so it has fast lookup).

In both cases, you will still need to compute all 125B probabilities, but after doing step 1 above, we should only need to save a fraction of those (loop over all pairs, discard the ones below the threshold, and only store the ones above it = never need to hold a 5TB intermediate data structure).
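
A minimal sketch of that streaming filter, just to illustrate the idea (the model object, its predict_proba call, and the table/file names are placeholders, not the actual ARAX code):

import sqlite3

THRESHOLD = 0.8  # cutoff below which we assume "not predicted to treat"

def build_filtered_db(drugs, diseases, model, db_path="pdtd.sqlite"):
    """Stream over all drug-disease pairs, storing only the high-probability ones."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS pdtd (drug TEXT, disease TEXT, probability REAL)")
    buffer = []
    for disease in diseases:
        probs = model.predict_proba(drugs, disease)  # hypothetical: one probability per drug
        buffer.extend((drug, disease, p) for drug, p in zip(drugs, probs) if p >= THRESHOLD)
        if len(buffer) > 100_000:  # flush periodically so memory use stays flat
            con.executemany("INSERT INTO pdtd VALUES (?, ?, ?)", buffer)
            con.commit()
            buffer.clear()
    con.executemany("INSERT INTO pdtd VALUES (?, ?, ?)", buffer)
    con.commit()
    con.close()

This never materializes the full 125B-pair table; only the pairs above the cutoff ever touch disk.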

dkoslicki commented 4 years ago

Also, note: we will want to be able to do expands via PDTD (probability_drug_treats_disease) in both directions: (specific drug) —> (arbitrary disease) and (specific disease) —> (arbitrary drug), so you may need to create two versions of the data (or two relational databases, each of which, when given a specific drug or disease, returns all the diseases or drugs respectively above the threshold, along with the actual probability_treats).

chunyuma commented 4 years ago

@dkoslicki, I think one relational database is sufficient for both directions because of two reasons:

  1. The structure of the database will be like: drug, disease, probability. So if we set 0.8 as the threshold for probability, then it will filter out all pairs below 0.8 for both directions (a sketch of this single-table layout is below).

  2. Since we used the Hadamard product, it doesn't matter which node is the source and which is the target.
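
For concreteness, a sketch of that single table with one index per lookup direction (table, column, and file names here are illustrative, not the actual build code):

import sqlite3

con = sqlite3.connect("pdtd.sqlite")
con.executescript("""
    CREATE TABLE IF NOT EXISTS pdtd (drug TEXT, disease TEXT, probability REAL);
    CREATE INDEX IF NOT EXISTS idx_pdtd_drug ON pdtd (drug);
    CREATE INDEX IF NOT EXISTS idx_pdtd_disease ON pdtd (disease);
""")

# (specific disease) —> (arbitrary drug)
drugs_for_disease = con.execute(
    "SELECT drug, probability FROM pdtd WHERE disease = ?", ("DOID:1234",)).fetchall()

# (specific drug) —> (arbitrary disease)
diseases_for_drug = con.execute(
    "SELECT disease, probability FROM pdtd WHERE drug = ?",
    ("CHEMBL.COMPOUND:CHEMBL25",)).fetchall()

Since every stored pair is already above the 0.8 threshold, both queries return only predicted-treats candidates.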

dkoslicki commented 4 years ago

@chunyuma Sounds good! As long as we can do fast lookups in both directions, whatever database you use is fine by me!

chunyuma commented 4 years ago

Update: Regarding the build of the predict-drug-treats-disease model database, we might need around 25 days to calculate the probabilities of all drug-disease pairs, since we have 388,966 drugs and 322,871 diseases (around 125B pairs).

I split the 322,871 diseases (each paired with all drugs) into 1468 batches of 220 diseases each. Calculating the probabilities for one batch takes around 1515.644 seconds, so each drug-disease pair takes (1515.644 * 1000) / (388966 * 220) = 0.01771 milliseconds. Therefore, to finish all drug-disease pairs, we need 1468 / ((3600 / 1515.644) * 24) = 25.75 days.
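
As a quick sanity check on those numbers (same figures as above, just recomputed):

n_drugs, n_diseases = 388_966, 322_871
batch_seconds, diseases_per_batch, n_batches = 1515.644, 220, 1468

ms_per_pair = batch_seconds * 1000 / (n_drugs * diseases_per_batch)  # ~0.0177 ms per pair
total_days = n_batches * batch_seconds / (3600 * 24)                 # ~25.75 days overall
print(f"{ms_per_pair:.5f} ms per pair, {total_days:.2f} days total")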

dkoslicki commented 4 years ago

Idea: only make predictions for drugs that have a synonym with a curie whose prefix is on a specific list TBD (e.g., ignore CUI, UMLS, etc., but use CHEMBL and/or CHEBI).
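
A minimal sketch of that kind of prefix filter (the allow-list contents and the input format are placeholders; the real list is still TBD):

ALLOWED_PREFIXES = {"CHEMBL.COMPOUND", "CHEBI", "DRUGBANK"}  # hypothetical allow-list

def keep_drug(synonym_curies):
    """Keep a drug if any of its synonym curies has an allowed prefix."""
    return any(curie.split(":", 1)[0] in ALLOWED_PREFIXES for curie in synonym_curies)

# keep_drug(["CHEMBL.COMPOUND:CHEMBL25", "UMLS:C0004057"]) -> True
# keep_drug(["UMLS:C0004057"]) -> False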

edeutsch commented 4 years ago

...or in the canonicalized KG2, only use nodes where the curie itself is CHEMBL, CHEBI, or DRUGBANK or something. Maybe we could look at how many drugs/chemical_substances in KG2canon there are for each CURIE prefix? e.g.:

CHEMBL.COMPOUND: 15022
CHEBI: 345
DRUGBANK: 923
CUI: 2194287

That might help influence the decision.

amykglen commented 4 years ago

here are some counts from the trial full canonicalized build (includes drugs and chemical_substances):

[('CHEMBL.COMPOUND', 1820253), 
('UMLS', 193599), 
('CHEBI', 117713), 
('PathWhiz.Compound', 75237), 
('RXNORM', 11317), 
('SNOMED', 8668), 
('DRUGBANK', 3386), 
('GTPI', 1928), 
('NCIT', 667), 
('PathWhiz.Bound', 416), 
('ttd.target', 331), 
('PathWhiz.ElementCollection', 254), 
('PathWhiz.NucleicAcid', 198), 
('EFO', 40), 
('FOODON', 13), 
('GENEPIO', 5), 
('CHEMBL.TARGET', 5)]

so, interestingly, 82% of the drug/chemical_substance synonym groups have a CHEMBL.COMPOUND as their 'preferred' curie...
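
For reference, a sketch of how counts like these can be tallied from a node dump (the file name and per-line JSON structure are assumptions, not the actual KG2canon format):

import json
from collections import Counter

counts = Counter()
with open("kg2_canonicalized_nodes.jsonl") as f:  # hypothetical dump: one node JSON object per line
    for line in f:
        node = json.loads(line)
        if node.get("type") in ("drug", "chemical_substance"):
            prefix = node["id"].split(":", 1)[0]  # prefix of the preferred curie, e.g. "CHEMBL.COMPOUND"
            counts[prefix] += 1

print(counts.most_common())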

amykglen commented 4 years ago

also, it looks like there are 111,424 nodes with a preferred type of 'disease' in the canonicalized KG2, vs. 311,962 'disease' nodes in the regular KG2 (so that's about 36%).

edeutsch commented 4 years ago

great, thanks! That's surprising to me! I (clearly, from my last post) expected the CHEMBL.COMPOUND and UMLS numbers to be swapped. But I suppose even discarding 193,000 nodes will help the prediction computation.

finnagin commented 2 years ago

@dkoslicki @chunyuma @amykglen is this ok to close or is it still relevant?

dkoslicki commented 4 months ago

Closing due to: https://doi.org/10.1093/gigascience/giad057