Open poleksic opened 4 years ago
This first comment will contain helpful resources. I'll make a second comment to answer more of the questions.
> My question concerns the benchmarking data sets in Fig. 3 of the paper. Are those available for download?

Embedding the figure below for reference:
> Are those available for download?

Yes, I believe `prediction/predictions/probabilities.tsv` is the best place to see the full set of compound-disease pairs, which can be filtered to generate the various benchmark sets.
compound_id | compound_name | disease_id | disease_name | category | status | prior_prob | prediction | training_prediction | compound_percentile | disease_percentile | n_trials | status_trials | status_drugcentral |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DB01048 | Abacavir | DOID:10652 | Alzheimer's disease |  | 0 | 0.004753 | 0.000930405137780005 | 0.00112945581330063 | 0.125 | 0.154746423927178 | 0 | 0 | 0 |
DB05812 | Abiraterone | DOID:10652 | Alzheimer's disease |  | 0 | 0.004753 | 0.00379528958481219 | 0.00460442828313575 | 0.757352941176471 | 0.842652795838752 | 0 | 0 | 0 |
DB00659 | Acamprosate | DOID:10652 | Alzheimer's disease |  | 0 | 0.004753 | 0.0162300916490301 | 0.0196380147334522 | 0.985294117647059 | 0.988296488946684 | 0 | 0 | 0 |
DB00284 | Acarbose | DOID:10652 | Alzheimer's disease |  | 0 | 0.004753 | 0.00146927328449796 | 0.00178340350395021 | 0.595588235294118 | 0.368660598179454 | 0 | 0 | 0 |
DB01193 | Acebutolol | DOID:10652 | Alzheimer's disease |  | 0 | 0.004753 | 0.00177375424093999 | 0.00215284205236242 | 0.772058823529412 | 0.472041612483745 | 0 | 0 | 0 |
In the `probabilities.tsv` table above:

- `status` is the true labels column for "Disease Modifying" in the figure
- `status_trials` is the true labels column for "Clinical Trial"
- `status_drugcentral` is the true labels column for "DrugCentral"

You'll see that the numbers of positives and negatives in the figure match the `n_pos` and `n_neg` values in the cell 14 table in `prediction/4-predictr.ipynb`.
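
As a rough way to check these counts yourself, the sketch below tallies positives, negatives, and missing values for each label column using only Python's standard library. The sample rows are made up for illustration and are not taken from the real file, `count_labels` is a hypothetical helper name, and treating an empty cell or `NA` as missing is an assumption about how the file encodes missing values.

```python
import csv
import io

LABEL_COLUMNS = ["status", "status_trials", "status_drugcentral"]


def count_labels(handle, columns=LABEL_COLUMNS):
    """Tally positives, negatives, and missing values per label column."""
    counts = {c: {"n_pos": 0, "n_neg": 0, "n_missing": 0} for c in columns}
    for row in csv.DictReader(handle, delimiter="\t"):
        for column in columns:
            value = (row.get(column) or "").strip()
            if value in ("", "NA"):
                # Assumption: missing labels are stored as empty cells or "NA".
                counts[column]["n_missing"] += 1
            elif float(value) == 1:
                counts[column]["n_pos"] += 1
            else:
                counts[column]["n_neg"] += 1
    return counts


# Made-up rows for illustration only; they are NOT from probabilities.tsv.
SAMPLE_TSV = (
    "compound_id\tdisease_id\tstatus\tstatus_trials\tstatus_drugcentral\n"
    "C1\tD1\t1\t1\t\n"
    "C2\tD1\t0\t0\t1\n"
    "C3\tD1\t0\t1\t0\n"
    "C4\tD1\t0\t0\t0\n"
)

sample_counts = count_labels(io.StringIO(SAMPLE_TSV))
for column, tally in sample_counts.items():
    print(column, tally)
```

To run this against the real data, pass an open file handle for `probabilities.tsv` in place of the `io.StringIO` sample.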
I created a Python notebook that shows that the columns in `probabilities.tsv` have the same counts of positives and negatives as Figure 3A. It also computes a `status_sym` column, which equates to "Symptomatic" in the figure.
For Symptomatic, positives are compound-disease pairs with a palliates edge in Hetionet. Negatives are all other compound-disease pairs excluding disease-modifying indications. See https://github.com/dhimmel/learn/issues/7.
Figure 3A was made in the notebook `prediction/6-vizr.ipynb`. The relevant code here is:
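```r
grouped_df = prob_df %>%
  # Flag pairs as disease-modifying (DM) or symptomatic (SYM); %in% yields FALSE when category is NA
  dplyr::mutate(DM = category %in% 'DM', SYM = category %in% 'SYM') %>%
  # Keep the network label under a new name so `status` can hold the per-context label
  dplyr::rename(net_status=status) %>%
  # Reshape to long format: one row per compound-disease pair and context
  tidyr::gather(context, status, DM, SYM, status_trials, status_drugcentral) %>%
  # Outside the DM context, keep only pairs whose network status is 0
  dplyr::filter(context == 'DM' | net_status == 0) %>%
  # Drop pairs that have no label in the given context
  dplyr::filter(!is.na(status)) %>%
```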
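
For readers more comfortable with Python, here is a loose translation of the filtering logic in the R excerpt above. It is a sketch, not the notebook's code: `to_contexts` is a hypothetical helper, rows are assumed to be dictionaries with missing values stored as `None`, and the sample rows are made up.

```python
def to_contexts(row):
    """Yield (context, label) pairs for one compound-disease row.

    DM and SYM flags come from `category`. Every context other than DM
    skips rows whose network `status` is not 0, and contexts with a
    missing label are dropped.
    """
    net_status = row["status"]
    labels = {
        "DM": int(row["category"] == "DM"),
        "SYM": int(row["category"] == "SYM"),
        "status_trials": row["status_trials"],
        "status_drugcentral": row["status_drugcentral"],
    }
    for context, label in labels.items():
        if context != "DM" and net_status != 0:
            continue  # mirrors filter(context == 'DM' | net_status == 0)
        if label is None:
            continue  # mirrors filter(!is.na(status))
        yield context, label


# Made-up rows for illustration only.
rows = [
    {"category": "DM", "status": 1, "status_trials": 1, "status_drugcentral": None},
    {"category": "SYM", "status": 0, "status_trials": 0, "status_drugcentral": None},
    {"category": None, "status": 0, "status_trials": 1, "status_drugcentral": 0},
]

for row in rows:
    print(list(to_contexts(row)))
```

The first sample row is a disease-modifying indication, so it only appears in the DM context; this is how such indications end up excluded from the negatives of the other three benchmark sets.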
> I believe I understand how you compile non-indications for "Disease Modifying" dataset. Basically, 208,413 = (1552 - 14) * (137 - 1) - 755 (where 1552 is #compounds, 14 is #disconnected compounds, 137 is #diseases, 1 is # disconnected diseases and 755 is #DM indications).

Yes!
> But, how do you compute the set of non-indications for Drug Central? In particular, where does 207,572 (Fig. 3) come from? Same for Clinical Trials and Symptomatic data sets.

207,572 is the number of DrugCentral negatives and 208 is the number of DrugCentral positives. The remaining 1,388 compound-disease pairs have a missing value for `status_drugcentral`; these represent all indications in PharmacotherapyDB (including DM, SYM, and NOT treatments). I don't recall why we remove NOT treatments from the negatives, but it's a small number of observations.
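
The numbers quoted in this thread can be checked against each other with a few lines of arithmetic. The sketch below assumes that every connected compound-disease pair falls into exactly one of three DrugCentral groups (positive, negative, or missing); the variable names are illustrative.

```python
# Counts quoted in this thread.
n_compounds, n_disconnected_compounds = 1552, 14
n_diseases, n_disconnected_diseases = 137, 1
n_dm_indications = 755
n_drugcentral_positives = 208
n_drugcentral_missing = 1388  # all PharmacotherapyDB indications (DM, SYM, NOT)

# All pairs between connected compounds and connected diseases.
total_pairs = (n_compounds - n_disconnected_compounds) * (
    n_diseases - n_disconnected_diseases
)

# Disease Modifying negatives: every pair that is not a DM indication.
dm_negatives = total_pairs - n_dm_indications

# DrugCentral negatives: every pair that is neither missing nor positive.
drugcentral_negatives = total_pairs - n_drugcentral_missing - n_drugcentral_positives

print(total_pairs)            # 209168
print(dm_negatives)           # 208413
print(drugcentral_negatives)  # 207572
```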
Does that answer everything?
Thank you for a thorough explanation and for pointing to the probabilities.tsv file! On a side note, removing NOT treatments from negatives in DrugCentral is justified as it (similar to the removal of DMs from positives) helps prevent overfitting effect on performance (your classifier is trained to recognize negatives too). Does this make sense (or am I perhaps missing something simple)?
We define NOT treatments as:
> non-indication meaning a drug that neither therapeutically changes the underlying or downstream biology nor treats a significant symptom of the disease.

So compound-disease pairs labeled `NOT` in PharmacotherapyDB are more like negative observations than positive ones. However, because they were included in upstream indication resources, these `NOT` pairs may also have some properties of treatments, just not enough to meet our definition of DM or SYM according to the curators.
> removing NOT treatments from negatives in DrugCentral is justified as it (similar to the removal of DMs from positives) helps prevent overfitting effect on performance

There are only 243 `NOT` compound-disease pairs in PharmacotherapyDB. Since we're using essentially all non-positives as negatives, with the exclusions detailed above, the impact on training / performance of including or not including these 243 pairs as negatives is trivial.
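
To put that in proportion, the sketch below computes what share of the DrugCentral negative set the `NOT` pairs would make up if they were added back. It uses only counts quoted in this thread and says nothing about how the classifier's performance would actually change.

```python
n_not_pairs = 243
n_drugcentral_negatives = 207_572

# Share of the negative set that NOT pairs would form if they were included.
share_if_added = n_not_pairs / (n_drugcentral_negatives + n_not_pairs)
print(f"{share_if_added:.4%}")
```

The share works out to a little over one tenth of one percent of the negatives.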
Hi Daniel, My question concerns the benchmarking data sets in Fig. 3 of the paper "Systematic integration of biomedical knowledge prioritizes drugs for repurposing". Are those available for download? I tried to compile the test data myself using DrugCentral and the other datasets you made available as part of the project. However, I can't get the number of non-indications to match those in Fig 3.
I believe I understand how you compile non-indications for "Disease Modifying" dataset. Basically, 208,413 = (1552 - 14) * (137 - 1) - 755 (where 1552 is #compounds, 14 is #disconnected compounds, 137 is #diseases, 1 is # disconnected diseases and 755 is #DM indications). But, how do you compute the set of non-indications for Drug Central? In particular, where does 207,572 (Fig. 3) come from? Same for Clinical Trials and Symptomatic data sets. Thank you.