dhimmel / learn

Machine learning and feature extraction for the Rephetio project
https://doi.org/10.15363/thinklab.d210

Benchmarking test sets #9

Open poleksic opened 4 years ago

poleksic commented 4 years ago

Hi Daniel, My question concerns the benchmarking data sets in Fig. 3 of the paper "Systematic integration of biomedical knowledge prioritizes drugs for repurposing". Are those available for download? I tried to compile the test data myself using DrugCentral and the other datasets you made available as part of the project. However, I can't get the number of non-indications to match those in Fig 3.

I believe I understand how you compile non-indications for "Disease Modifying" dataset. Basically, 208,413 = (1552 - 14) * (137 - 1) - 755 (where 1552 is #compounds, 14 is #disconnected compounds, 137 is #diseases, 1 is # disconnected diseases and 755 is #DM indications). But, how do you compute the set of non-indications for Drug Central? In particular, where does 207,572 (Fig. 3) come from? Same for Clinical Trials and Symptomatic data sets. Thank you.
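
For reference, that arithmetic can be checked directly; this throwaway Python snippet just restates the numbers above with illustrative variable names:

n_compounds, n_disconnected_compounds = 1552, 14
n_diseases, n_disconnected_diseases = 137, 1
n_dm_indications = 755

n_pairs = (n_compounds - n_disconnected_compounds) * (n_diseases - n_disconnected_diseases)
print(n_pairs)                     # 209168 connected compound-disease pairs
print(n_pairs - n_dm_indications)  # 208413 Disease Modifying non-indications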

dhimmel commented 4 years ago

This first comment will contain helpful resources. I'll make a second comment to answer more of the questions.

My question concerns the benchmarking data sets in Fig. 3 of the paper. Are those available for download?

Embedding the figure below for reference:

[Figure 3 from the paper, showing each benchmark set's positives and negatives]

Are those available for download?

Yes, I believe prediction/predictions/probabilities.tsv is the best place to see the full set of compound-disease pairs, which can be filtered to generate the various benchmark sets; a small loading sketch follows the sample rows below.

compound_id compound_name disease_id disease_name category status prior_prob prediction training_prediction compound_percentile disease_percentile n_trials status_trials status_drugcentral
DB01048 Abacavir DOID:10652 Alzheimer's disease 0 0.004753 0.000930405137780005 0.00112945581330063 0.125 0.154746423927178 0 0 0
DB05812 Abiraterone DOID:10652 Alzheimer's disease 0 0.004753 0.00379528958481219 0.00460442828313575 0.757352941176471 0.842652795838752 0 0 0
DB00659 Acamprosate DOID:10652 Alzheimer's disease 0 0.004753 0.0162300916490301 0.0196380147334522 0.985294117647059 0.988296488946684 0 0 0
DB00284 Acarbose DOID:10652 Alzheimer's disease 0 0.004753 0.00146927328449796 0.00178340350395021 0.595588235294118 0.368660598179454 0 0 0
DB01193 Acebutolol DOID:10652 Alzheimer's disease 0 0.004753 0.00177375424093999 0.00215284205236242 0.772058823529412 0.472041612483745 0 0 0
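
A minimal pandas sketch (not code from the repository) of how probabilities.tsv could be loaded and its benchmark label columns tallied; the column semantics are my reading of the sample above and the comments below:

import pandas as pd

# Load the full prediction table (path as referenced above).
prob_df = pd.read_csv('prediction/predictions/probabilities.tsv', sep='\t')

# Disease Modifying: `status` appears to be 1 for DM indications, 0 otherwise.
print(prob_df['status'].value_counts(dropna=False))

# Clinical Trials and DrugCentral: 1 = positive, 0 = negative,
# missing = pair excluded from that benchmark (see the later comments).
print(prob_df['status_trials'].value_counts(dropna=False))
print(prob_df['status_drugcentral'].value_counts(dropna=False))
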
dhimmel commented 4 years ago

Regarding the probabilities.tsv table above: you'll see that the numbers of positives and negatives in the figure match the n_pos and n_neg values in the cell 14 table of prediction/4-predictr.ipynb.

I created a Python notebook that shows that the columns in probabilities.tsv have the same counts of positives and negatives as Figure 3A. It also computes a status_sym column which equates to "Symptomatic" in the figure.

For Symptomatic, positives are compound-disease pairs with a palliates edge in Hetionet. Negatives are all other compound-disease pairs excluding disease-modifying indications. See https://github.com/dhimmel/learn/issues/7.
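
A hedged sketch of how such a status_sym column could be derived from probabilities.tsv, assuming the category column carries the PharmacotherapyDB label (DM or SYM) where one exists; this is an illustration, not the notebook's actual code:

import numpy as np
import pandas as pd

prob_df = pd.read_csv('prediction/predictions/probabilities.tsv', sep='\t')

# Positives: SYM (palliates) pairs. Excluded: DM indications, so they are not
# counted as negatives. Everything else (including NOT pairs) is a negative.
prob_df['status_sym'] = np.select(
    [prob_df['category'] == 'SYM', prob_df['category'] == 'DM'],
    [1.0, np.nan],
    default=0.0,
)
print(prob_df['status_sym'].value_counts(dropna=False))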

Figure 3A was made in the notebook prediction/6-vizr.ipynb. The relevant code here is:

grouped_df = prob_df %>%
  # indicator columns for the PharmacotherapyDB categories
  dplyr::mutate(DM = category %in% 'DM', SYM = category %in% 'SYM') %>%
  # keep the Hetionet training status under a separate name
  dplyr::rename(net_status=status) %>%
  # reshape to one row per compound-disease pair per benchmark context
  tidyr::gather(context, status, DM, SYM, status_trials, status_drugcentral) %>%
  # outside the DM benchmark, exclude DM indications (the training positives)
  dplyr::filter(context == 'DM' | net_status == 0) %>%
  # drop pairs excluded from the given benchmark
  dplyr::filter(!is.na(status)) %>%
  # ... (snippet truncated; see prediction/6-vizr.ipynb for the full pipeline)
dhimmel commented 4 years ago

I believe I understand how you compile non-indications for "Disease Modifying" dataset. Basically, 208,413 = (1552 - 14) * (137 - 1) - 755 (where 1552 is #compounds, 14 is #disconnected compounds, 137 is #diseases, 1 is # disconnected diseases and 755 is #DM indications).

Yes!

But, how do you compute the set of non-indications for Drug Central? In particular, where does 207,572 (Fig. 3) come from? Same for Clinical Trials and Symptomatic data sets.

207,572 is the number of DrugCentral negatives, and 208 is the number of DrugCentral positives. The remaining 1,388 compound-disease pairs have a missing value for status_drugcentral, representing all indications in PharmacotherapyDB (including DM, SYM, and NOT treatments). I don't recall why we removed NOT treatments from the negatives, but it's a small number of observations.
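
Spelling out the arithmetic (illustrative only): the DrugCentral labels partition the same 209,168 connected compound-disease pairs as above.

# 1538 connected compounds x 136 connected diseases = 209,168 pairs, split into
# DrugCentral positives, DrugCentral negatives, and PharmacotherapyDB indications
# (DM, SYM, and NOT) that are excluded from this benchmark.
n_pairs = (1552 - 14) * (137 - 1)       # 209168
n_drugcentral_positives = 208
n_pharmacotherapydb_indications = 1388  # DM + SYM + NOT pairs
print(n_pairs - n_drugcentral_positives - n_pharmacotherapydb_indications)  # 207572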

Does that answer everything?

poleksic commented 4 years ago

Thank you for the thorough explanation and for pointing to the probabilities.tsv file! On a side note, removing NOT treatments from the negatives in DrugCentral is justified, as it (similar to the removal of DMs from positives) helps prevent an overfitting effect on performance (your classifier is trained to recognize negatives too). Does this make sense (or am I perhaps missing something simple)?

dhimmel commented 4 years ago

We define NOT treatments as:

non-indication meaning a drug that neither therapeutically changes the underlying or downstream biology nor treats a significant symptom of the disease.

So compound-disease pairs labeled NOT in PharmacotherapyDB are closer to negative observations than to positives. However, because they were included in upstream indication resources, these NOT pairs may still have some properties of treatments, just not enough to meet our definition of DM or SYM according to the curators.

removing NOT treatments from negatives in DrugCentral is justified as it (similar to the removal of DMs from positives) helps prevent overfitting effect on performance

There are only 243 NOT compound-disease pairs in PharmacotherapyDB. Since we're using essentially all non-positives as negatives, with the exclusions detailed above, the impact on training / performance of including or excluding these 243 pairs as negatives is trivial (243 out of roughly 208,000 candidate negatives, i.e. about 0.1%).