dhimmel / learn

Machine learning and feature extraction for the Rephetio project
https://doi.org/10.15363/thinklab.d210

Benchmarking test sets #9

Open poleksic opened 4 years ago

poleksic commented 4 years ago

Hi Daniel, My question concerns the benchmarking data sets in Fig. 3 of the paper "Systematic integration of biomedical knowledge prioritizes drugs for repurposing". Are those available for download? I tried to compile the test data myself using DrugCentral and the other datasets you made available as part of the project. However, I can't get the number of non-indications to match those in Fig 3.

I believe I understand how you compile non-indications for "Disease Modifying" dataset. Basically, 208,413 = (1552 - 14) * (137 - 1) - 755 (where 1552 is #compounds, 14 is #disconnected compounds, 137 is #diseases, 1 is # disconnected diseases and 755 is #DM indications). But, how do you compute the set of non-indications for Drug Central? In particular, where does 207,572 (Fig. 3) come from? Same for Clinical Trials and Symptomatic data sets. Thank you.
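
For reference, that arithmetic can be checked directly; this throwaway Python snippet just restates the numbers above with illustrative variable names:

n_compounds, n_disconnected_compounds = 1552, 14
n_diseases, n_disconnected_diseases = 137, 1
n_dm_indications = 755

n_pairs = (n_compounds - n_disconnected_compounds) * (n_diseases - n_disconnected_diseases)
print(n_pairs)                     # 209168 connected compound-disease pairs
print(n_pairs - n_dm_indications)  # 208413 Disease Modifying non-indications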

dhimmel commented 4 years ago

This first comment will contain helpful resources. I'll make a second comment to answer more of the questions.

My question concerns the benchmarking data sets in Fig. 3 of the paper. Are those available for download?

Embedding the figure below for reference:

[Figure 3 from the paper, showing each benchmark set's positives and negatives]

Are those available for download?

Yes, I believe prediction/predictions/probabilities.tsv is the best place to see the full set of compound-disease pairs, which can be filtered to generate the various benchmark sets; a small loading sketch follows the sample rows below.

compound_id compound_name disease_id disease_name category status prior_prob prediction training_prediction compound_percentile disease_percentile n_trials status_trials status_drugcentral
DB01048 Abacavir DOID:10652 Alzheimer's disease 0 0.004753 0.000930405137780005 0.00112945581330063 0.125 0.154746423927178 0 0 0
DB05812 Abiraterone DOID:10652 Alzheimer's disease 0 0.004753 0.00379528958481219 0.00460442828313575 0.757352941176471 0.842652795838752 0 0 0
DB00659 Acamprosate DOID:10652 Alzheimer's disease 0 0.004753 0.0162300916490301 0.0196380147334522 0.985294117647059 0.988296488946684 0 0 0
DB00284 Acarbose DOID:10652 Alzheimer's disease 0 0.004753 0.00146927328449796 0.00178340350395021 0.595588235294118 0.368660598179454 0 0 0
DB01193 Acebutolol DOID:10652 Alzheimer's disease 0 0.004753 0.00177375424093999 0.00215284205236242 0.772058823529412 0.472041612483745 0 0 0
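
A minimal pandas sketch (not code from the repository) of how probabilities.tsv could be loaded and its benchmark label columns tallied; the column semantics are my reading of the sample above and the comments below:

import pandas as pd

# Load the full prediction table (path as referenced above).
prob_df = pd.read_csv('prediction/predictions/probabilities.tsv', sep='\t')

# Disease Modifying: `status` appears to be 1 for DM indications, 0 otherwise.
print(prob_df['status'].value_counts(dropna=False))

# Clinical Trials and DrugCentral: 1 = positive, 0 = negative,
# missing = pair excluded from that benchmark (see the later comments).
print(prob_df['status_trials'].value_counts(dropna=False))
print(prob_df['status_drugcentral'].value_counts(dropna=False))
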
dhimmel commented 4 years ago

Regarding the probabilities.tsv table above: you'll see that the numbers of positives and negatives in the figure match the n_pos and n_neg values in the cell 14 table of prediction/4-predictr.ipynb.

I created a Python notebook that shows that the columns in probabilities.tsv have the same counts of positives and negatives as Figure 3A. It also computes a status_sym column which equates to "Symptomatic" in the figure.

For Symptomatic, positives are compound-disease pairs with a palliates edge in Hetionet. Negatives are all other compound-disease pairs excluding disease-modifying indications. See https://github.com/dhimmel/learn/issues/7.
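
A hedged sketch of how such a status_sym column could be derived from probabilities.tsv, assuming the category column carries the PharmacotherapyDB label (DM or SYM) where one exists; this is an illustration, not the notebook's actual code:

import numpy as np
import pandas as pd

prob_df = pd.read_csv('prediction/predictions/probabilities.tsv', sep='\t')

# Positives: SYM (palliates) pairs. Excluded: DM indications, so they are not
# counted as negatives. Everything else (including NOT pairs) is a negative.
prob_df['status_sym'] = np.select(
    [prob_df['category'] == 'SYM', prob_df['category'] == 'DM'],
    [1.0, np.nan],
    default=0.0,
)
print(prob_df['status_sym'].value_counts(dropna=False))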

Figure 3A was made in the notebook prediction/6-vizr.ipynb. The relevant code here is:

grouped_df = prob_df %>%
  # indicator columns for the PharmacotherapyDB categories
  dplyr::mutate(DM = category %in% 'DM', SYM = category %in% 'SYM') %>%
  # keep the Hetionet training status under a separate name
  dplyr::rename(net_status=status) %>%
  # reshape to one row per compound-disease pair per benchmark context
  tidyr::gather(context, status, DM, SYM, status_trials, status_drugcentral) %>%
  # outside the DM benchmark, exclude DM indications (the training positives)
  dplyr::filter(context == 'DM' | net_status == 0) %>%
  # drop pairs excluded from the given benchmark
  dplyr::filter(!is.na(status)) %>%
  # ... (snippet truncated; see prediction/6-vizr.ipynb for the full pipeline)
dhimmel commented 4 years ago

I believe I understand how you compile non-indications for "Disease Modifying" dataset. Basically, 208,413 = (1552 - 14) * (137 - 1) - 755 (where 1552 is #compounds, 14 is #disconnected compounds, 137 is #diseases, 1 is # disconnected diseases and 755 is #DM indications).

Yes!

But, how do you compute the set of non-indications for Drug Central? In particular, where does 207,572 (Fig. 3) come from? Same for Clinical Trials and Symptomatic data sets.

207,572 is the number of DrugCentral negatives, and 208 is the number of DrugCentral positives. The remaining 1,388 compound-disease pairs have a missing value for status_drugcentral, representing all indications in PharmacotherapyDB (including DM, SYM, and NOT treatments). I don't recall why we removed NOT treatments from the negatives, but it's a small number of observations.
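
Spelling out the arithmetic (illustrative only): the DrugCentral labels partition the same 209,168 connected compound-disease pairs as above.

# 1538 connected compounds x 136 connected diseases = 209,168 pairs, split into
# DrugCentral positives, DrugCentral negatives, and PharmacotherapyDB indications
# (DM, SYM, and NOT) that are excluded from this benchmark.
n_pairs = (1552 - 14) * (137 - 1)       # 209168
n_drugcentral_positives = 208
n_pharmacotherapydb_indications = 1388  # DM + SYM + NOT pairs
print(n_pairs - n_drugcentral_positives - n_pharmacotherapydb_indications)  # 207572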

Does that answer everything?

poleksic commented 4 years ago

Thank you for the thorough explanation and for pointing to the probabilities.tsv file! On a side note, removing NOT treatments from the negatives in DrugCentral is justified, as it (similar to the removal of DMs from positives) helps prevent an overfitting effect on performance (your classifier is trained to recognize negatives too). Does this make sense (or am I perhaps missing something simple)?

dhimmel commented 4 years ago

We define NOT treatments as:

non-indication meaning a drug that neither therapeutically changes the underlying or downstream biology nor treats a significant symptom of the disease.

So compound-disease pairs labeled NOT in PharmacotherapyDB are closer to negative observations than to positives. However, because they were included in upstream indication resources, these NOT pairs may still have some properties of treatments, just not enough to meet our definition of DM or SYM according to the curators.

removing NOT treatments from negatives in DrugCentral is justified as it (similar to the removal of DMs from positives) helps prevent overfitting effect on performance

There are only 243 NOT compound-disease pairs in PharmacotherapyDB. Since we're using essentially all non-positives as negatives, with the exclusions detailed above, the impact on training / performance of including or excluding these 243 pairs as negatives is trivial (243 out of roughly 208,000 candidate negatives, i.e. about 0.1%).