broadinstitute / lincs-profiling-complementarity

Analyzing and comparing signal found in different profiling technologies
BSD 3-Clause "New" or "Revised" License

Expanded prediction analysis #60

Open gwaybio opened 2 years ago

gwaybio commented 2 years ago

We received reviews back from the journal, and one suggestion was for us to expand the machine learning prediction analysis.

Currently, we are using both L1000 and Cell Painting data to predict compound MOA. The reviewer asked us to also predict compound gene targets and compound gene target pathways.

I think this is a great idea!

I performed the first step of this analysis in #59 - generating the X and Y matrices required to train our models and evaluate predictions. For example, the updated training data for Cell Painting is here: https://github.com/broadinstitute/lincs-profiling-complementarity/tree/master/2.MOA-prediction/2.data_split/model_data/cp

Next Step

The next step in this analysis is to run these matrices through our machine learning pipeline and return results for plotting. Currently, the pipeline trains several multi-class machine learning models to predict compound MOA. We need to modify this pipeline to also predict compound gene target and compound gene target pathway.

I also think that we need to modify the pipeline to train single-class machine learning models. There are about 30,000 unique pathways, and given our sample size, a multi-class model over that many labels seems infeasible. We can then pass our three different Y matrices (per assay) through this single-class pipeline.
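As a rough sketch of what a single-class pipeline could look like, the example below fits one independent binary classifier per label column and collects the positive-class probabilities. It assumes that "single-class" means one binary model per label, uses scikit-learn's `LogisticRegression` purely as a stand-in (it is not the set of models in our current pipeline), and runs on synthetic data. The function names, the `min_positives` cutoff, and the `feature_*` / `label_*` columns are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def train_single_class_models(X, Y, min_positives=2):
    """Fit one independent binary classifier per label column of Y.

    X is a samples-by-features DataFrame and Y is a samples-by-labels
    binary DataFrame. Labels with fewer than `min_positives` positives,
    or with no negatives at all, are skipped because they cannot be fit.
    """
    models = {}
    for label in Y.columns:
        y = Y[label].to_numpy()
        n_positive = int(y.sum())
        if n_positive < min_positives or n_positive == len(y):
            continue
        model = LogisticRegression(max_iter=1000)  # stand-in model (assumption)
        model.fit(X.to_numpy(), y)
        models[label] = model
    return models


def predict_probabilities(models, X):
    """Return a samples-by-labels DataFrame of positive-class probabilities."""
    probabilities = {
        label: model.predict_proba(X.to_numpy())[:, 1]
        for label, model in models.items()
    }
    return pd.DataFrame(probabilities, index=X.index)


# Synthetic stand-in for one assay's X matrix and one Y matrix.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(40, 5)),
    columns=[f"feature_{i}" for i in range(5)],
)
Y = pd.DataFrame(
    {
        "label_a": (X["feature_0"] > 0).astype(int),
        "label_b": (X["feature_1"] > 0).astype(int),
        "label_rare": [1] + [0] * 39,  # only one positive, so it is skipped
    }
)

models = train_single_class_models(X, Y)
probabilities = predict_probabilities(models, X)
```

The same two functions could be called once per Y matrix and per assay, with a shuffled copy of X for the negative control.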

Output

We need performance metrics for each model, plus metadata indicating the model, the input data, single-class vs. multi-class, shuffled status, and the prediction task.

It would also be great to output matrices of probabilities per compound by label (either MOA, target, or pathway) per assay, model, single-class/multi-class, and shuffled status.
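One possible layout for these outputs, sketched with pandas: attach the same metadata columns to both the metrics table and a long-format version of the probability matrix, so that runs can be concatenated and filtered for plotting. The column names (`assay`, `model`, `class_mode`, `shuffled`, `target`), the `annotate` helper, and every value below are placeholders for illustration, not real results or an agreed-upon schema.

```python
import pandas as pd

METADATA_COLUMNS = ["assay", "model", "class_mode", "shuffled", "target"]


def annotate(frame, assay, model, class_mode, shuffled, target):
    """Return a copy of `frame` with one column per piece of run metadata."""
    return frame.assign(
        assay=assay,
        model=model,
        class_mode=class_mode,
        shuffled=shuffled,
        target=target,
    )


# Placeholder metrics for one run (values are made up).
metrics = pd.DataFrame(
    {
        "label": ["label_a", "label_b"],
        "metric": ["pr_auc", "pr_auc"],
        "value": [0.5, 0.25],
    }
)
real_run = annotate(metrics, "cell_painting", "model_x", "single-class", False, "pathway")
shuffled_run = annotate(metrics, "cell_painting", "model_x", "single-class", True, "pathway")
results = pd.concat([real_run, shuffled_run], ignore_index=True)

# Placeholder compound-by-label probability matrix, reshaped to long format.
probabilities = pd.DataFrame(
    {"label_a": [0.9, 0.1], "label_b": [0.2, 0.8]},
    index=pd.Index(["compound_1", "compound_2"], name="compound"),
)
long_probabilities = probabilities.reset_index().melt(
    id_vars="compound", var_name="label", value_name="probability"
)
long_probabilities = annotate(
    long_probabilities, "cell_painting", "model_x", "single-class", False, "pathway"
)
```

Keeping everything in long format means one file per output type is enough, and the plotting code can facet on any metadata column.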

gwaybio commented 2 years ago

Our current figure 5, which visualizes the results for the multi-class MOA predictions across models, is here: https://github.com/broadinstitute/lincs-profiling-complementarity/blob/master/6.paper_figures/figure5.ipynb

It might be helpful for the output data to mirror the data frame we use for plotting in this notebook. (Note: we will need more metadata in the updated version!)

AdeboyeML commented 2 years ago

Yeah, expanding the multi-label predictive analysis will be great, but that will depend on the new datasets (gene targets and gene pathways) and how different or similar their features are to the existing datasets we used for the multi-label prediction.

I will go through the new datasets this week to see what they look like and what I need to modify in the machine learning pipelines.

gwaybio commented 2 years ago

Thanks for going through the code, determining next steps, and meeting with me this afternoon, @AdeboyeML.

I heard your concern that >5,000 GO terms is likely to make multi-label classification difficult. Therefore, in #67, I filtered out GO terms that had fewer than 20 compounds. Most GO terms had only 1 compound, so this filtering step drastically reduced the set to 772 GO terms, which is on the same order of magnitude as the MOA prediction task.
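For reference, a filter of this kind can be sketched in a few lines of pandas. This is an illustration only, not the code in #67: it assumes the Y matrix is a binary compound-by-label data frame with one row per compound, and the function name, the toy GO identifiers, and the compound names are hypothetical.

```python
import pandas as pd


def filter_rare_labels(y, min_compounds=20):
    """Keep only label columns annotated to at least `min_compounds` compounds.

    Assumes `y` is a binary matrix with one row per compound, so the column
    sum equals the number of compounds carrying that label.
    """
    compounds_per_label = y.sum(axis=0)
    return y.loc[:, compounds_per_label >= min_compounds]


# Toy example with a lower threshold so the effect is visible.
y = pd.DataFrame(
    {
        "GO:A": [1, 1, 1, 0],
        "GO:B": [1, 0, 0, 0],
        "GO:C": [1, 1, 0, 0],
    },
    index=["compound_1", "compound_2", "compound_3", "compound_4"],
)
filtered = filter_rare_labels(y, min_compounds=2)
```

If the matrix instead has one row per well or replicate, the counts would need to be taken over unique compounds first.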