Added consensus moa analysis files

AdeboyeML commented 3 years ago

@gway I have added the jupyter notebook that contains the code for the analysis on the consensus datasets to get the moa sizes/values per doses(1-7), I have also included the results of the analysis i.e. the heatmap plots and the csv files (folders).

gwaybio commented 3 years ago

this looks great @AdeboyeML - a couple of immediate things to change, before I take a deeper dive.

Let's move this entire analysis into a folder called 1.data-exploration.
Let's be sure to keep data extensions on csv files. For example, moa_sizes_consensus_datasets/consensus_median_moa_sizes should be moa_sizes_consensus_datasets/consensus_median_moa_sizes.csv.

The reason for 1 is that we want to perform several analyses in this repo, and this is just one component. Each new analysis will belong in an analysis submodule.

AdeboyeML commented 3 years ago

@gwaygenomics Thanks for the observations.

I have created the 1.Data-exploration folder (putting all files and sub-folders into the folder), and also updated the notebook with the null distribution analysis and I have also saved all the dataframe outputs as .csv files.

The .csv files has both the median_scores and p_values dataframes for each consensus dataset per dose (for each of the 231 MOAs with more than one compound).

gwaybio commented 3 years ago

Looks great @AdeboyeML - sorry this has taken me so long to review. I am just about ready to approve.

Two major things:

It looks like there is a large number of cells in the notebook that are not executed. Can you either execute the cell or delete the cells? Jupyter notebooks need to be executed sequentially, and if I am viewing a notebook without sequential execution, I have extreme doubts about the data analysis. (not saying that I doubt this one, just in general it is suboptimal practice)
Can you add in the figures you generated in https://github.com/broadinstitute/lincs-profiling-comparison/issues/2#issuecomment-723839481? They look great, and are not currently captured in this PR. I would also switch to histograms from density plots (in the seaborn jointplot) since the density tails extend far beyond p < 0 (which doesn't make sense).

Other pull request advice

I definitely appreciate the scope of this PR. It is not huge. I have reviewed big pull requests in the past and they are not fun! However, the jupyter notebook is quite long. In the future, we should consider including fewer analyses in a single jupyter notebook, and instead number the notebooks if they require sequential execution. (e.g. consensus_moa_sizes_per_dose_analysis.ipynb becomes 0.consensus_moa_sizes_per_dose_analysis.ipynb and one other script to visualize results, or something)
It is also good practice to include the nbconverted file from the .ipynb files. These nbconverted files can be generated with the following command: jupyter nbconvert --to=script --FilesWriter.build_directory=scripts/nbconverted *.ipynb. The nbconverted files are so much easier to review than .ipynb files.

Advice for myself

I think I might be needlessly holding you up on performing the next analyses, and I don't want to be! We value good and reproducible code extremely highly, but I need to be aware of when this search for perfection slows things up too much!

shntnu commented 3 years ago

Dropping in with a minor comment: Please add the column name for the first column (moa) to the CSV files e.g. https://raw.githubusercontent.com/broadinstitute/lincs-profiling-comparison/9bc5db8167674e2c8bec5cee3fcc043117acfbf6/1.Data-exploration/moa_sizes_consensus_datasets/median_dmso_null_p_values.csv

gwaybio commented 3 years ago

@AdeboyeML - we should prioritize merging in this contribution early next week. Here is a todo list:

[ ] Remove unexecuted cells and non-essential comments
[ ] Make sure all cells are executed in order
[ ] Make sure all columns have names in all data that is output (nice catch @shntnu)
[ ] Confirm null distribution (maybe exclude this for now - can you add an issue stub saying this will be added in the future)
[ ] Split the long notebook into four (or five) different, independent notebooks

I see people forgetting to add column names for indices all the time, and it bugs me! It is also a very easy fix:

data_moa_values = data_moa_values.reset_index().rename({"index": "moa"}, axis="columns")
save_to_csv(data_moa_values, 'moa_sizes_consensus_datasets', 'modz_moa_median_scores.csv')

AdeboyeML commented 3 years ago

@gwaygenomics - thanks for the observations and major points highlighted.

I have done all necessary corrections to the jupyter notebooks and csv files. I have re-pushed new notebooks and files to the lincs-profiling-comparison repo containing notebooks for different profiling technologies and have also created new independent notebooks for different use cases.

gwaybio commented 3 years ago

once you address these three things:

[ ] Add citation
[ ] Remove .ipynb_checkpoints file and add to .gitignore file (example here)
[ ] Add .py files for corresponding .ipynb files

Go ahead and merge.

Nice work! 🎉

AdeboyeML commented 3 years ago

@gwaygenomics all highlighted points have been addressed and corrected.

gwaybio commented 3 years ago

LGTM 👍 please merge at will

AdeboyeML commented 3 years ago

@gwaygenomics I do not think I have the permission to merge the pull request.

Only those with write access to this repository can merge pull requests.

gwaybio commented 3 years ago

ah, right. Just made you admin. You need to accept in an email, then you should be able to merge. Thanks for pointing this out!

broadinstitute / lincs-profiling-complementarity