Required Steps for Depositing Profiles

gwaybio commented 4 years ago

I am working towards processing all Drug Repurposing data and adding the results to this repository. The cell health project (https://github.com/broadinstitute/cell-health) now requires that the data be uniformly processed, documented, and made available here.

Below, I outline the steps required to get the data and processing pipelines uploaded.

1. Make sure there are only small floating point differences between cytominer-derived profiles and pycytominer-derived profiles.
   - We are discussing this in #3
   - I noted a potential discrepancy in cytominer-based documentation that needs addressing
2. Implement broad sample specific annotations
   - This implementation is a work in progress here: cytomining/pycytominer#73
   - @shntnu I will likely need some guidance on this specific point
3. Rerun the "all" profiles pipeline described in broadinstitute/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad#3 (currently a private repo)
   - This needs to be rerun with the updated robustize_mad normalization strategy, which will also require a decision on whole-plate or DMSO-specific normalization.
4. Rerun 4.apply module in cell-health
   - Only after steps 1-3 are complete can I rerun the 4.apply module
   - I will explore whether or not to make the lincs-cell-painting profile repository a submodule of the cell-health project
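
A minimal sketch of what the floating point check in the first step could look like. It uses only the Python standard library and is not the comparison actually run for this repository. The feature names, the toy values, and the `abs_tol` threshold are illustrative assumptions.

```python
import math


def max_abs_difference(profile_a, profile_b):
    """Return the largest absolute per-feature difference between two profiles.

    Each profile is a dict mapping feature name -> float, and both must
    contain exactly the same features.
    """
    if profile_a.keys() != profile_b.keys():
        raise ValueError("profiles do not share the same feature set")
    return max((abs(profile_a[k] - profile_b[k]) for k in profile_a), default=0.0)


def profiles_match(profile_a, profile_b, abs_tol=1e-8):
    """True when every feature agrees within abs_tol (an assumed tolerance)."""
    if profile_a.keys() != profile_b.keys():
        return False
    return all(
        math.isclose(profile_a[k], profile_b[k], rel_tol=0.0, abs_tol=abs_tol)
        for k in profile_a
    )


# Toy values standing in for the same well processed by each tool.
cytominer_profile = {"Cells_AreaShape_Area": 0.1234567891, "Nuclei_Intensity_Mean": -1.5}
pycytominer_profile = {"Cells_AreaShape_Area": 0.1234567892, "Nuclei_Intensity_Mean": -1.5}

print(profiles_match(cytominer_profile, pycytominer_profile))
print(max_abs_difference(cytominer_profile, pycytominer_profile))
```

Reporting the maximum difference alongside the pass/fail result makes it easier to decide whether a mismatch is rounding noise or a real discrepancy between the two pipelines.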

shntnu commented 4 years ago

> Implement broad sample specific annotations
>
> This implementation is a work in progress here cytomining/pycytominer#73
>
> @shntnu I will likely need some guidance on this specific point

@gwaygenomics Can you remind me about the input you need on this? I'll use cytotools/annotate as a reference to provide inputs.

gwaybio commented 4 years ago

> Can you remind me about the input you need on this? I'll use cytotools/annotate as a reference to provide inputs.

Ah, that is a good reference, thanks for the pointer.

I wasn't sure about the cytominer strategy of splitting core functionality from cyto-specific functionality, so I put cytominer progress on hold. The primary reason for putting it on hold was so that the lincs data could be processed with a more stable (and thus more reproducible) tool.

However, it sounds like cytominer (and pycytominer) will take longer to stabilize than we can wait for the lincs profiles. A potential intermediate solution could be to freeze a pycytominer version using conda (after confirming floating point differences) for lincs-specific processing. What do you think?

shntnu commented 4 years ago

> Rerun the "all" profiles pipeline described in broadinstitute/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad#3 (currently a private repo)
>
> This needs to be rerun with the updated robustize_mad normalization strategy, which will also require a decision on whole-plate or DMSO-specific normalization.

Going forward, we will very likely produce at least two different Level 4a profiles

- whole-well z-scored
- DMSO z-scored

because depending on the layout, one might be better than the other.

We will then produce corresponding 4b (normalized feature selected) versions of the two 4a profiles.

We will also produce corresponding 4w (normalized and whitened) versions of the two 4a profiles.

Which among these profiles are best for an application is still an open research question. But until then, we just produce them all.

@gwaygenomics Does that sound reasonable?

This does complicate the analysis for cell-health because you now need to decide which of the two 4a profiles you should use for predictions. For that case, I'd go with whole-plate because that makes it similar to the way you've processed the CRISPR data IIRC.
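
To make the difference between the two Level 4a variants concrete, here is a rough standard-library sketch of a median/MAD ("robustize"-style) normalization applied once with all wells as the reference and once with only DMSO wells as the reference. It is not the cytominer or pycytominer implementation. The `MAD_SCALE` constant, the `epsilon` guard, the field names, and the toy plate are all assumptions made for illustration.

```python
from statistics import median

# Assumption: scale the MAD so it is comparable to a standard deviation
# for normally distributed data.
MAD_SCALE = 1.4826


def robustize(values, reference, epsilon=1e-18):
    """Center and scale `values` by the median and MAD of `reference`."""
    center = median(reference)
    mad = median(abs(v - center) for v in reference) * MAD_SCALE
    # epsilon (assumed) avoids division by zero when the reference is constant.
    return [(v - center) / (mad + epsilon) for v in values]


# Toy plate: one feature measured in five wells (hypothetical field names).
plate = [
    {"well": "A01", "treatment": "DMSO", "feature": 1.0},
    {"well": "A02", "treatment": "DMSO", "feature": 2.0},
    {"well": "A03", "treatment": "DMSO", "feature": 3.0},
    {"well": "A04", "treatment": "compound_x", "feature": 8.0},
    {"well": "A05", "treatment": "compound_y", "feature": 10.0},
]

values = [w["feature"] for w in plate]
dmso_values = [w["feature"] for w in plate if w["treatment"] == "DMSO"]

# Variant 1: every well on the plate defines the center and spread.
whole_plate_normalized = robustize(values, reference=values)
# Variant 2: only the DMSO control wells define the center and spread.
dmso_normalized = robustize(values, reference=dmso_values)

print(whole_plate_normalized)
print(dmso_normalized)
```

In this toy plate the treated wells pull the whole-plate median and MAD upward, so the same compound wells score lower under whole-plate normalization than under DMSO normalization. That is one way the plate layout can make one variant preferable to the other.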

shntnu commented 4 years ago

> A potential intermediate solution could be to freeze a pycytominer version using conda (after confirming floating point differences) for lincs-specific processing. What do you think?

That sounds good to me, and will very likely be the strategy we will use for all data processing using pycytominer, right?

gwaybio commented 4 years ago

@shntnu and I chatted about this offline. I will summarize our decisions below:

- I will confirm floating point differences in pycytominer (compared to current cytominer profiles)
- I will apply the two normalization schemes (whole-well and DMSO)
- These two normalization schemes will propagate to two separate feature selected files and two separate consensus files

Also, here are answers to the specific questions:

> For that case, I'd go with whole-plate because that makes it similar to the way you've processed the CRISPR data IIRC

I normalize profiles by EMPTY CRISPR perturbations. See here.

> That sounds good to me, and will very likely be the strategy we will use for all data processing using pycytominer, right?

Similar, but not exactly the same. Eventually pycytominer will be traditionally versioned on pypi and conda. Currently, pycytominer is versioned by github hash (see here). It is also worth noting that we can always reprocess the profiles again. This is the beauty of versioned data!

gwaybio commented 4 years ago

@shntnu I have a couple of follow-up questions now that I've started adding the processing code in #21 (cc @niranjchandrasekaran)

Question 1 - Should we use z-score normalization or `robustize_mad`?

> Going forward, we will very likely produce at least two different Level 4a profiles
>
> - whole-well z-scored
> - DMSO z-scored
>
> because depending on the layout, one might be better than the other.

The default in cytominer_scripts/normalize.R is robustize. I assume that I should continue using this method.

Question 2 - Is it ok to leave the whitened version for a future update?

> We will also produce corresponding 4w (normalized and whitened) versions of the two 4a profiles.

Pycytominer currently does have a whiten implementation, and I applied it to the two 4a profiles in a test case. The test case did not go smoothly, so it is likely I will need to tinker with the pycytominer implementation a bit (hard to estimate how long the delay will be).

Question 3 - How should I form the level 5 consensus data?

My current plan is as follows:

- Process each plate independently
- Generate an across-plate consensus signature on broad_sample and dose.
- The consensus signature will be based on median
- Output one single file for the full consensus signature
- Output a separate file for a feature selected consensus signature (derived after calculating consensus)
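
As a rough illustration of the plan above, the sketch below collapses replicate profiles from several plates into one median signature per (broad_sample, dose) pair. It uses only the standard library rather than pycytominer. The helper name `median_consensus`, the sample identifier, the feature names, and the values are all made up for the example.

```python
from collections import defaultdict
from statistics import median


def median_consensus(profiles, group_keys, feature_names):
    """Collapse replicate profiles into one median profile per group.

    `profiles` is a list of dicts holding both metadata and feature values.
    """
    groups = defaultdict(list)
    for profile in profiles:
        key = tuple(profile[k] for k in group_keys)
        groups[key].append(profile)

    consensus = []
    for key, members in groups.items():
        row = dict(zip(group_keys, key))
        for feature in feature_names:
            row[feature] = median(m[feature] for m in members)
        consensus.append(row)
    return consensus


# Toy replicates of one compound at two doses, spread across plates.
replicates = [
    {"plate": "plate_1", "broad_sample": "BRD-A", "dose": 1.0, "feature_1": 0.2, "feature_2": -1.0},
    {"plate": "plate_2", "broad_sample": "BRD-A", "dose": 1.0, "feature_1": 0.4, "feature_2": -0.5},
    {"plate": "plate_3", "broad_sample": "BRD-A", "dose": 1.0, "feature_1": 0.9, "feature_2": -0.8},
    {"plate": "plate_1", "broad_sample": "BRD-A", "dose": 10.0, "feature_1": 1.5, "feature_2": 0.1},
    {"plate": "plate_2", "broad_sample": "BRD-A", "dose": 10.0, "feature_1": 1.7, "feature_2": 0.3},
]

consensus = median_consensus(
    replicates,
    group_keys=["broad_sample", "dose"],
    feature_names=["feature_1", "feature_2"],
)
for row in consensus:
    print(row)
```

The median is taken feature by feature, so a consensus profile need not correspond to any single replicate. Feature selection would then be applied to the resulting consensus table to produce the second output file.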

shntnu commented 4 years ago

> The default in cytominer_scripts/normalize.R is robustize. I assume that I should continue using this method.

Yes. Rationale: mostly empirical – robustize resulted in higher (compared to standardize) replicate correlations of Level 4 across a few experiments we tested this in.

> Question 2 - Is it ok to leave the whitened version for a future update?

Yes, definitely ok.

> How should I form the level 5 consensus data?

Your plan sounds good.

There's an incompatibility that I need to address in the handbook https://github.com/cytomining/profiling-handbook/issues/53. Ugh. So glad we are thinking through provenance and reproducibility via this project!

gwaybio commented 4 years ago

Closing this issue in favor of project management in https://github.com/broadinstitute/lincs-cell-painting/projects/1
