Make decisions up to consensus clearer

michaelbornholdt commented 3 years ago

Talking to Mattias I noticed that there are some holes in terms of explaining what happens to get to the consensus data.

1. It does not become clear anywhere that 'mad_robustize' is the normalization method used for all the consensus data.
2. As Shantanu has already pointed out, the consensus data should all be in one place.
3. Can we have the spherized_profiles in the profiler folder? That makes more sense, I think.
4. Can we have a high-level explanation of the normalization techniques, so people don't need to go into the actual pycytominer code to understand what norm by DMSO vs norm by plate means, and also what the difference between Mad, standard and Mad-rob is?
5. Finally, normalizing by plate is actually normalizing by the entire batch (136 plates), right? If so, it's a suboptimal name.

Sorry for all these suggestions at once. I don't know what is on your plate (haha) right now @gwaygenomics, so I'm unsure how to move forward with these suggestions.

CC: @FloHu, @shntnu @niranjchandrasekaran

FloHu commented 3 years ago

I completely agree. I'm new to this data set and not all steps are clear. Especially point (5) - in a way I would expect that normalisation is done by plate to correct for systematic differences between plates, at least that's what I am used to from other screens. And then 'mad_robustize' is more like an aggregation method applied after normalization, am I right?

FloHu commented 3 years ago

Also, 'normalized by whole plate' and 'normalized by DMSO': does this refer to all samples from one plate vs. all DMSO controls from one plate or is it all samples from one plate vs. all the DMSO samples from all > 130 plates?

gwaybio commented 3 years ago

Great to have these suggestions! And good to discuss here briefly before you invest time into making the actual fixes.

I'll respond to these items one by one:

> It does not become clear anywhere that 'mad_robustize' is the way of normalization for all the consensus data

Do you think it would be helpful to add this detail to profiles/README.md? Where do you think is best?

> As Shantanu has already pointed out, the consensus data should be all in one place.

I agree.

> Can we have the spherized_profiles in the profiler folder, makes more sense I think

I think so! I forget why we didn't in the first place. I think @shntnu might know - we made these decisions largely to mirror the organization the IP decided a while ago.

> Can we have a high-level explanation of the normalization techniques so people don't need to go into the actual pycytominer code to understand what norm by DMSO vs norm by plate means and also what the difference between Mad, standard and Mad-rob is

Sounds good! I think we'll want to have some written form of this for the paper, eventually. @niranjchandrasekaran - what do you think? Right now we don't have too much explanation in pycytominer.normalize(). Perhaps once we add sufficient documentation directly at the source (pycytominer) we can link to it from this repo 🤔
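For readers following the thread, here is a minimal sketch of how the three normalization methods named above differ. It assumes the common definitions (z-score for standardize, median/IQR for robustize, median/MAD for mad_robustize) and is not copied from pycytominer; the function names, the `epsilon` guard, and the 1.4826 scaling constant are illustrative assumptions.

```python
import statistics


def standardize(values):
    """Z-score: center on the mean, scale by the (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def robustize(values):
    """Center on the median, scale by the interquartile range (IQR)."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - med) / (q3 - q1) for v in values]


def mad_robustize(values, epsilon=1e-18):
    """Center on the median, scale by the median absolute deviation (MAD).

    Assumption: the MAD is multiplied by 1.4826 so that it is comparable
    to a standard deviation for normally distributed data.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) * 1.4826
    return [(v - med) / (mad + epsilon) for v in values]


# One feature measured in five wells, with a single extreme value.
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
for method in (standardize, robustize, mad_robustize):
    print(method.__name__, [round(v, 2) for v in method(feature)])
```

The mean and standard deviation are pulled toward the extreme well, whereas the median-based variants keep the typical wells near zero and leave the extreme well clearly separated.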

> Finally, normalizing by plate is actually normalizing by the entire batch (136 plates) right? If so, it's a suboptimal name.

Ah, I was confused by this in https://github.com/broadinstitute/lincs-cell-painting/issues/72#issuecomment-880650846 too - are you talking about the name of the spherized output files (e.g. 2017_12_05_Batch2_dmso_spherized_profiles_with_input_normalized_by_whole_plate.csv.gz)? The key here is "input normalized by" - this means the normalization procedure applied to the level 4 data. Happy to think through a more appropriate name.

niranjchandrasekaran commented 3 years ago

> Sounds good! I think we'll want to have some written form of this for the paper, eventually. @niranjchandrasekaran - what do you think? Right now we don't have too much explanation in pycytominer.normalize(). Perhaps once we add sufficient documentation directly at the source (pycytominer) we can link to it from this repo 🤔

This README.md currently points to profile_cells.py for more details. I wonder if we could add a section, briefly describing the two flavors of normalization, the method used for normalization (robustize, standardize, etc.) and what level of data the operation is performed, before pointing the reader to profile_cells.py. I guess this basically means describing the profiling pipeline in words.
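As a rough illustration of the "two flavors" mentioned here, the sketch below normalizes the wells of a single plate against either all wells or only the DMSO wells. It assumes that statistics are fit within one plate, which the thread leaves open; the `Metadata_sample` column name, the `feature` key, and the `mad_normalize` helper are hypothetical.

```python
import statistics


def mad_normalize(plate_wells, reference="all", feature="feature", epsilon=1e-18):
    """Normalize one plate's wells against reference wells from the same plate.

    reference="all" uses every well on the plate ("whole plate");
    any other value selects wells whose Metadata_sample equals it (e.g. "DMSO").
    """
    if reference == "all":
        ref = [w[feature] for w in plate_wells]
    else:
        ref = [w[feature] for w in plate_wells if w["Metadata_sample"] == reference]
    med = statistics.median(ref)
    # Assumed scaling: MAD times 1.4826, as in the usual robust z-score.
    mad = statistics.median(abs(v - med) for v in ref) * 1.4826
    return [(w[feature] - med) / (mad + epsilon) for w in plate_wells]


# Hypothetical plate: three DMSO control wells and two compound wells.
plate = [
    {"Metadata_sample": "DMSO", "feature": 10.0},
    {"Metadata_sample": "DMSO", "feature": 11.0},
    {"Metadata_sample": "DMSO", "feature": 12.0},
    {"Metadata_sample": "compound_A", "feature": 20.0},
    {"Metadata_sample": "compound_B", "feature": 30.0},
]

by_whole_plate = mad_normalize(plate, reference="all")
by_dmso = mad_normalize(plate, reference="DMSO")
```

With DMSO as the reference, the controls are centered at zero and treated wells are expressed as deviations from the control distribution; with the whole plate as the reference, the treated wells themselves contribute to the center and scale.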

gwaybio commented 3 years ago

sounds great. @michaelbornholdt do these descriptions help?

> I'm unsure how to move forward with these suggestions.

I recommend that you file a pull request that adds documentation to specifically address the points of confusion we discuss above. @niranjchandrasekaran or I can review the PR. I don't think it'll take long, and it will be very helpful to have at least one other person double-checking documentation for clarity.

michaelbornholdt commented 3 years ago

Ok, I can do that. Let's agree on what that PR contains then!

Add a paragraph to profiles/README.md describing:
1. The different normalization techniques (short summary in words, plus links)
2. The different pipelines (mad vs sphere) and where exactly they are normalized, and to what
3. An exact list of steps to reproduce the consensus data
4. A description of what the file names mean, e.g. 2017_12_05_Batch2_dmso_spherized_profiles_with_input_normalized_by_whole_plate.csv.gz

Move consensus to consensus and spherized profiles to the other profiles folder
Update the flow diagram to represent the difference between Madrob and spherized?
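To make the file-name point concrete, here is a small sketch that splits the example file name into labeled parts. Only the "input normalized by" part is explained in this thread; reading the token before `_spherized_profiles` as the samples used to fit the spherize step is an assumption, and the field names and the `parse_spherized_filename` helper are hypothetical.

```python
import re

# Assumed layout: <batch>_<spherize samples>_spherized_profiles
#                 _with_input_normalized_by_<input normalization>.csv.gz
FILENAME_PATTERN = re.compile(
    r"^(?P<batch>\d{4}_\d{2}_\d{2}_Batch\d+)"
    r"_(?P<spherize_samples>\w+?)_spherized_profiles"
    r"_with_input_normalized_by_(?P<input_normalization>\w+)\.csv\.gz$"
)


def parse_spherized_filename(filename):
    """Return the labeled parts of a spherized profile file name, or None."""
    match = FILENAME_PATTERN.match(filename)
    return match.groupdict() if match else None


name = "2017_12_05_Batch2_dmso_spherized_profiles_with_input_normalized_by_whole_plate.csv.gz"
parts = parse_spherized_filename(name)
print(parts)
```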

gwaybio commented 3 years ago

sounds good! Up to you if you want to also address #72 in this PR - having two PRs is sometimes better

For the moving around of files, I think it makes sense. It's also very possible that I am missing/forgetting a critical detail of why we have it this way - @shntnu will need to double check this.

There's no reason why you can't get started on that first paragraph and updating the flow diagram (LMK if you need pointers on this flow diagram) before hearing from Shantanu.

michaelbornholdt commented 3 years ago

Yes please, I have been on the doc where all the flow diagrams are but don't have the link anymore :)

gwaybio commented 3 years ago

I'll send you the link via slack. Feel free to completely edit this document (I've saved a duplicate elsewhere).

You can also feel free to depart from the pycytominer conventions and to create a pipeline diagram that specifically discusses our pipeline in this specific repo. I can use this figure in our paper in progress ;)

shntnu commented 3 years ago

> Move consensus to consensus and spherized profiles to the other profiles folder

I didn't follow. Can you clarify what the new folder structure would look like @michaelbornholdt

michaelbornholdt commented 3 years ago

> I didn't follow. Can you clarify what the new folder structure would look like @michaelbornholdt

The new structure would be:
| - consensus
| -- batch1
| --- spherized
| --- non spherized
| - profiles
| -- non spherized
| -- spherized

michaelbornholdt commented 3 years ago

Here is the first iteration of the new flow. Is it more intuitive?

[Figure: LINCS Cell Painting - Data Pipeline (1)]

shntnu commented 3 years ago

> The new structure would be

I'd favor keeping things within existing structures if possible; I made some edits to this: https://github.com/cytomining/profiling-handbook/issues/54

That would mean that the spherize script would live in a different folder from its output, but at least everything is together. What do you think? @michaelbornholdt

gwaybio commented 3 years ago

> Here is the first iteration of the new flow. is it more intuitive?

Nice! A couple comments and a decision point for you.

Comments

I like how you've edited the text to be more specific to this repo. Let's go all in with this. Instead of highlighting in red which steps we performed, delete any non-pertinent info.
I also like that you've included the options we used (e.g. the two different samples in normalization). Rename "samples" --> "options" and then include the actual function arguments, i.e. samples="all" and samples="Metadata_sample =='DMSO'" (I didn't check if that last one is the actual function argument, please do)
I don't think this is where spherize belongs. It takes in level 4a profiles (normalized in various ways) and spherizes using DMSO samples only. Since you're about to add consensus signatures of spherized profiles, anticipate this step too. You'll need to make two more cylinders to represent the two additional output data types (spherized profiles and spherized consensus profiles).
Make the consensus method option representative of what we actually did - MODZ and median.
drop cytominer-eval, it's not specific to this repo
use the blue background rectangle to represent all the data that's stored in this repo, instead of representing pycytominer. You might even note that level 3, 4a, and 4b data are stored via dvc and consensus/spherized are stored via git lfs
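The two consensus options mentioned in the comments above (median and MODZ) can be sketched as follows. This is one common formulation of MODZ, in which replicates are weighted by their average Spearman correlation with the other replicates; it is not taken from this repo or from pycytominer, and the `min_weight` value and function names are assumptions.

```python
import numpy as np


def median_consensus(profiles):
    """Feature-wise median across replicate profiles (rows)."""
    return np.median(profiles, axis=0)


def spearman_matrix(profiles):
    """Pairwise Spearman correlation between rows.

    Ranks come from a double argsort, which does not average ties;
    that is adequate for continuous features in a sketch.
    """
    ranks = profiles.argsort(axis=1).argsort(axis=1).astype(float)
    return np.corrcoef(ranks)


def modz_consensus(profiles, min_weight=0.01):
    """Weighted average of replicates; poorly correlated replicates count less."""
    n_replicates = profiles.shape[0]
    if n_replicates == 1:
        return profiles[0].copy()
    corr = spearman_matrix(profiles)
    np.fill_diagonal(corr, 0.0)  # ignore self-correlation
    weights = np.clip(corr.sum(axis=1) / (n_replicates - 1), min_weight, None)
    weights = weights / weights.sum()
    return weights @ profiles


# Three replicates of one perturbation; the third disagrees with the others.
replicates = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 1.0, 3.5, 0.5],
])
print(median_consensus(replicates))
print(modz_consensus(replicates))
```

Both approaches reduce the influence of the discordant replicate: the median ignores it feature by feature, while MODZ gives the whole replicate a small weight.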

Decision point

The comments I made above are specific to the data in this repo. We might also decide to only include the data processing we used in the LINCS complementarity paper (see here for the exact data types we used). I might actually be in favor of this since we'll also be able to use it as a supplementary figure in that paper and it will be easier to make - although it might be misleading to only include some of the data versions we created. Maybe we can create both figures

FloHu commented 3 years ago

I think your comments and especially the new workflow figure make things a lot clearer, thanks! I agree with most of Greg's comments. Additional points from my side:

The normalization options 'all samples' vs. 'DMSO samples' still need to make clear whether they refer to samples within a single plate or to samples from a set of plates.
So spherization is another normalization step on top of the per plate normalizations of 4a. It may be helpful to write in the README or the source code (ideally both) what is the motivation for those (clear for per plate but not for spherizing). I agree that it's good to show this as additional cylinders on the figure.
People may wonder if there is an additional correction for plate position effects (such as median polish) that is performed.
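To illustrate what spherizing on top of normalized profiles means, here is a minimal whitening sketch: the transform is fit on DMSO profiles only and then applied to every profile. ZCA whitening is used as one representative variant; the thread does not say which variant or regularization the repo uses, so the `epsilon` value, the function names, and the simulated data are assumptions.

```python
import numpy as np


def fit_spherize(reference, epsilon=1e-6):
    """Fit a ZCA whitening transform on reference (e.g. DMSO) profiles."""
    mean = reference.mean(axis=0)
    cov = np.cov(reference - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rescale each principal direction to unit variance, then rotate back.
    whitener = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ eigvecs.T
    return mean, whitener


def apply_spherize(profiles, mean, whitener):
    """Center on the reference mean and apply the fitted whitening matrix."""
    return (profiles - mean) @ whitener


# Simulated, correlated features for DMSO wells and treated wells.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
dmso_profiles = rng.normal(size=(200, 3)) @ mixing
treated_profiles = rng.normal(loc=1.0, size=(20, 3)) @ mixing

mean, whitener = fit_spherize(dmso_profiles)
dmso_spherized = apply_spherize(dmso_profiles, mean, whitener)
treated_spherized = apply_spherize(treated_profiles, mean, whitener)
```

After the transform, the DMSO profiles have approximately identity covariance, so feature correlations present in the controls are removed from all profiles.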

General question: is there an accompanying manuscript somewhere? I assume it is going to be the analysis coming out of the LINCS profiling analysis.

michaelbornholdt commented 3 years ago

@shntnu

> That would mean that the spherize script would live in a different folder from its output

Since the spherized profiles are (kind of) a different level from the other profiles, I am now convinced that we should keep them apart, as they are right now.

@gwaygenomics Thanks for the ideas, I will get working on that. I think LINCS should definitely have its own flow diagram (I've already started it). If it's convenient to make one for the paper, I can do that as well. I'm not aware of the difference yet; maybe you can point me to it or explain in a quick call.

michaelbornholdt commented 3 years ago

Update on the graphic:

[Figure: LINCS Cell Painting - Data Pipeline (2)]

gwaybio commented 3 years ago

yes!!! Love it. Couple of final comments:

Shouldn't the arrow leading into Spherize come from the level 3 data and not the red block Normalize? Or should it be from level 4a? Also, I like that you've included Spherize separately - I think this makes things clear.
- Can you also name this Level 4as (and then also Level 4bs and Level 5s) data? We've used Level 4W in the past to denote whitened data (we call whiten spherize now, see https://github.com/broadinstitute/lincs-cell-painting/issues/38#issuecomment-701001829)
- Should we somehow note that we spherize data based on both whole plate and DMSO separately? Maybe this would make things too confusing?
I love the dotted lines for the data.
- Can you change lfs to git lfs
- I think the spherized data are git lfs too, right?
- Can you move the solid line to somehow include aggregate, annotate, and platemaps? we include those pieces in this repo too, it's just the level 1 profiles, CellProfiler, single cell processing and level 2 profiles that we don't include.
When you add this updated flow chart, can you remove the title? (where you sign it) Your name will be associated with the commit

FloHu commented 3 years ago

Looks a lot better now! I think spherize will always be applied on normalized data so it makes sense that it doesn't come from the "Level 3" cylinder.

michaelbornholdt commented 3 years ago

Continue discussion on: https://github.com/broadinstitute/lincs-cell-painting/pull/75

michaelbornholdt commented 3 years ago

Closing, since the resulting PR was merged.

broadinstitute / lincs-cell-painting

Make decisions up to consensus clearer #73
