broadinstitute / lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
BSD 3-Clause "New" or "Revised" License

Make decisions up to consensus clearer #73

Closed michaelbornholdt closed 3 years ago

michaelbornholdt commented 3 years ago

Talking to Mattias I noticed that there are some holes in terms of explaining what happens to get to the consensus data.

  1. It is not made clear anywhere that 'mad_robustize' is the normalization method used for all the consensus data.
  2. As Shantanu has already pointed out, the consensus data should all be in one place.
  3. Can we have the spherized_profiles in the profiles folder? That makes more sense, I think.
  4. Can we have a high-level explanation of the normalization techniques, so people don't need to go into the actual pycytominer code to understand what norm by DMSO vs. norm by plate means, and what the difference between Mad, standard and Mad-rob is?
  5. Finally, normalizing by plate is actually normalizing by the entire batch (136 plates), right? If so, it's a suboptimal name.

Sorry for all these suggestions at once. I don't know what is on your plate (haha) right now @gwaygenomics, so I'm unsure how to move forward with these suggestions.

CC: @FloHu, @shntnu @niranjchandrasekaran

FloHu commented 3 years ago

I completely agree. I'm new to this data set and not all steps are clear. Especially point (5) - in a way I would expect that normalisation is done by plate to correct for systematic differences between plates, at least that's what I am used to from other screens. And then 'mad_robustize' is more like an aggregation method applied after normalization, am I right?

FloHu commented 3 years ago

Also, 'normalized by whole plate' and 'normalized by DMSO': does this refer to all samples from one plate vs. all DMSO controls from one plate or is it all samples from one plate vs. all the DMSO samples from all > 130 plates?
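
To make the question concrete, here is a minimal sketch of the combinations being asked about: which wells act as the reference (all samples vs. DMSO only) and over what scope (one plate vs. the whole batch). It is not taken from this repository or from pycytominer. The function `normalize_feature`, its parameters, and the toy values are hypothetical, and a plain z-score is used purely for illustration. Which combination the pipeline actually uses is exactly the open question.

```python
import numpy as np


def normalize_feature(values, plates, is_dmso, reference="all", per_plate=True):
    """Z-score one feature against a chosen reference population (illustrative only).

    reference: "all" uses every well, "dmso" uses only DMSO control wells.
    per_plate: True computes reference statistics within each plate,
               False pools the reference wells across the whole batch.
    """
    values = np.asarray(values, dtype=float)
    plates = np.asarray(plates)
    is_dmso = np.asarray(is_dmso, dtype=bool)
    out = np.empty_like(values)

    groups = np.unique(plates) if per_plate else [None]
    for plate in groups:
        # Wells that get normalized in this pass.
        in_group = np.ones(len(values), dtype=bool) if plate is None else plates == plate
        # Wells that supply the mean and standard deviation.
        ref = in_group & is_dmso if reference == "dmso" else in_group
        out[in_group] = (values[in_group] - values[ref].mean()) / values[ref].std()
    return out


# Toy data (made up): plate B is plate A shifted by +10.
values = [1.0, 2.0, 3.0, 6.0, 11.0, 12.0, 13.0, 16.0]
plates = ["A"] * 4 + ["B"] * 4
is_dmso = [True, True, True, False] * 2

per_plate_dmso = normalize_feature(values, plates, is_dmso, reference="dmso", per_plate=True)
whole_batch_all = normalize_feature(values, plates, is_dmso, reference="all", per_plate=False)
```

In this toy data only the per-plate variants remove the plate offset; pooling the reference wells across the batch leaves plate A systematically below plate B.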

gwaybio commented 3 years ago

Great to have these suggestions! And good to discuss here briefly before you invest time into making the actual fixes.

I'll respond to these items one by one:

  1. It is not made clear anywhere that 'mad_robustize' is the normalization method used for all the consensus data.

Do you think it would be helpful to add this detail to profiles/README.md? Where do you think is best?

  2. As Shantanu has already pointed out, the consensus data should all be in one place.

I agree.

  3. Can we have the spherized_profiles in the profiles folder? That makes more sense, I think.

I think so! I forget why we didn't in the first place. I think @shntnu might know - we made these decisions largely to mirror the organization the IP decided a while ago.

  4. Can we have a high-level explanation of the normalization techniques, so people don't need to go into the actual pycytominer code to understand what norm by DMSO vs. norm by plate means, and what the difference between Mad, standard and Mad-rob is?

Sounds good! I think we'll want to have some written form of this for the paper, eventually. @niranjchandrasekaran - what do you think? Right now we don't have too much explanation in pycytominer.normalize(). Perhaps once we add sufficient documentation directly at the source (pycytominer) we can link to it from this repo 🤔

  5. Finally, normalizing by plate is actually normalizing by the entire batch (136 plates), right? If so, it's a suboptimal name.

Ah, I was confused by this in https://github.com/broadinstitute/lincs-cell-painting/issues/72#issuecomment-880650846 too - are you talking about the name of the spherized output files (e.g. 2017_12_05_Batch2_dmso_spherized_profiles_with_input_normalized_by_whole_plate.csv.gz)? The key here is "input normalized by" - this means the normalization procedure applied to the level 4 data. Happy to think through a more appropriate name.
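
One way to read such a file name is sketched below. The pattern is an assumption inferred from the single example quoted above, not a documented naming scheme, and the helper `parse_spherized_name` and the group names are hypothetical. In particular, the meaning of the token before `spherized_profiles` (here `dmso`) is not stated in this thread, so it is only captured, not interpreted.

```python
import re

# Assumed naming pattern, inferred from one example file name only.
SPHERIZED_NAME = re.compile(
    r"^(?P<batch>.+?)_(?P<samples_token>[a-z]+)_spherized_profiles"
    r"_with_input_normalized_by_(?P<input_normalization>[a-z_]+)\.csv\.gz$"
)


def parse_spherized_name(filename):
    """Split a spherized output file name into its assumed components."""
    match = SPHERIZED_NAME.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized file name: {filename}")
    return match.groupdict()


parts = parse_spherized_name(
    "2017_12_05_Batch2_dmso_spherized_profiles_with_input_normalized_by_whole_plate.csv.gz"
)
# parts["input_normalization"] names the normalization applied to the
# level 4 data that was fed into spherize, per the explanation above.
```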

niranjchandrasekaran commented 3 years ago

Sounds good! I think we'll want to have some written form of this for the paper, eventually. @niranjchandrasekaran - what do you think? Right now we don't have too much explanation in pycytominer.normalize(). Perhaps once we add sufficient documentation directly at the source (pycytominer) we can link to it from this repo 🤔

This README.md currently points to profile_cells.py for more details. I wonder if we could add a section briefly describing the two flavors of normalization, the method used for normalization (robustize, standardize, etc.) and at what level of data the operation is performed, before pointing the reader to profile_cells.py. I guess this basically means describing the profiling pipeline in words.
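
As a starting point for such a section, the three method names could be sketched as below. This assumes 'standardize' means centering by the mean and scaling by the standard deviation, 'robustize' means median and interquartile range, and 'mad_robustize' means median and scaled median absolute deviation (MAD). The function bodies, the `ref` argument (the reference wells, e.g. all wells or DMSO wells) and the 1.4826 scaling constant are illustrative assumptions, not the pycytominer source.

```python
import numpy as np


def standardize(x, ref):
    """Center by the reference mean, scale by the reference standard deviation."""
    return (x - ref.mean()) / ref.std()


def robustize(x, ref):
    """Center by the reference median, scale by the interquartile range."""
    q1, q3 = np.percentile(ref, [25, 75])
    return (x - np.median(ref)) / (q3 - q1)


def mad_robustize(x, ref, epsilon=1e-18):
    """Center by the reference median, scale by the MAD.

    The 1.4826 factor makes the MAD comparable to a standard deviation
    for normally distributed data; epsilon guards against division by zero.
    """
    med = np.median(ref)
    mad = np.median(np.abs(ref - med))
    return (x - med) / (mad * 1.4826 + epsilon)


# Toy reference wells (made up) with one extreme outlier.
ref = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
x = 4.0
scores = {
    "standardize": standardize(x, ref),
    "robustize": robustize(x, ref),
    "mad_robustize": mad_robustize(x, ref),
}
```

With the outlier in the reference, the mean-based score for `x = 4.0` comes out negative even though 4.0 lies above the median, while the two median-based scores stay positive.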

gwaybio commented 3 years ago

sounds great. @michaelbornholdt do these descriptions help?

I'm unsure how to move forward with these suggestions.

I recommend that you file a pull request that adds documentation to specifically address the points of confusion we discuss above. @niranjchandrasekaran or I can review the PR. I don't think it'll take long, and it will be very helpful to have at least one other person double-checking documentation for clarity.

michaelbornholdt commented 3 years ago

Ok, I can do that. Let's agree on what that PR contains then!

gwaybio commented 3 years ago

sounds good! Up to you if you want to also address #72 in this PR - having two PRs is sometimes better

For the moving around of files, I think it makes sense. It's also very possible that I am missing/forgetting a critical detail of why we have it this way - @shntnu will need to double check this.

There's no reason why you can't get started on that first paragraph and updating the flow diagram (LMK if you need pointers on this flow diagram) before hearing from Shantanu.

michaelbornholdt commented 3 years ago

Yes please, I have been on the doc where all the flow diagrams are but don't have the link anymore :)

gwaybio commented 3 years ago

I'll send you the link via slack. Feel free to completely edit this document (I've saved a duplicate elsewhere).

You can also feel free to depart from the pycytominer conventions and to create a pipeline diagram that specifically discusses our pipeline in this specific repo. I can use this figure in our paper in progress ;)

shntnu commented 3 years ago
  • Move consensus to consensus and spherized profiles to the other profiles folder

I didn't follow. Can you clarify what the new folder structure would look like @michaelbornholdt

michaelbornholdt commented 3 years ago

I didn't follow. Can you clarify what the new folder structure would look like @michaelbornholdt

The new structure would be:

| - consensus
| -- batch1
| --- spherized
| --- non spherized

| - profiles
| -- non spherized
| -- spherized

michaelbornholdt commented 3 years ago

Here is the first iteration of the new flow. Is it more intuitive?

[Figure: LINCS Cell Painting - Data Pipeline (1)]

shntnu commented 3 years ago

The new structure would be

I'd favor keeping things within existing structures if possible; I made some edits to this: https://github.com/cytomining/profiling-handbook/issues/54

That would mean that the spherize script would live in a different folder from its output, but at least everything is together. What do you think? @michaelbornholdt

gwaybio commented 3 years ago

Here is the first iteration of the new flow. is it more intuitive?

Nice! A couple comments and a decision point for you.

Comments

Decision point

The comments I made above are specific to the data in this repo. We might also decide to only include the data processing we used in the LINCS complementarity paper (see here for the exact data types we used). I might actually be in favor of this since we'll also be able to use it as a supplementary figure in that paper and it will be easier to make - although it might be misleading to only include some of the data versions we created. Maybe we can create both figures.

FloHu commented 3 years ago

I think your comments and especially the new workflow figure make things a lot clearer, thanks! I agree with most of Greg's comments. Additional points from my side:

General question: is there an accompanying manuscript somewhere? I assume it is going to be the analysis coming out of the LINCS profiling analysis.

michaelbornholdt commented 3 years ago

@shntnu

That would mean that the spherize script would live in a different folder from its output

Since the spherized profiles are (kind of) a different level than the other profiles, I am now convinced that we should keep them apart, as they are right now.

@gwaygenomics Thanks for the ideas, I will get working on that. I think LINCS should definitely have its own flow diagram (I have already started it). If it's convenient to make one for the paper, I can do that as well. I'm not aware of the difference yet; maybe you can point me to it or explain in a quick call.

michaelbornholdt commented 3 years ago

Update on the graphic:

[Figure: LINCS Cell Painting - Data Pipeline (2)]

gwaybio commented 3 years ago

yes!!! Love it. Couple of final comments:

FloHu commented 3 years ago

Looks a lot better now! I think spherize will always be applied on normalized data so it makes sense that it doesn't come from the "Level 3" cylinder.
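
To illustrate why spherize sits downstream of normalization, here is a minimal whitening sketch that takes already-normalized profiles as input. ZCA whitening is used as one common example of a sphering transform; whether this repo uses ZCA, and which wells the transform is fit on, are assumptions here. The helpers `fit_zca` and `apply_zca` and the simulated data are hypothetical.

```python
import numpy as np


def fit_zca(reference, eps=1e-8):
    """Estimate a ZCA whitening transform from normalized reference profiles."""
    mean = reference.mean(axis=0)
    cov = np.cov(reference - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate into the eigenbasis, rescale each axis to unit variance, rotate back.
    whitening = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, whitening


def apply_zca(profiles, mean, whitening):
    """Apply a previously fitted transform to normalized profiles."""
    return (profiles - mean) @ whitening


# Simulated stand-in for normalized profiles with correlated features.
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.8, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
normalized_reference = rng.normal(size=(500, 3)) @ mixing

mean, whitening = fit_zca(normalized_reference)
spherized = apply_zca(normalized_reference, mean, whitening)
```

After the transform, the features of the reference profiles are uncorrelated with unit variance, which is the property the term "spherized" refers to in this sketch.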

michaelbornholdt commented 3 years ago

Continue discussion on: https://github.com/broadinstitute/lincs-cell-painting/pull/75

michaelbornholdt commented 3 years ago

Closing since the resulting PR was merged.