Closed michaelbornholdt closed 3 years ago
I completely agree. I'm new to this data set and not all steps are clear. Especially point (5) - in a way I would expect that normalisation is done by plate to correct for systematic differences between plates, at least that's what I am used to from other screens. And then 'mad_robustize' is more like an aggregation method applied after normalization, am I right?
Also, 'normalized by whole plate' and 'normalized by DMSO': does this refer to all samples from one plate vs. all DMSO controls from one plate or is it all samples from one plate vs. all the DMSO samples from all > 130 plates?
Great to have these suggestions! And good to discuss here briefly before you invest time into making the actual fixes.
I'll respond to these items one by one:
- It does not become clear anywhere that 'mad_robustize' is the way of normalization for all the consensus data
Do you think it would be helpful to add this detail to profiles/README.md? Where do you think is best?
- As Shantanu has already pointed out, the consensus data should be all in one place.
I agree.
- Can we have the spherized_profiles in the profiler folder, makes more sense I think
I think so! I forget why we didn't in the first place. I think @shntnu might know - we made these decisions largely to mirror the organization the IP decided a while ago.
- Can we have a high-level explanation of the normalization techniques so people don't need to go into the actual pycytominer code to understand what norm by DMSO vs norm by plate means and also what the difference between Mad, standard and Mad-rob is
Sounds good! I think we'll want to have some written form of this for the paper, eventually. @niranjchandrasekaran - what do you think? Right now we don't have too much explanation in pycytominer.normalize(). Perhaps once we add sufficient documentation directly at the source (pycytominer) we can link to it from this repo 🤔
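To make the requested high-level explanation concrete, here is a minimal sketch of how the three methods named in this thread are commonly defined. It is not the pycytominer code: the function names only mirror the method strings used above, and the 1.4826 scaling constant and the small `epsilon` are assumptions taken from the usual MAD convention, so please check them against the pycytominer source before relying on them.

```python
import statistics


def standardize(values):
    """Z-score: subtract the mean, divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def robustize(values):
    """Subtract the median, divide by the interquartile range."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


def mad_robustize(values, epsilon=1e-18):
    """Subtract the median, divide by the scaled median absolute deviation (MAD).

    The 1.4826 factor (assumed here) makes the MAD comparable to a standard
    deviation for normally distributed data; epsilon avoids division by zero.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    return [(v - median) / (1.4826 * mad + epsilon) for v in values]


# One feature measured in five wells, with a single extreme well.
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
for method in (standardize, robustize, mad_robustize):
    print(method.__name__, [round(v, 2) for v in method(feature)])
```

The point of the comparison: the mean and standard deviation are both pulled by the extreme well, so `standardize` understates how unusual that well is, while the median-based methods keep the scale set by the bulk of the wells.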
- Finally, normalizing by plate is actually normalizing by the entire batch (136 plates), right? If so, it's a suboptimal name.
Ah, I was confused by this in https://github.com/broadinstitute/lincs-cell-painting/issues/72#issuecomment-880650846 too - are you talking about the name of the spherized output files (e.g. 2017_12_05_Batch2_dmso_spherized_profiles_with_input_normalized_by_whole_plate.csv.gz)? The key here is "input normalized by" - this means the normalization procedure applied to the level 4 data. Happy to think through a more appropriate name.
Sounds good! I think we'll want to have some written form of this for the paper, eventually. @niranjchandrasekaran - what do you think? Right now we don't have too much explanation in pycytominer.normalize(). Perhaps once we add sufficient documentation directly at the source (pycytominer) we can link to it from this repo 🤔
This README.md currently points to profile_cells.py for more details. I wonder if we could add a section briefly describing the two flavors of normalization, the method used for normalization (robustize, standardize, etc.) and the level of data at which the operation is performed, before pointing the reader to profile_cells.py. I guess this basically means describing the profiling pipeline in words.
sounds great. @michaelbornholdt do these descriptions help?
I'm unsure how to move forward with these suggestions.
I recommend that you file a pull request that adds documentation to specifically address the points of confusion we discuss above. @niranjchandrasekaran or I can review the PR. I don't think it'll take long, and it will be very helpful to have at least one other person double-checking documentation for clarity.
Ok, I can do that. Let's agree on what that PR contains then!
2017_12_05_Batch2_dmso_spherized_profiles_with_input_normalized_by_whole_plate.csv.gz
sounds good! Up to you if you want to also address #72 in this PR - having two PRs is sometimes better
For the moving around of files, I think it makes sense. It's also very possible that I am missing/forgetting a critical detail of why we have it this way - @shntnu will need to double check this.
There's no reason why you can't get started on that first paragraph and updating the flow diagram (LMK if you need pointers on this flow diagram) before hearing from Shantanu.
Yes please, I have been on the doc where all the flow diagrams are but don't have the link anymore :)
I'll send you the link via slack. Feel free to completely edit this document (I've saved a duplicate elsewhere).
You can also feel free to depart from the pycytominer conventions and to create a pipeline diagram that specifically discusses our pipeline in this specific repo. I can use this figure in our paper in progress ;)
- Move consensus to consensus and spherized profiles to the other profiles folder
I didn't follow. Can you clarify what the new folder structure would look like @michaelbornholdt
The new structure would be:
| - consensus
| -- batch1
| --- spherized
| --- non spherized
| - profiles
| -- non spherized
| -- spherized
Here is the first iteration of the new flow. Is it more intuitive?
The new structure would be
I'd favor keeping things within existing structures if possible; I made some edits to this: https://github.com/cytomining/profiling-handbook/issues/54
That would mean that the spherize script would live in a different folder from its output, but at least everything is together. What do you think? @michaelbornholdt
Here is the first iteration of the new flow. Is it more intuitive?
Nice! A couple comments and a decision point for you.
samples="all" and samples="Metadata_sample =='DMSO'" (I didn't check if that last one is the actual function argument, please do)
The comments I made above are specific to the data in this repo. We might also decide to only include the data processing we used in the LINCS complementarity paper (see here for the exact data types we used). I might actually be in favor of this since we'll also be able to use it as a supplementary figure in that paper and it will be easier to make - although it might be misleading to only include some of the data versions we created. Maybe we can create both figures.
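For readers wondering what the two `samples` settings change in practice, here is a small sketch of the idea for a single plate. It is an illustration, not the pycytominer implementation: `normalize_plate` is a hypothetical helper, the `Metadata_sample` key is the unverified column name from the comment above, and the sketch takes a plain sample label rather than a query string. The only thing that differs between the two calls is which wells are used to estimate the center and scale; every well on the plate is then transformed with those estimates.

```python
import statistics


def normalize_plate(wells, feature, samples="all"):
    """MAD-based normalization of one feature on one plate (illustrative only).

    wells   : list of dicts, one per well (hypothetical layout)
    feature : key of the feature to normalize
    samples : "all" to use every well as the reference population, or a
              sample label (e.g. "DMSO") to use only those wells
    """
    if samples == "all":
        reference = [w[feature] for w in wells]
    else:
        reference = [w[feature] for w in wells if w["Metadata_sample"] == samples]
    if not reference:
        raise ValueError(f"no reference wells found for samples={samples!r}")

    median = statistics.median(reference)
    scale = 1.4826 * statistics.median(abs(v - median) for v in reference)
    if scale == 0:
        raise ValueError("reference wells have zero spread for this feature")

    # Center and scale come from the reference wells only,
    # but the transform is applied to every well on the plate.
    return [(w[feature] - median) / scale for w in wells]


# Hypothetical plate: four DMSO control wells and two compound wells.
plate = [
    {"Metadata_sample": "DMSO", "feature": 1.0},
    {"Metadata_sample": "DMSO", "feature": 2.0},
    {"Metadata_sample": "DMSO", "feature": 3.0},
    {"Metadata_sample": "DMSO", "feature": 4.0},
    {"Metadata_sample": "compound_A", "feature": 10.0},
    {"Metadata_sample": "compound_B", "feature": 12.0},
]

by_whole_plate = normalize_plate(plate, "feature", samples="all")
by_dmso = normalize_plate(plate, "feature", samples="DMSO")
print([round(v, 2) for v in by_whole_plate])
print([round(v, 2) for v in by_dmso])
```

With the DMSO reference, the controls are centered on zero and the compound wells sit further from it than under whole-plate normalization, because the treated wells no longer contribute to the estimated spread.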
I think your comments and especially the new workflow figure make things a lot clearer, thanks! I agree with most of Greg's comments. Additional points from my side:
General question: is there an accompanying manuscript somewhere? I assume it is going to be the analysis coming out of the LINCS profiling analysis.
@shntnu
That would mean that the spherize script would live in a different folder from its output
Since the spherized profiles are (kind of) a different level than the other profiles, I am now convinced we should keep them apart, as they are right now.
@gwaygenomics Thanks for the ideas, I will get working on that. I think LINCS should definitely have its own flow diagram (already started it). If it's convenient to make one for the paper, I can do that as well. I'm not aware of the difference yet. Maybe you can point me to it or explain in a quick call.
Update on the graphic
yes!!! Love it. Couple of final comments:
Spherize come from the level 3 data and not the red block Normalize? Or should it be from level 4a? Also, I like that you've included Spherize separately - I think this makes things clear.
Level 4as (and then also Level 4bs and Level 5s) data? We've used Level 4W in the past to denote whitened data (we call whiten spherize now, see https://github.com/broadinstitute/lincs-cell-painting/issues/38#issuecomment-701001829)
lfs to git lfs
git lfs too, right?
Looks a lot better now! I think spherize will always be applied on normalized data so it makes sense that it doesn't come from the "Level 3" cylinder.
Continue discussion on: https://github.com/broadinstitute/lincs-cell-painting/pull/75
Close since resulting PR was merged
Talking to Mattias I noticed that there are some holes in terms of explaining what happens to get to the consensus data.
- Can we have the spherized_profiles in the profiler folder, makes more sense I think
Sorry for all these suggestions at once. I don't know what is on your plate (haha) right now @gwaygenomics so I'm unsure how to move forward with these suggestions.
CC: @FloHu, @shntnu @niranjchandrasekaran