Closed shntnu closed 3 years ago
+1 on this. Mattias was very confused about this as well!
Actually, I think they should go here: lincs-cell-painting/consensus
So I assume this means that the files in lincs-cell-painting/consensus are not spherized. Then which normalization strategy was applied there? This is not clear from the respective notebook (consensus/build-consensus-signatures.ipynb). Also, do "plate normalization" and "batch normalization" refer to the same procedure (as I would think)?
Very glad to have you both digging into this repo to uncover what is clear and what is not.
"So I assume this means that the files in lincs-cell-painting/consensus are not spherized. Then which normalization strategy was applied there?"
Correct, the profiles here are not spherized. We generate consensus signatures from the traditional level 4a normalized profiles.
From build-consensus-signature.ipynb, cell 5:
file_bases = {
    "whole_plate": {
        "input_file_suffix": "_normalized.csv.gz",
        "output_file_suffix": ".csv.gz",
    },
    "dmso": {
        "input_file_suffix": "_normalized_dmso.csv.gz",
        "output_file_suffix": "_dmso.csv.gz",
    },
}
We use these suffixes to load specific data levels.
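As a rough illustration of how suffixes like these can select a data level, here is a minimal sketch that builds the input file name for one plate. The `input_file` helper, the `profiles` directory, and the `PLATE_X` plate name are placeholders made up for this example; the notebook's actual loading code may look different.

```python
from pathlib import Path

# Copied from cell 5 of the notebook.
file_bases = {
    "whole_plate": {
        "input_file_suffix": "_normalized.csv.gz",
        "output_file_suffix": ".csv.gz",
    },
    "dmso": {
        "input_file_suffix": "_normalized_dmso.csv.gz",
        "output_file_suffix": "_dmso.csv.gz",
    },
}


def input_file(plate_dir, plate, norm_type):
    """Hypothetical helper: path of the level 4a file for one plate."""
    suffix = file_bases[norm_type]["input_file_suffix"]
    return Path(plate_dir) / f"{plate}{suffix}"


# "profiles" and "PLATE_X" are placeholders, not real repository names.
for norm_type in file_bases:
    print(norm_type, input_file("profiles", "PLATE_X", norm_type).name)
```

The point is only that the normalization type ("whole_plate" or "dmso") decides which level 4a file gets read.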
"Also, do "plate normalization" and "batch normalization" refer to the same procedure (as I would think)?"
They typically don't mean the same thing, but I am not sure what context you're referring to. In that context, it's possible we weren't entirely accurate!
(Plate normalization could be something like normalizing profiles only to the DMSO controls on each plate, with the goal of aligning profiles across plates, while batch normalization might normalize multiple plates together across multiple batches, with the goal of aligning profiles across batches.)
"Correct, the profiles here are not spherized. We generate consensus signatures from the traditional level 4a normalized profiles."
About the batches: I agree that they don't necessarily mean the same thing. But that means there are two possible types of normalization, and it is not clear which one is applied (again, speaking as someone going through the repository description without reading the actual pycytominer source code). Since there are always differences between plates, plate-level normalization is the first thing I think of when reading about normalization.
gotcha. Thanks!
Spherizing is in fact just one normalization method, but it happens at a different level. Level 4a data (MAD robustize normalization) are computed per plate. Spherized data are computed from all level 4a profiles together.
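To make the "per plate" part concrete, here is a minimal single-feature sketch of MAD robustize normalization in plain Python. It assumes the common definition (x - median) / (1.4826 * MAD + epsilon); pycytominer's implementation may differ in details such as the scaling constant and epsilon, and the plate names and values are made up. Spherizing is not shown here.

```python
from collections import defaultdict
from statistics import median


def mad_robustize(values, epsilon=1e-18):
    """Scale values as (x - median) / (1.4826 * MAD + epsilon)."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return [(v - center) / (1.4826 * mad + epsilon) for v in values]


def robustize_per_plate(rows):
    """rows: (plate, value) pairs for one feature; normalize each plate alone."""
    by_plate = defaultdict(list)
    for plate, value in rows:
        by_plate[plate].append(value)
    return {plate: mad_robustize(vals) for plate, vals in by_plate.items()}


# Made-up values: plate_b is on a 10x larger raw scale than plate_a.
rows = [
    ("plate_a", 1.0), ("plate_a", 2.0), ("plate_a", 3.0),
    ("plate_b", 10.0), ("plate_b", 20.0), ("plate_b", 30.0),
]
normalized = robustize_per_plate(rows)
print(normalized)
```

Because each plate is centered and scaled with its own median and MAD, the two plates end up on the same scale even though their raw values differ by a factor of ten.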
@FloHu - can you see if our discussion in #73 improves clarity on this specific point? And if not, can you describe it in the issue so that we can make all changes at once.
Let's stay on track with this issue specifically being about creating consensus spherized profiles (which I agree is tightly related to #73 and can probably be fixed in the same PR!)
@michaelbornholdt @FloHu or @shntnu - is anyone working on this currently or partially in the past? I might need this for an analysis in https://github.com/broadinstitute/lincs-profiling-comparison
I haven't worked on it
Completed in #76
I also haven't worked on this. Can't say I know where they are now. I assume they are 'hidden' with lfs in the consensus folder?
Given that we create a single CSV file for spherized profiles in this notebook, it will be easiest to compute the consensus in the same notebook.
The output should be stored at lincs-cell-painting/spherized_profiles/consensus and be named

2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_dmso_consensus_median.csv.gz
2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_whole_plate_consensus_median.csv.gz
2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_dmso_consensus_modz.csv.gz
2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_whole_plate_consensus_modz.csv.gz

i.e. median and modz consensus for each of the two Batch 1 files in this directory. And the same for Batch 2 (2017_12_05_Batch2).
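A minimal sketch of the naming scheme and of a median consensus, in plain Python so that it runs without dependencies. The helper names, the grouping by a single perturbation label, and the toy values are assumptions for illustration; the real notebook would more likely work on data frames via pycytominer, and MODZ is not shown.

```python
from collections import defaultdict
from statistics import median

BATCH = "2016_04_01_a549_48hr_batch1"


def consensus_file_name(batch, norm_type, operation):
    """Build an output name following the pattern listed above."""
    return (
        f"{batch}_dmso_spherized_profiles_with_input_normalized_by_"
        f"{norm_type}_consensus_{operation}.csv.gz"
    )


def median_consensus(profiles):
    """profiles: (perturbation, [feature values]) replicate rows."""
    grouped = defaultdict(list)
    for perturbation, features in profiles:
        grouped[perturbation].append(features)
    # Feature-wise median across the replicates of each perturbation.
    return {p: [median(col) for col in zip(*reps)] for p, reps in grouped.items()}


names = [
    consensus_file_name(BATCH, norm_type, operation)
    for operation in ("median", "modz")
    for norm_type in ("dmso", "whole_plate")
]

# Toy replicate profiles with two features each (made-up values).
profiles = [
    ("cpd_1", [1.0, 4.0]),
    ("cpd_1", [3.0, 6.0]),
    ("cpd_1", [2.0, 5.0]),
    ("cpd_2", [0.0, 1.0]),
]
consensus = median_consensus(profiles)
print(names[0])
print(consensus)
```

The list comprehension yields the four Batch 1 file names in the order given above; changing `BATCH` would produce the corresponding Batch 2 names once the batch prefix is known.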