Closed gwaybio closed 3 years ago
Whitening has been fixed in pycytominer version
@shntnu @gwaygenomics - What whitening method should we use for the normalization of the profiles?
At present, I have tried the four methods on two profiles. I realized that in order to use PCA-cor and ZCA-cor without getting the error "Divide by zero error, make sure low variance columns are removed", all columns with zero (0.0) values have to be dropped prior to whitening.
Also, I realized that after using PCA and ZCA, a few (1 - 2) of the normalized columns returned zeros (0.0) as their values.
getting this error "Divide by zero error, make sure low variance columns are removed", all columns with zero (0.0) values have to be dropped prior to whitening.
@AdeboyeML and I walked through this issue yesterday. The error is raised here. As I was writing the whitening methods, I noticed that the transformation fails when there are low variance features present.
Because of this error, let's form the whitening profiles using level 4b data instead of level 4a data (description of data levels).
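To make the low-variance failure mode concrete, here is a minimal sketch of dropping near-constant columns before whitening. It uses plain NumPy on synthetic data; the helper name `drop_low_variance`, the tolerance `tol`, and the feature names are assumptions for illustration and are not part of pycytominer.

```python
import numpy as np


def drop_low_variance(X, names, tol=1e-8):
    """Keep only the columns of X whose variance exceeds tol (hypothetical helper)."""
    variances = X.var(axis=0)
    keep = variances > tol
    kept_names = [name for name, flag in zip(names, keep) if flag]
    return X[:, keep], kept_names


# Synthetic stand-in for a plate: 24 wells, 4 features, one feature all zeros.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 4))
X[:, 2] = 0.0
names = ["f0", "f1", "f2_all_zero", "f3"]

X_kept, kept = drop_low_variance(X, names)
print(kept)
```

A column that is zero in every well has zero standard deviation, so any correlation-based whitening (PCA-cor, ZCA-cor) would divide by zero when scaling it. Filtering such columns first avoids that.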
What whitening method should we use for the normalization of the profiles
This is also a two-part question (the answer is probably the same for both parts).
@niranjchandrasekaran - I know you've done extensive testing on whitening variations. I also have UMAP profiles from one plate transformed using the different strategies (see below). Do you have a strong recommendation?
Based on these qualitative results, I think we should definitely normalize using DMSO profiles. The other options (PCA, PCA-cor, ZCA, ZCA-cor) are less clear.
@gwaygenomics ZCA-cor has been my go-to method in the JUMP-CP pilots. DMSO based standardization has also worked quite well. PCA based whitening hasn't worked well in my hands, though it may have something to do with how it was applied to the data (plate-wise, platemap-wise, or across all plates).
Based on my experience with the pilots, I would say that either DMSO based standardization or ZCA-cor can be offered as the default method in pycytominer.
Thanks @niranjchandrasekaran
it may have something to do with how it was applied to the data (plate-wise, platemap-wise, or across all plates).
Interesting! We are planning on doing plate-wise whitening - I don't see a benefit of platemap-wise normalization, but perhaps I am missing a key piece.
Based on my experience with the pilots, I would say that either DMSO based standardization or ZCA-cor can be offered as the default method in pycytominer.
Cool. I believe ZCA-cor = good performance and PCA = bad performance is also what @AdeboyeML observed. We should also apply ZCA-cor using only DMSO profiles in the lincs dataset. Like this:
from pycytominer import normalize

whitened_df = normalize(
    normalized_feature_selected_df,  # For each of the two level 4A profiles
    features="infer",
    meta_features="infer",
    samples="Metadata_broad_sample == 'DMSO'",  # This is the key arg to learn the whiten transform using only DMSO
    method="whiten",
    whiten_center=False,
    whiten_method="ZCA-cor",
)
We decided today at profiling checkin that ZCA-cor against DMSO profiles per-plate is the way to go.
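For reference, the idea of learning a ZCA-cor transform from DMSO wells only and then applying it to the whole plate can be sketched in plain NumPy. This is an illustrative sketch under stated assumptions, not pycytominer's implementation: the function names `fit_zca_cor` and `apply_whitening`, the `center` and `eps` parameters, and the synthetic plate are all hypothetical. It uses the standard ZCA-cor form W = P^(-1/2) V^(-1/2), where P is the correlation matrix and V the diagonal matrix of variances.

```python
import numpy as np


def fit_zca_cor(X_ref, center=True, eps=1e-12):
    """Learn a ZCA-cor whitening matrix from reference rows (e.g. DMSO wells)."""
    mu = X_ref.mean(axis=0) if center else np.zeros(X_ref.shape[1])
    Xc = X_ref - mu
    cov = Xc.T @ Xc / (X_ref.shape[0] - 1)
    std = np.sqrt(np.diag(cov))  # zero here is what triggers divide-by-zero
    corr = cov / np.outer(std, std)
    eigvals, eigvecs = np.linalg.eigh(corr)
    corr_inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    W = corr_inv_sqrt @ np.diag(1.0 / std)
    return mu, W


def apply_whitening(X, mu, W):
    """Apply a previously learned transform to any set of rows."""
    return (X - mu) @ W.T


# Synthetic plate: 24 control wells and 360 treated wells, 5 correlated features.
rng = np.random.default_rng(1)
mix = rng.normal(size=(5, 5))
controls = rng.normal(size=(24, 5)) @ mix
treated = 2.0 * rng.normal(size=(360, 5)) @ mix + 1.0
plate = np.vstack([controls, treated])
is_dmso = np.arange(plate.shape[0]) < 24

mu, W = fit_zca_cor(plate[is_dmso])      # learn from DMSO wells only
whitened = apply_whitening(plate, mu, W)  # apply to every well on the plate
control_cov = np.cov(whitened[is_dmso], rowvar=False)
```

With `center=True`, the covariance of the whitened control wells is the identity, while the treated wells are only expressed relative to the control distribution. In this sketch, `center=False` whitens the second moment about zero rather than the covariance, so a correlation heatmap of the result need not look decorrelated; whether pycytominer behaves the same way is not confirmed here.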
@gwaygenomics Yes, ZCA-cor will be used as default for the whitening.
However, with samples="Metadata_broad_sample == 'DMSO'" and whiten_center=False as the normalization parameters for the level 4b data (normalized_feature_select_DMSO and normalized_feature_select profiles), it doesn't give the expected de-correlation result. I think it is best to set samples="all" and whiten_center=True as the default.
@AdeboyeML - when you visualize the heatmaps, are you looking at only the DMSO profiles? We do not expect to see a decorrelated result in the full plate.
Also, can you post the resulting heatmap in this issue? It'll be great to refer back to in the future, for our future selves!
@gwaygenomics - So I am looking at both the normalized_feature_select_DMSO and normalized_feature_select profiles.
whitened_df = normalize(
    normalized_feature_selected_df,  # For each of the two level 4b profiles
    features="infer",
    meta_features="infer",
    samples="Metadata_broad_sample == 'DMSO'",  # This is the key arg to learn the whiten transform using only DMSO
    method="whiten",
    whiten_center=False,
    whiten_method="ZCA-cor",
)
whitened_df = normalize(
    normalized_feature_selected_df,  # For each of the two level 4b profiles
    features="infer",
    meta_features="infer",
    samples="all",
    method="whiten",
    whiten_center=True,
    whiten_method="ZCA-cor",
)
So I am looking at both the normalized_feature_select_DMSO and normalized_feature_select profiles.
Great, this is exactly what we want to ultimately do.
My question is, which profiles are you using to generate the heatmap? In each of the two files (normalized_feature_select_DMSO and normalized_feature_select profiles) there are 384 profiles. Only a small portion of them (~20 I think) are treated with DMSO (negative control). We should be building the heatmap with the two profiles subset to only DMSO treatment wells when using normalize() with samples="Metadata_broad_sample == 'DMSO'".
Does this make sense?
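To illustrate why the heatmap should be restricted to DMSO wells, here is a small NumPy sketch that compares the largest off-diagonal correlation in the control subset against the full plate. The data are synthetic, the helper `max_offdiag_corr` is hypothetical, and Cholesky whitening is used only as a compact stand-in for ZCA-cor.

```python
import numpy as np


def max_offdiag_corr(X):
    """Largest absolute off-diagonal entry of the feature correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    off_diagonal = corr - np.diag(np.diag(corr))
    return float(np.abs(off_diagonal).max())


rng = np.random.default_rng(2)
controls = rng.normal(size=(24, 3))
# Treated wells share one strong common factor, so their features co-vary.
factor = rng.normal(size=(360, 1))
treated = 3.0 * factor @ np.ones((1, 3)) + 0.1 * rng.normal(size=(360, 3))
plate = np.vstack([controls, treated])
is_dmso = np.arange(plate.shape[0]) < 24

# Learn the transform from control wells only (Cholesky whitening stand-in).
mu = plate[is_dmso].mean(axis=0)
L = np.linalg.cholesky(np.cov(plate[is_dmso], rowvar=False))
whitened = (plate - mu) @ np.linalg.inv(L).T

dmso_only = max_offdiag_corr(whitened[is_dmso])
full_plate = max_offdiag_corr(whitened)
print(dmso_only, full_plate)
```

The control-only value is numerically zero because the transform was learned from exactly those wells. The full-plate value stays large because treated wells carry correlation structure that the control-derived transform was never asked to remove.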
@gwaygenomics Yes, I think I now understand your question. There are 24 wells in each profile that are treated with DMSO. These 24 DMSO-treated wells have the same correlation results as the samples="Metadata_broad_sample == 'DMSO'"
Note that we should also update pycytominer version #53 and that we are rebranding whiten to spherize (they are synonyms) (see cytomining/pycytominer#102)
I added a first pass spherize implementation for batch 1 and batch 2 data in #60
The profiles deposited in #34 do not include whitening normalization. Previously, (see https://github.com/broadinstitute/lincs-cell-painting/issues/4#issuecomment-620839509) I elected to leave the whitened data to a future data upload because of this caveat:
@shntnu also notes in https://github.com/broadinstitute/lincs-cell-painting/issues/4#issuecomment-608545838