broadinstitute / lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
BSD 3-Clause "New" or "Revised" License
Add whitening normalization to this repo #38

Closed gwaybio closed 3 years ago

gwaybio commented 4 years ago

The profiles deposited in #34 do not include whitening normalization. Previously (see https://github.com/broadinstitute/lincs-cell-painting/issues/4#issuecomment-620839509), I elected to defer the whitened data to a future data upload because of this caveat:

Pycytominer currently does have a whiten implementation, and I applied it to the two 4a profiles in a test case. The test case did not go smoothly, so it is likely I will need to tinker with the pycytominer implementation a bit (hard to estimate how long the delay will be).

@shntnu also notes in https://github.com/broadinstitute/lincs-cell-painting/issues/4#issuecomment-608545838

Going forward, we will very likely produce at least two different Level 4a profiles

  • whole-well z-scored
  • DMSO z-scored

because depending on the layout, one might be better than the other.

We will then produce corresponding 4b (normalized feature selected) versions of the two 4a profiles. We will also produce corresponding 4w (normalized and whitened) versions of the two 4a profiles.
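To make the two level 4a variants concrete, here is a minimal pandas sketch, not the pipeline's actual code. It assumes a plain mean/standard-deviation z-score; the helper name `zscore`, the toy values, and the compound IDs are hypothetical.

```python
import pandas as pd


def zscore(df, feature_cols, reference_query=None):
    """Z-score feature columns using all rows, or only rows matching a query.

    Hypothetical helper for illustration; the center and scale are learned
    from the reference rows and then applied to every row.
    """
    reference = df if reference_query is None else df.query(reference_query)
    center = reference[feature_cols].mean()
    scale = reference[feature_cols].std()
    out = df.copy()
    out[feature_cols] = (df[feature_cols] - center) / scale
    return out


# Toy plate: two DMSO wells and two compound wells (made-up values)
plate_df = pd.DataFrame(
    {
        "Metadata_broad_sample": ["DMSO", "DMSO", "BRD-A", "BRD-B"],
        "Cells_AreaShape_Area": [10.0, 12.0, 20.0, 30.0],
    }
)
features = ["Cells_AreaShape_Area"]

# Whole-well z-score: statistics come from every well on the plate
whole_well = zscore(plate_df, features)

# DMSO z-score: statistics come from the DMSO wells only
dmso_based = zscore(plate_df, features, "Metadata_broad_sample == 'DMSO'")
```

In the whole-well version the plate as a whole has mean zero; in the DMSO version only the negative controls do, so compound wells are expressed as deviations from DMSO.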

gwaybio commented 4 years ago

Whitening has been fixed in pycytominer version

AdeboyeML commented 4 years ago

@shntnu @gwaygenomics - What whitening method should we use for the normalization of the profiles?

gwaybio commented 4 years ago

getting this error "Divide by zero error, make sure low variance columns are removed", all columns with zero (0.0) values have to be dropped prior to whitening.

@AdeboyeML and I walked through this issue yesterday. The error is raised here. As I was writing the whitening methods, I noticed that the transformation fails when there are low variance features present.

Decision

Because of this error, let's form the whitening profiles using level 4b data instead of level 4a data (description of data levels).
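A small NumPy sketch of why low-variance features are a problem, under the assumption that whitening needs the inverse square root of the covariance (or correlation) eigenvalues. This is not pycytominer's code; the variance threshold is a hypothetical value chosen for the toy data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy profile matrix: 24 wells x 3 features, with one constant feature
X = rng.normal(size=(24, 3))
X[:, 2] = 0.0  # zero-variance column

# Whitening rescales by 1 / sqrt(eigenvalue); a zero eigenvalue makes that blow up
cov = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)
print("smallest eigenvalue with constant feature:", eigvals.min())

# Dropping low-variance columns first (roughly what feature selection does
# for level 4b) leaves a covariance matrix that can be inverted
variance_threshold = 1e-8  # hypothetical cutoff for this toy example
keep = X.var(axis=0) > variance_threshold
X_kept = X[:, keep]
kept_eigvals = np.linalg.eigvalsh(np.cov(X_kept, rowvar=False))
print("smallest eigenvalue after filtering:", kept_eigvals.min())
```

Level 4b profiles have already been through feature selection, which is why starting from them avoids the error.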

gwaybio commented 4 years ago

Pending

What whitening method should we use for the normalization of the profiles?

This is a two-part question (the answer is probably the same for both parts).

  1. What whitening method should we use in this repo?
  2. What whitening method should we use as a reasonable default in pycytominer? (see cytomining/pycytominer#96)

@niranjchandrasekaran - I know you've done extensive testing on whitening variations. I also have UMAP profiles from one plate transformed using the different strategies (see below). Do you have a strong recommendation?

UMAP Coordinates of Four Whitening Methods

test_whitening.pdf

Click to see code that generated the pdf of figures

```python
import umap
import pandas as pd
import plotnine as gg

from pycytominer import normalize
from pycytominer.cyto_utils import infer_cp_features

# Load data
commit = "da8ae6a3bc103346095d61b4ee02f08fc85a5d98"
batch = "2016_04_01_a549_48hr_batch1"
plate = "SQ00014812"
profile_file = f"{plate}_normalized_feature_select.csv.gz"
base_url = "https://github.com/broadinstitute/lincs-cell-painting/raw/"
url = f"{base_url}{commit}/profiles/{batch}/{plate}/{profile_file}"

df = pd.read_csv(url)

# Apply transformations, UMAP transform, and plot
plotlist = []
for method in ["PCA", "ZCA", "PCA-cor", "ZCA-cor", "mad_robustize"]:
    for dmso_norm in [True, False]:
        if dmso_norm:
            samples = "Metadata_broad_sample == 'DMSO'"
            label = "DMSO normalized"
        else:
            samples = "all"
            label = "All samples normalized"

        if method == "mad_robustize":
            transform = "mad_robustize"
            label = f"MAD Robustize\n{label}"
        else:
            transform = "whiten"
            label = f"{method} Whitening\n{label}"

        normalize_df = normalize(
            df,
            features="infer",
            meta_features="infer",
            samples=samples,
            method=transform,
            output_file="none",
            compression=None,
            float_format=None,
            whiten_center=False,
            whiten_method=method
        )

        cp_features = infer_cp_features(normalize_df)
        meta_features = infer_cp_features(normalize_df, metadata=True)

        # Apply UMAP
        reducer = umap.UMAP(random_state=123)
        embedding_df = reducer.fit_transform(normalize_df.loc[:, cp_features])

        embedding_df = pd.DataFrame(embedding_df)
        embedding_df.columns = ["x", "y"]
        embedding_df = pd.concat(
            [
                normalize_df.loc[:, meta_features],
                embedding_df
            ],
            axis="columns"
        )

        embedding_df = embedding_df.assign(dmso_label="DMSO")
        embedding_df.loc[embedding_df.Metadata_broad_sample != "DMSO", "dmso_label"] = "compound"

        embedding_gg = (
            gg.ggplot(embedding_df, gg.aes(x="x", y="y")) +
            gg.geom_point(gg.aes(size="Metadata_mg_per_ml", color="Metadata_broad_sample"), alpha=0.5) +
            gg.facet_grid("~dmso_label") +
            gg.ggtitle(label) +
            gg.theme_bw() +
            gg.theme(legend_position="none", strip_background=gg.element_rect(colour="black", fill="#fdfff4"))
        )

        plotlist.append(embedding_gg)

gg.save_as_pdf_pages(plotlist, "test_whitening.pdf")
```

Based on these qualitative results, I think we should definitely normalize using DMSO profiles. The choice among the whitening methods (PCA, PCA-cor, ZCA, ZCA-cor) is less clear.

niranjchandrasekaran commented 4 years ago

@gwaygenomics ZCA-cor has been my go-to method in the JUMP-CP pilots. DMSO-based standardization has also worked quite well. PCA-based whitening hasn't worked well in my hands, though it may have something to do with how it was applied to the data (plate-wise or platemap-wise or all plates).

Based on my experience with the pilots, I would say that either DMSO-based standardization or ZCA-cor can be offered as the default method in pycytominer.

gwaybio commented 4 years ago

Thanks @niranjchandrasekaran

it may have something to do with how it was applied to the data (plate-wise or platemap-wise or all plates).

Interesting! We are planning on doing plate-wise whitening - I don't see a benefit of platemap-wise normalization, but perhaps I am missing a key piece.

Based on my experience with the pilots, I would say that either DMSO-based standardization or ZCA-cor can be offered as the default method in pycytominer.

Cool. I believe ZCA-cor = good performance and PCA = bad performance is also what @AdeboyeML observed. We should also apply ZCA-cor using only DMSO profiles in the lincs dataset. Like this:

whitened_df = normalize(
    normalized_feature_selected_df,  # For each of the two level 4b profiles
    features="infer",
    meta_features="infer",
    samples="Metadata_broad_sample == 'DMSO'",  # This is the key arg to learn the whiten transform using only DMSO
    method="whiten",
    whiten_center=False,
    whiten_method="ZCA-cor"
)

gwaybio commented 4 years ago

We decided today at profiling checkin that ZCA-cor against DMSO profiles per-plate is the way to go
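For reference, a minimal NumPy sketch of what "ZCA-cor against DMSO profiles" means for a single plate: learn the transform on the DMSO wells, then apply it to every well. This is an illustration under stated assumptions, not pycytominer's implementation. Unlike the `whiten_center=False` call above, it centers on the DMSO mean for simplicity, and the function name `fit_zca_cor` and the toy data are hypothetical.

```python
import numpy as np


def fit_zca_cor(reference):
    """Learn a ZCA-cor whitening transform from reference rows (wells x features).

    Returns (mean, W) such that (X - mean) @ W.T has identity covariance when
    X is the reference set. Assumes more reference rows than features and no
    zero-variance features, so the correlation matrix is invertible.
    """
    mean = reference.mean(axis=0)
    std = reference.std(axis=0, ddof=1)
    corr = np.corrcoef(reference, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    # Inverse square root of the correlation matrix: G * Theta^(-1/2) * G^T
    corr_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    # ZCA-cor: scale each feature by 1/std, then decorrelate
    W = corr_inv_sqrt @ np.diag(1.0 / std)
    return mean, W


# Toy plate: 24 DMSO wells and 40 compound wells with correlated features
rng = np.random.default_rng(0)
mixing = np.array(
    [
        [1.0, 0.5, 0.0, 0.0],
        [0.0, 1.0, 0.5, 0.0],
        [0.0, 0.0, 1.0, 0.5],
        [0.0, 0.0, 0.0, 1.0],
    ]
)
dmso = rng.normal(size=(24, 4)) @ mixing
compound = rng.normal(loc=1.0, size=(40, 4)) @ mixing
plate = np.vstack([dmso, compound])
is_dmso = np.arange(len(plate)) < len(dmso)

# Learn the transform from DMSO wells only, then apply it to the whole plate
mean, W = fit_zca_cor(plate[is_dmso])
whitened = (plate - mean) @ W.T

dmso_cov = np.cov(whitened[is_dmso], rowvar=False)
plate_cov = np.cov(whitened, rowvar=False)
```

In this sketch `dmso_cov` is the identity up to floating point error, while `plate_cov` generally is not, because the transform was learned from the DMSO wells alone. Applying it per plate would mean repeating the fit for each plate's own DMSO wells.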

AdeboyeML commented 4 years ago

@gwaygenomics Yes, ZCA-cor will be used as default for the whitening.

I think it is best to set samples="all" and whiten_center=True as the defaults.

gwaybio commented 4 years ago

@AdeboyeML - when you visualize the heatmaps, are you looking at only the DMSO profiles? We do not expect to see a decorrelated result in the full plate.

Also, can you post the resulting heatmap in this issue? It'll be great for our future selves to refer back to!

AdeboyeML commented 4 years ago

@gwaygenomics - So I am looking at both the normalized_feature_select_DMSO and normalized_feature_select profiles.

- After ZCA-cor Whitening -- using the below parameters:

whitened_df = normalize(
    normalized_feature_selected_df,  # For each of the two level 4b profiles
    features="infer",
    meta_features="infer",
    samples="Metadata_broad_sample == 'DMSO'",  # This is the key arg to learn the whiten transform using only DMSO
    method="whiten",
    whiten_center=False,
    whiten_method="ZCA-cor"
)

[Image: newplot (65), heatmap for the DMSO-only, whiten_center=False parameters above]

- After ZCA-cor Whitening -- using the below parameters:

whitened_df = normalize(
    normalized_feature_selected_df,  # For each of the two level 4b profiles
    features="infer",
    meta_features="infer",
    samples="all",
    method="whiten",
    whiten_center=True,
    whiten_method="ZCA-cor"
)

[Image: newplot (66), heatmap for the samples="all", whiten_center=True parameters above]

gwaybio commented 4 years ago

So I am looking at both the normalized_feature_select_DMSO and normalized_feature_select profiles.

great, this is exactly what we want to ultimately do.

My question is, which profiles are you using to generate the heatmap? Each of the two files (normalized_feature_select_DMSO and normalized_feature_select) contains 384 profiles. Only a small portion of them (~20 I think) are treated with DMSO (negative control). When using normalize() with samples="Metadata_broad_sample == 'DMSO'", we should build the heatmap from each file subset to only the DMSO-treated wells.

Does this make sense?

AdeboyeML commented 4 years ago

@gwaygenomics Yes, I think I now understand your question. There are 24 wells in each profile that are treated with DMSO. These 24 DMSO-treated wells give the same correlation results as with samples="Metadata_broad_sample == 'DMSO'".
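The check discussed here, restricting the heatmap to DMSO wells, can be sketched roughly as follows. The helper `max_offdiag_corr` and the toy data frame are hypothetical; a real check would run on the whitened plate profile instead.

```python
import numpy as np
import pandas as pd


def max_offdiag_corr(df, feature_cols, query=None):
    """Largest absolute off-diagonal feature correlation in (a subset of) df.

    Hypothetical helper: a one-number summary of what the heatmap shows.
    """
    subset = df if query is None else df.query(query)
    corr = subset[feature_cols].corr().to_numpy()
    off_diag = corr[~np.eye(len(feature_cols), dtype=bool)]
    return float(np.abs(off_diag).max())


# Toy data: DMSO features are uncorrelated, compound features move together
toy_df = pd.DataFrame(
    {
        "Metadata_broad_sample": ["DMSO"] * 4 + ["BRD-A"] * 4,
        "feature_1": [1.0, -1.0, 1.0, -1.0, 2.0, 3.0, 4.0, 5.0],
        "feature_2": [1.0, 1.0, -1.0, -1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
feature_cols = ["feature_1", "feature_2"]

# Heatmap input restricted to the wells the transform was learned from
dmso_only = max_offdiag_corr(toy_df, feature_cols, "Metadata_broad_sample == 'DMSO'")

# Heatmap input for the full plate, where decorrelation is not expected
full_plate = max_offdiag_corr(toy_df, feature_cols)
```

A value near zero for the DMSO subset with a larger value for the full plate matches the expectation stated earlier in the thread.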

gwaybio commented 4 years ago

Note that we should also update the pycytominer version (#53), and that we are rebranding whiten to spherize (they are synonyms; see cytomining/pycytominer#102).

gwaybio commented 3 years ago

I added a first pass spherize implementation for batch 1 and batch 2 data in #60

gwaybio commented 3 years ago

#60 is now merged