Whitening performed to correct batch affects

hkhawar commented 4 years ago

Z-normalization of features is performed and a total of 471 features are selected for downstream analysis

Plate cell densities

Metadata_Plate': {SQ00015142, 68708
                             SQ00015143, 63737
                             SQ00015144, 63796
                             SQ00015145, 66987
                             SQ00015201, 56739
                              }

UMAP clustering of single cell data clearly show batch affects among replicate plates

Modes_DMSO_UMAP

Whitening is applied on single cell data set and whitened data is plotted in an unsupervised way in UMAP space.

Whitening strategies

1- ZCA-sphering

1a - ZCA sphering applied to unnormalized features

UMAP_DMSO-whitened_cellprofiles_new_nonscaled_ZCA_randomstate_10_reduced_features

1b - ZCA sphering applied to z-normalized features

UMAP_DMSO-whitened_cellprofiles_new_scaled_ZCA_randomstate_10_reduced_features

ZCA sphering of z-normalized dataset gave best results in removing batch affects

PCA sphering applied to z-normalized features

UMAP_DMSO-whitened_cellprofiles_new_scaled_PCA_randomstate_10_reduced_features

gwaybio commented 4 years ago

Whitening approach has somewhat removed batch affects

I think "somewhat" is the key word :) I think that @sMyn42 can improve this with his Harmony implementation as described in #3

hkhawar commented 4 years ago

@shntnu @gwaygenomics Please have a look at the update whitening results. I am going to use 1b whitening strategy for later analysis

gwaybio commented 4 years ago

looks good @hkhawar - I recommend writing your code in such a way that we can easily drop in new single cell profiles in the future, for two reasons:

Anything we find with batch corrected data, I think we'd want to tell a story like "after batch correction we found insert finding here, which was obscured without applying batch correction". In other words, we'll want to run your code using uncorrected profiles as well.
We'll probably want to upgrade the batch effect correction approach. I am not sure how far @sMyn42 is coming along with the Harmony processing, but if he is unable to get to it, I will. In other words, let's think about this problem from two contexts A) DMSO cell state space heterogeneity and B) improving cell state space with batch effect correction tools

For 2B, we'll want to compare different tools (ZCA spherize, Harmony, unnormalized, etc.) so code should be modular.

hkhawar commented 4 years ago

@gwaygenomics Here is the code which have been used https://github.com/broadinstitute/2020-06-01-Evidence-of-state-switching-in-single-cell-drug-response-Broad/blob/master/Python_scripts/101420_switching_modes.ipynb

I will work on it so that can usable for single cell profiles in future. We definitely have heterogenous single cell dmso population + batch effects in this data. Definitely batch correction implementation removes batch effect. We need to apply some kind of test how much is the improvement for each method

gwaybio commented 4 years ago

Sounds great @hkhawar !

We need to apply some kind of test how much is the improvement for each method

Totally agree 💯 - @sMyn42 was also working on a method (kBET) to quantify the impact of batch, which we can apply to all forms of the profiles we generate (unnormalized and different batch corrected data). Alternatively, we can do something similar to what we saw in one of the cytodata presentations yesterday:

Apply PCA to all forms of the single cell data
Perform an ANOVA per components using plate as indicator and track F score per component. We'd like to see decreasing F scores in decreasing component numbers

hkhawar commented 4 years ago

@gwaygenomics sounds good.

shntnu commented 4 years ago

@hkhawar I wasn't able to get to this sooner, but looking forward to discussing tomorrow.

hkhawar commented 4 years ago

@shntnu

I have subtracted mean of DMSO cells from the feature matrix and then applying a whitening transformation on it.

Here below is the code

https://github.com/broadinstitute/2020-06-01-Evidence-of-state-switching-in-single-cell-drug-response-Broad/blob/master/Python_scripts/functions/whitening.py#L20-L35

While following below is Juan code which is similar to the one I use

https://github.com/broadinstitute/2020-06-01-Evidence-of-state-switching-in-single-cell-drug-response-Broad/blob/master/Python_scripts/functions/functions_utils.py#L104-L115

shntnu commented 4 years ago

@hkhawar The spherizing (aka whitening) transformation is good.

The issue is that you want to apply this at the aggregate level (well level) and then transfer those per-well transformations to the cells in that well.

So let's say you spherize 24 DMSO wells on a plate.

Now, say, the mean profile of well A01 was p before transformation and p' after transformation.

What I am proposing is that you add the vector (p'-p) to each single cell in A01. And then Repeat for all wells on the plate.

Before implementing, please check with Juan if he has tested this in LUAD already.

gwaybio commented 4 years ago

Now, say, the mean profile of well A01 was p before transformation and p' after transformation.

What I am proposing is that you add the vector (p'-p) to each single cell in A01.

Interesting! I think this approach sounds great. pycytominer will not be able to handle this specific procedure, but it should be fairly straightforward to implement with a few pycytominer hacks.

@hkhawar - I can add it to #6 if you think it makes sense. What is the plan for next steps?

hkhawar commented 4 years ago

@shntnu Now I got it what you actually meant. I will ask Juan about it.

hkhawar commented 4 years ago

@gwaygenomics I am not familiar using Pycytominer profiling pipeline. Is there any guideline to use it?

shntnu commented 4 years ago

@shntnu Now I got it what you actually meant. I will ask Juan about it.

Great, feel free to tag me in any conversation if needed, but pretty sure this will work.

gwaybio commented 4 years ago

@gwaygenomics I am not familiar using Pycytominer profiling pipeline. Is there any guideline to use it?

The documentation is still sparse. I have been working on it on and off for several months, but it still needs more love. I hope that people will be able to contribute to this specifically, at the next jamboree!

There is some minor documentation available here though: https://github.com/cytomining/pycytominer

Also, @hkhawar - once you think this is the right plan, I would be happy to contribute to it in #6 - I estimate that it will take me 1 hour to do.

hkhawar commented 4 years ago

@gwaygenomics Sounds good to me

hkhawar commented 4 years ago

@gwaygenomics @shntnu I checked with Juan and he told me that he didn't use this strategy. I have implemented this strategy and here below is the UMAP of single cells profiles from one plate.

Here are the steps which I followed

1) Created Aggregate well-level profiles P 2) Normalize and whitened transformed well-level profiles, which results in P' 3) For each well, computed residual profile which is subtraction between the transformed profile and the original profile: R = P' - P 4) add the residual profile of each well to single cell profiles in that well: single cell profiles + R

This separate out all 24 wells single cell profiles in UMAP space.

Here below are results with or without ZCA sphering

without Whitening

ZCA Whitening

Even the ZCA whitening approach it looks like it is trying to separate wells profiles in UMAP space. Only in unwhitened data we don't see this pattern.

shntnu commented 4 years ago

@hkhawar thank you for testing this out! Clearly, that approach produced the least useful result thus far but very insightful.

But I'm a bit confused about the non-spherized data

In https://github.com/broadinstitute/2020-06-01-Evidence-of-state-switching-in-single-cell-drug-response-Broad/issues/2#issuecomment-721280300, you have this plot

In https://github.com/broadinstitute/2020-06-01-Evidence-of-state-switching-in-single-cell-drug-response-Broad/issues/2#issue-701271357, you have this plot:

Should these two look similar (except the colors)?

Oh wait, I think I get it. The first plot is only of a single plate (all 24 wells), correct?

hkhawar commented 4 years ago

@shntnu I have used a single plate SQ00015201 (red datapoint in the bottom plot). In the top one I used colormap based on wells.

hkhawar commented 4 years ago

@shntnu Sorry I was wrong actually SQ00015142 are plotted as yellow dot. I will check it why it is so

shntnu commented 3 years ago

Thanks for clarifying. The fact that all these 5 have similar structure gives hope that there's a way to eventually align these data.

But to get you moving forward: it might be sufficient to look at just one plate and do all the analysis within the plate (because it appears that the well-position effect may not be as strong as we thought).

(1) Any concerns? (2) Before doing so, we'd want to make sure that the within-plate well position effect is not too large in the DMSO single-cell profiles. How should we decide if it is not too large?

shntnu commented 3 years ago

PS – I ran MANOVA on the data and (unsurprisingly) found differences among the DMSO single-cell profiles. This is not terribly useful, but at least confirms that well position effects are present. I spent exactly 5 minutes on this so I can't say this is a good path to go down, but documenting here in case we want to go back to this later.

Click for details

```r library(tidyverse) dmso <- read_csv("~/Desktop/SQ00015142_dmso_normalized.csv") dmso_sample <- dmso_sample %>% group_by(Metadata_Well) %>% sample_n(400) dmso_sample_well <- dmso %>% select(Metadata_Well) dmso_sample_mat <- dmso %>% select(matches("^Cells_|^Cytoplasm|^Nuclei")) %>% as.matrix() m <- manova(dmso_sample_mat ~ dmso_sample_well$Metadata_Well) summary(m) ``` ``` Df Pillai approx F num Df den Df Pr(>F) dmso_sample_well$Metadata_Well 23 2.3968 16.64 10971 1569290 < 2.2e-16 *** Residuals 68684 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ``` ```r m_shuffled <- manova(dmso_sample_mat ~ sample(dmso_sample_well$Metadata_Well)) summary(m_shuffled) ``` ``` Df Pillai approx F num Df den Df Pr(>F) sample(dmso_sample_well$Metadata_Well) 23 0.15798 0.98933 10971 1569290 0.7841 Residuals 68684 ```

hkhawar commented 3 years ago

Thanks for the analysis. I am currently working with harmony corrected data which Greg generated a day ago and we will see if we see any improvement using that approach

hkhawar commented 3 years ago

I ran a manova script for harmony corrected data and found that differences are still persisted among wells

gwaybio commented 3 years ago

dang... one question: Is this for one plate, or all five? If all five, then we probably should include a batch factor in the manova.

One possible explanation: maybe only one (or a few wells) is driving the significance. See the Df = 23 bit? We can do a tukey's posthoc test to figure that out. Also, the approx F is not high... which is good! I do wonder what the F score is like for the other normalization strategies too.

I do something similar to this in broadinstitute/profiling-resistance-mechanisms#91

hkhawar commented 3 years ago

@gwaygenomics here below are the results from implementing second strategy

UMAP is trained on DMSO single cell and transformation of compound treated single cells in to a DMSO space

SQ00015142_umap_all_compounds

Note: DMSO cells are in the last panel. All the conditions looks similar except for differences in cell densities are seen for some conditions

broadinstitute / 2020-06-01-Evidence-of-state-switching-in-single-cell-drug-response-Broad