gwaybio closed this issue 3 years ago
Very cool analysis of the standard deviation with respect to performance; that looks very sensible and was a clever thing to look at.
But wow, we are definitely losing a lot of resolving power if the DMSOs are so spread out by plate layout effects alone. That definitely is worth fixing... I'm not saying it's necessary to fix for this project per se but we will use this data a lot and so it's definitely worth doing. The question is whether we have sufficient data to fix it and how we would actually carry this out. Whitening is one option for fixing batch to batch effects but not for plate layout effects. Will need to await Shantanu's advice on this one.
Neat analysis!
> every cell health model has higher standard deviation in compounds compared to DMSO consensus
Ideally s.d. for DMSO should be zero for all models, correct?
> Whitening is one option for fixing batch to batch effects but not for plate layout effects. Will need to await Shantanu's advice on this one.
Whitening can help with plate layout effects to some extent, except that we don't know why it works when it does :) See this slide
The heatmap shows how correlated each well position is across platemaps (i.e. same position, different compound correlations); whitening reduces this.
We will address the bigger question of whitening the repurposing data in a separate issue, but for this project, it may be too much of a hassle:
Note that both the Cell Health Cell Painting data as well as the Repurposing Cell Painting data will need to be whitened, together, to make this work. We need each experiment to have a sufficient number of DMSOs to do this. Cell Health has none, so we'd need to use the CRISPR controls as a proxy, and it would need a good deal of analysis to figure out whether that is a good proxy.
So in all, I think we should stick with the unwhitened data for this analysis.
The CPJUMP1 Pilot (JUMP-CP project) will help us figure out how to do this sort of joint whitening across different perturbation types effectively.
Why do the two datasets need to be whitened, together - wouldn't there be some benefit to doing one alone?
As to why whitening works to reduce plate layout effects: my guess is that it's because whatever the features are that make DMSOs look artificially unlike each other (across batches and from one position to the next) are the same as what make samples look unlike each other, and whitening suppresses those features' importance. But that sounds too simple - am I missing something?
> Why do the two datasets need to be whitened, together - wouldn't there be some benefit to doing one alone?
Whitening transforms the feature space in a way that makes it incompatible with the original feature space, so models cannot be transferred if only one is transformed.
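To make the incompatibility concrete, here is a minimal sketch on synthetic data (the features, weights, and the PCA-style whitening are all assumptions for illustration, not the project's actual pipeline). A linear model fit in the original feature space reproduces its targets there, but the same weights give different predictions once the inputs have been whitened.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: 500 profiles with 3 correlated features and a linear readout.
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 1.0]])
X = rng.normal(size=(500, 3)) @ mixing
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

# "Train" a linear model in the original feature space.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Whiten the same profiles (PCA whitening learned from X itself).
mu = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X - mu, rowvar=False))
X_white = (X - mu) @ (eigvecs / np.sqrt(eigvals))

pred_raw = X @ w            # model applied in the space it was trained in
pred_white = X_white @ w    # same weights, whitened inputs

err_raw = np.abs(pred_raw - y).max()
err_white = np.abs(pred_white - y).max()
print(f"max error, original space: {err_raw:.2e}")
print(f"max error, whitened space: {err_white:.2e}")
```

The weights only mean something relative to the feature axes they were learned on, so either both datasets go through the same transform or neither does.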
> As to why whitening works to reduce plate layout effects: my guess is that it's because whatever the features are that make DMSOs look artificially unlike each other (across batches and from one position to the next) are the same as what make samples look unlike each other, and whitening suppresses those features' importance. But that sounds too simple - am I missing something?
That is correct.
I should have phrased that as: we don't know under what conditions whitening helps reduce layout effects. For example, we had to completely exclude the edge wells first, and then do whitening for the Biogen Pilot 1; not doing so made whitening worsen the data quality.
Hm, I thought one could whiten one data set and then apply the learnings to the other dataset (like learning weights on one set and applying those weights to a second dataset).
> Hm, I thought one could whiten one data set and then apply the learnings to the other dataset (like learning weights on one set and applying those weights to a second dataset).
"Applying the learnings" in this case is in fact doing the whitening using the parameters learned in dataset 1.
Here's a bit of an oversimplification (my apologies if this is obvious!), but it captures the idea:
Let's say I wanted to z-score readouts in Plate 1, using DMSO as the reference. I'd compute the mean and s.d. of the DMSO wells, and then for each well in the plate, subtract the DMSO mean and divide by DMSO s.d.
Now let's say there's a Plate 2, which doesn't have DMSO (this is a CRISPR plate), and I want to "apply the learnings". Here this would mean subtracting the Plate 1 DMSO mean and dividing by the Plate 1 DMSO s.d.
This is not a wise thing to do because there may be plate-to-plate variations (let alone the fact that DMSO may not be a good reference with which to z-score CRISPR perturbations).
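The z-scoring example above can be sketched in a few lines. The readouts below are made-up numbers for a single feature, and the plate-wide offset of +5 on Plate 2 is an assumption chosen to make the problem visible.

```python
from statistics import mean, stdev

def zscore_against_reference(values, reference):
    """Z-score `values` using the mean and sample s.d. of `reference`."""
    mu = mean(reference)
    sd = stdev(reference)
    return [(v - mu) / sd for v in values]

# Hypothetical readouts for one feature (illustrative values only).
plate1_dmso = [9.0, 10.0, 11.0]    # DMSO wells on Plate 1: mean 10, s.d. 1
plate1_wells = [10.0, 12.0, 8.0]   # treated wells on Plate 1
# Plate 2 has no DMSO; same underlying signal plus a plate-wide shift of +5.
plate2_wells = [15.0, 17.0, 13.0]

plate1_z = zscore_against_reference(plate1_wells, plate1_dmso)
plate2_z = zscore_against_reference(plate2_wells, plate1_dmso)

print(plate1_z)  # [0.0, 2.0, -2.0]
print(plate2_z)  # [5.0, 7.0, 3.0]
```

With Plate 1's DMSO parameters, the plate-to-plate shift on Plate 2 is indistinguishable from a real effect: every well looks several s.d. away from the reference.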
Whitening is a multivariate way of doing z-scoring, where the "multivariate" aspect comes because we first transform the data into a PCA-like space, where the new features are orthogonal to each other.
And this is why it becomes even more murky when we try to apply the learnings directly to Plate 2: the PCA spaces of Plate 1 and Plate 2 will very likely not be aligned (which is why we need to do joint whitening, which essentially aligns them).
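A rough sketch of the misalignment, using synthetic two-feature data (the covariances, the `fit_whitening` and `apply_whitening` helpers, and the choice of PCA whitening are all assumptions for illustration). The "joint" variant here simply fits the transform on the pooled profiles, which is one plausible reading of joint whitening; the thread does not spell out the exact procedure.

```python
import numpy as np

def fit_whitening(reference, eps=1e-8):
    """Learn a PCA-whitening transform (mean and matrix) from reference profiles."""
    mu = reference.mean(axis=0)
    cov = np.cov(reference - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate onto the principal axes, then scale each axis to unit variance.
    W = eigvecs / np.sqrt(eigvals + eps)
    return mu, W

def apply_whitening(profiles, mu, W):
    """Apply a previously learned whitening transform."""
    return (profiles - mu) @ W

rng = np.random.default_rng(0)
# Synthetic plates whose two features are correlated in opposite directions.
plate1 = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=2000)
plate2 = rng.multivariate_normal([0, 0], [[1.0, -0.8], [-0.8, 1.0]], size=2000)

# Learn on Plate 1 only, then apply to both plates.
mu1, W1 = fit_whitening(plate1)
cov_p1 = np.cov(apply_whitening(plate1, mu1, W1), rowvar=False)
cov_p2 = np.cov(apply_whitening(plate2, mu1, W1), rowvar=False)

# Assumed "joint" variant: learn one transform from the pooled profiles.
pooled = np.vstack([plate1, plate2])
mu_j, W_j = fit_whitening(pooled)
cov_joint = np.cov(apply_whitening(pooled, mu_j, W_j), rowvar=False)

print(np.round(cov_p1, 2))     # close to the identity
print(np.round(cov_p2, 2))     # far from the identity
print(np.round(cov_joint, 2))  # close to the identity
```

Plate 1's transform stretches exactly the direction in which Plate 2 already varies most, so Plate 2 comes out less "white" than it went in. Fitting a single transform on data from both plates at least puts them in one shared space.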
In #83, I add code that performs this update.
Summary
There appear to be substantial plate effects in the Repurposing Hub data, at least in UMAP space and using consensus DMSO data.
DMSO Figure
There are at least three distinct groupings of DMSO controls.
DMSO by Well
The clustering appears to be driven by plate layout to some extent.
Standard Deviation of Cell Health model scores for DMSO and Compounds
Should we expect a certain amount of variation across cell health model output scores? As expected, every cell health model has higher standard deviation in compounds compared to DMSO consensus (except two models that output the same score no matter what). Also, the amount of variance in the model outputs is directly associated with the performance. Note that only models with test set R² > 0 were used in scaling. All models with test set R² < 0 are shown with more transparency.
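The comparison described above amounts to a per-model standard deviation within each group. A minimal sketch follows; the model names and scores are hypothetical placeholders, not values from this analysis.

```python
from statistics import stdev

# Hypothetical consensus scores per model (illustrative values only).
scores = {
    "model_a": {
        "dmso": [0.10, 0.12, 0.08, 0.11],
        "compound": [0.05, 0.40, -0.20, 0.30],
    },
    # A model that outputs the same score no matter what.
    "constant_model": {
        "dmso": [0.5, 0.5, 0.5, 0.5],
        "compound": [0.5, 0.5, 0.5, 0.5],
    },
}

def spread_by_group(scores):
    """Sample standard deviation of each model's scores, split by group."""
    return {
        model: {group: stdev(values) for group, values in groups.items()}
        for model, groups in scores.items()
    }

spread = spread_by_group(scores)
for model, groups in spread.items():
    print(model, {group: round(sd, 3) for group, sd in groups.items()})
```

A model whose outputs never change has zero spread in both groups, which is why such models are the exceptions to the compounds-above-DMSO pattern.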
Next Steps
cc @AnneCarpenter @shntnu @hkhawar