Open alxndrkalinin opened 1 year ago
Update 5/31: results and visualizations were re-generated after removing some data after QC and making sure index isn't used as a feature when calculating metrics (see #16).
Overall, the conclusions hold: cell count adjustment seems to help, while subtracting well mean on top of that can hurt the replicate retrieval performance.
cc @shntnu @AnneCarpenter
🎉 Thanks for getting that done lightning fast! Can we visualize what the pattern looks like for the mean in each well position? I’m curious if it’s the left right pattern.
It’s surprising to me that this correction worsens outcomes!
Here are visualizations of mean values across the whole ORF dataset, for `Cells_AreaShape_Area`:

[Figure panels: Raw profiles; Cell count adjusted; Well position corrected; CC adjust + well pos correct]
for `Cells_Intensity_MeanIntensity_AGP`:

[Figure panels: Raw profiles; Cell count adjusted; Well position corrected; CC adjust + well pos correct]
Additionally, there are visualizations for `Image_Threshold_SumOfEntropies_CellsIncludingEdges`, which originally had high correlation with cell count:

[Figure panels: Raw profiles; Cell count adjusted; Well position corrected; CC adjust + well pos correct]
Questions from Arnaud
For the first Q: we've indeed been using a downstream evaluation: the ability to retrieve the same ORF reagent in a different well position (and usually a different batch), which we want to be high, and the ability to retrieve a different ORF in the same well position, which we want to be low. So far our results look bad and confusing, and we are trying to figure out what is going on! Perhaps @alxndrkalinin can address your other two Qs.
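The retrieval idea can be illustrated with a small sketch. This is not the evaluation code used in this thread: it assumes cosine similarity as the ranking score and plain average precision per query, and the function name `average_precision`, the profile matrix, and the labels are made up for illustration.

```python
import numpy as np


def average_precision(query_idx, profiles, labels):
    """AP for one query profile: rank all other profiles by cosine
    similarity and score how early the same-label profiles appear."""
    unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    # Exclude the query itself from the ranked list.
    others = np.array([i for i in range(len(labels)) if i != query_idx])
    order = others[np.argsort(-sims[others])]
    hits = np.array([labels[i] == labels[query_idx] for i in order])
    if not hits.any():
        return float("nan")
    ranks = np.flatnonzero(hits) + 1  # 1-based ranks of the matches
    precision_at_hits = np.arange(1, len(ranks) + 1) / ranks
    return float(precision_at_hits.mean())


# Toy data (hypothetical): two ORFs with two replicate profiles each.
profiles = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])
labels = ["ORF_A", "ORF_A", "ORF_B", "ORF_B"]
ap_values = [average_precision(i, profiles, labels) for i in range(len(labels))]
mean_ap = float(np.mean(ap_values))
```

Swapping the label for a well-position label would give the "same well, different ORF" variant, where a low value is the desired outcome.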
- Also I am surprised that this step isn’t done after normalization (any reasons?).
We argued that regression should be done as early as possible in the processing pipeline to capture relationships between variables. Adjusting for plate-to-plate variation before regression does not seem desirable (though we have not thought this through too deeply).
On Alex's list is to expand the evaluation from the 37 duplicated ORF reagents (because who knows, perhaps there is something systematically weird about these reagents) to same-gene ORF reagents and even to same-GO-term reagents. It occurred to me that perhaps we ought to be using MOA matching in Target2 plates in each batch as another evaluation? It's compounds rather than ORFs but many technical variations/artifacts should have affected them the same (except the initial steps of virus production, etc)
Update 6/6
We have previously said that `RobustMAD` normalizes per-plate. I just checked and it does not seem to be the case. Is that something we want to reconsider or is it not super important? (cc @niranjchandrasekaran @johnarevalo who're also using it)
After we discussed normalization, I updated plate visualizations of feature means above using normalized data. I also made additional per-position visualizations.
We discussed with @shntnu that cell count adjustment seems to mostly reduce the top-down effect, but not the left-right one.

[Figure: cell counts. Panels: Raw profiles; Well position corrected]
Since features are comparable when normalized, I also made a plot for all features averaged per well position:

[Figure: all features. Panels: Raw profiles; Cell count adjusted; Position corrected; CC adjusted and position corrected]

I also made a similar visualization per feature type:
Given that there aren't many more unique gene symbols vs ORFs, plots and metrics for `Same Well – Different ORF` & `Same Well – Same ORF` remain the same as above.
Setting | Data | mmAP | Fraction retrieved (p<0.05) |
---|---|---|---|
same gene, diff well | raw->subset | 0.0415 | 0.273 (72/264) |
same gene, diff well | raw->subset->cc adjust | 0.0262 | 0.121 (32/264) |
same gene, diff well | raw->subset->well correct | 0.0367 | 0.284 (75/264) |
same gene, diff well | raw->subset->cc adjust->well correct | 0.0278 | 0.152 (40/264) |
Setting | Data | mmAP | Fraction retrieved (p<0.05) |
---|---|---|---|
same gene, diff well | raw->subset | 0.0475 | 0.333 (88/264) |
same gene, diff well | raw->subset->cc adjust | 0.0311 | 0.155 (41/264) |
same gene, diff well | raw->subset->well correct | 0.0432 | 0.33 (87/264) |
same gene, diff well | raw->subset->cc adjust->well correct | 0.0304 | 0.152 (40/264) |
> We have previously said that RobustMAD normalizes per-plate. I just checked and it does not seem to be the case.
@alxndrkalinin Do you mean the code doesn't separate the profiles by plates? If so, it is true that pycytominer does not normalize by plate in the typical profiling workflow. We separate the profiles by plate before doing `RobustMAD` normalization.
Update 6/12
I changed pre-processing to perform normalization per-plate and recalculated all downstream results. Note that doing so resulted in some `Inf` values after `RobustMAD` because some `Cytoplasm` features have `MAD=0` and our default `epsilon=0.0`. For plate visualizations, I removed those features, which should probably be done in preprocessing, so that's a TODO item.
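To make the `MAD=0` failure mode concrete, here is a minimal sketch of a robust z-score applied per plate. It is a stand-in written in plain NumPy, not pycytominer's `RobustMAD` implementation: the function name `robust_mad_per_plate`, the use of an unscaled MAD, the toy values, and the choice to flag non-finite features for removal are all assumptions for illustration.

```python
import numpy as np


def robust_mad_per_plate(values, plates, epsilon=0.0):
    """Robust z-score (x - median) / (MAD + epsilon), computed per plate.

    values: (n_wells, n_features) array; plates: (n_wells,) plate labels.
    """
    values = np.asarray(values, dtype=float)
    plates = np.asarray(plates)
    out = np.empty_like(values)
    for plate in np.unique(plates):
        rows = plates == plate
        block = values[rows]
        median = np.median(block, axis=0)
        mad = np.median(np.abs(block - median), axis=0)
        # With epsilon=0.0, a feature whose MAD is 0 on this plate divides by zero.
        with np.errstate(divide="ignore", invalid="ignore"):
            out[rows] = (block - median) / (mad + epsilon)
    return out


# Toy data (hypothetical): feature 1 is almost constant on plate "P1", so its MAD is 0.
values = np.array([
    [1.0, 5.0],
    [2.0, 5.0],
    [4.0, 5.0],
    [8.0, 9.0],
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 7.0],
    [7.0, 8.0],
])
plates = np.array(["P1"] * 4 + ["P2"] * 4)

normalized = robust_mad_per_plate(values, plates, epsilon=0.0)

# Features with any non-finite value after normalization are candidates for removal.
bad_features = np.flatnonzero(~np.isfinite(normalized).all(axis=0))
```

A small positive `epsilon` avoids the non-finite values but turns them into very large ones, so dropping such features in preprocessing may still be the cleaner option.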
I updated plate visualizations of feature means using per-plate normalized data.
Abbreviations: WDN, whole dataset normalization; PPN-IM, per-plate normalization with image features; PPN-NI, per-plate normalization without image features; FR, fraction retrieved.
Setting | Data | mmAP WDN | FR WDN (p<0.05) | mmAP PPN-IM | FR PPN-IM (p<0.05) | mmAP PPN-NI | FR PPN-NI (p<0.05) |
---|---|---|---|---|---|---|---|
same well, diff ORF | raw->subset | 0.0636 | 0.139 | 0.148 | 0.899 | 0.164 | 0.938 |
same well, diff ORF | raw->subset->cc adjust | 0.0583 | 0.0217 | 0.0919 | 0.543 | 0.0974 | 0.625 |
same well, diff ORF | raw->subset->well correct | 0.114 | 0.25 | 0.241 | 0.864 | 0.184 | 0.69 |
same well, diff ORF | raw->subset->cc adjust->well correct | 0.379 | 0.723 | 0.391 | 0.959 | 0.362 | 0.886 |
Setting | Data | mmAP WDN | FR WDN (p<0.05) | mmAP PPN-IM | FR PPN-IM (p<0.05) |
---|---|---|---|---|---|
same ORF, different well | raw->subset | 0.00974 | 0.0 (0/37) | 0.0341 | 0.0811 (3/37) |
same ORF, different well | raw->subset->cc adjust | 0.0202 | 0.027 (1/37) | 0.0314 | 0.0541 (2/37) |
same ORF, different well | raw->subset->well correct | 0.0166 | 0.027 (1/37) | 0.0505 | 0.108 (4/37) |
same ORF, different well | raw->subset->cc adjust->well correct | 0.00834 | 0.0 (0/37) | 0.0274 | 0.027 (1/37) |
Setting | Data | mmAP WDN | FR WDN (p<0.05) | mmAP PPN-IM | FR PPN-IM (p<0.05) | mmAP PPN-NI | FR PPN-NI (p<0.05) |
---|---|---|---|---|---|---|---|
same gene, different well | raw->subset | 0.0415 | 0.273 | 0.027 | 0.102 | 0.0351 | 0.121 |
same gene, different well | raw->subset->cc adjust | 0.0262 | 0.121 | 0.0233 | 0.053 | 0.0257 | 0.0682 |
same gene, different well | raw->subset->well correct | 0.0367 | 0.284 | 0.027 | 0.0606 | 0.0307 | 0.072 |
same gene, different well | raw->subset->cc adjust->well correct | 0.0278 | 0.152 | 0.0225 | 0.0758 | 0.0225 | 0.0758 |
Setting | Data | mmAP WDN | FR WDN (p<0.05) | mmAP PPN-IM | FR PPN-IM (p<0.05) | mmAP PPN-NI | FR PPN-NI (p<0.05) |
---|---|---|---|---|---|---|---|
same ORF, same well | raw->subset | 0.195 | 0.903 | 0.211 | 0.778 | 0.232 | 0.79 |
same ORF, same well | raw->subset->cc adjust | 0.0856 | 0.417 | 0.0863 | 0.346 | 0.102 | 0.405 |
same ORF, same well | raw->subset->well correct | 0.286 | 0.93 | 0.307 | 0.857 | 0.268 | 0.794 |
same ORF, same well | raw->subset->cc adjust->well correct | 0.538 | 0.989 | 0.412 | 0.929 | 0.396 | 0.903 |
Other experiments included PCA with and w/o image features and Cosine kernel PCA, but did not show improvement over the results above.
In the 3.2v1 `Same ORF, different well (higher is better, N=37)` scenario, all 37 ORFs have 5x same-well replicates in one batch and 5x same-well replicates in another.
In the 3.2v2 `Same gene, different well (higher is better, N=264)` scenario, by contrast, about 34% of genes (89/264) come from a single batch and 1% (3/264) are from 3 batches:
# unique batches | # unique gene symbols | mmAP | Fraction retrieved (p<0.05) |
---|---|---|---|
2 | 172 | 0.0254 | 0.038 (10 / 264) |
1 | 89 | 0.0300 | 0.06 (16 / 264) |
3 | 3 | 0.0256 | 0.004 (1 / 264) |
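A breakdown like the one above starts from counting how many distinct batches each gene symbol appears in. The following is a minimal sketch of that counting step only, using made-up (gene symbol, batch) pairs; the names and values are hypothetical and not taken from the dataset.

```python
from collections import Counter, defaultdict

# Toy (gene symbol, batch) pairs, one per well (hypothetical).
wells = [
    ("GENE1", "Batch1"), ("GENE1", "Batch2"),
    ("GENE2", "Batch1"), ("GENE2", "Batch1"),
    ("GENE3", "Batch1"), ("GENE3", "Batch2"), ("GENE3", "Batch3"),
]

# Collect the set of batches in which each gene symbol occurs.
batches_per_gene = defaultdict(set)
for gene, batch in wells:
    batches_per_gene[gene].add(batch)

# Map "number of unique batches" to "number of gene symbols".
genes_by_batch_count = Counter(len(b) for b in batches_per_gene.values())
```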
(DRAFT)
Testing DINO features here: https://github.com/jump-cellpainting/morphmap/issues/91#issuecomment-1595712135
1. Motivation
Exploratory visual QC (#7) and retrievability metrics (#12) analyses showed that: (1) there are patterns in cell count variation across well positions / plates / batches, and (2) this variation is related to the ability to retrieve ORF replicates, i.e. ORFs with high cell count variability tend to have lower mAP values.
2. Approach
To address that, we explored regressing out cell counts from other features and recalculating retrievability metrics to measure the effect of this correction. As the first step, we added cell count as a feature by aggregating all of the metadata early in the preprocessing pipeline (d91cbd5). Then, for each feature, we fit a linear model that predicts the feature from cell count, and replace the actual feature values with the residuals from this model.
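As a rough illustration of this step, the sketch below reads "regressing out" in the usual sense: each feature is modeled as a linear function of cell count and replaced by its residuals. It is not the code from the repository; the function name `regress_out_cell_count` and the simulated data are assumptions.

```python
import numpy as np


def regress_out_cell_count(features, cell_count):
    """Replace each feature column with the residuals of
    feature ~ intercept + slope * cell_count (ordinary least squares)."""
    features = np.asarray(features, dtype=float)
    cell_count = np.asarray(cell_count, dtype=float)
    # Design matrix with an intercept column and the cell count.
    design = np.column_stack([np.ones_like(cell_count), cell_count])
    # One least-squares fit covers all feature columns at once.
    coef, *_ = np.linalg.lstsq(design, features, rcond=None)
    return features - design @ coef


# Simulated data (hypothetical): one feature driven by cell count, one unrelated.
rng = np.random.default_rng(0)
cell_count = rng.integers(50, 500, size=200).astype(float)
feature_dependent = 2.0 * cell_count + rng.normal(0, 10, size=200)
feature_independent = rng.normal(0, 1, size=200)
features = np.column_stack([feature_dependent, feature_independent])

residuals = regress_out_cell_count(features, cell_count)
corr_before = np.corrcoef(features[:, 0], cell_count)[0, 1]
corr_after = np.corrcoef(residuals[:, 0], cell_count)[0, 1]
```

By construction, least-squares residuals are linearly uncorrelated with cell count, so `corr_after` is zero up to floating-point error; any nonlinear dependence would remain.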
2.1 Constant and low count features
Because plate effect correction is the first step in the preprocessing pipeline, all features are present in the dataset, including those that have constant values across all samples (e.g. min/max intensity value can be 0/65535). When fitting a linear model using these features, the resulting residuals are not exactly zero due to rounding. Instead, they're equal to some small numbers, which can correlate well with cell count, producing the opposite of the desired effect.
Effects of regressing out cell count on a constant feature
| Before | After |
| ------------- | ------------- |
| ![before](https://github.com/broadinstitute/position-effect-correction/assets/1107762/d0c4420e-da7b-4c31-8ebd-7d5720462975) | ![after](https://github.com/broadinstitute/position-effect-correction/assets/1107762/3f9e76d9-8bf5-45e4-8abf-ec8f9c2f4a20) |

We visualized the number of unique values per feature vs correlation to cell count to confirm that no features with fewer than a few hundred unique values have high correlations with cell count. Based on this result, we only regress out cell count from features that have more than 100 unique values. One idea we did not explore is whether it'd help to not regress cell count from features that are not highly correlated with cell count in the first place.
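The unique-value threshold can be sketched as a guard around the regression. This is an illustrative version only: the function name `regress_out_if_varied`, the parameter `min_unique`, and the simulated data are assumptions, with the default of 100 taken from the threshold mentioned above.

```python
import numpy as np


def regress_out_if_varied(features, cell_count, min_unique=100):
    """Regress cell count out of columns with more than `min_unique`
    unique values; leave all other columns untouched."""
    features = np.asarray(features, dtype=float)
    cell_count = np.asarray(cell_count, dtype=float)
    n_unique = np.array([np.unique(col).size for col in features.T])
    eligible = n_unique > min_unique
    design = np.column_stack([np.ones_like(cell_count), cell_count])
    adjusted = features.copy()
    if eligible.any():
        coef, *_ = np.linalg.lstsq(design, features[:, eligible], rcond=None)
        adjusted[:, eligible] = features[:, eligible] - design @ coef
    return adjusted, eligible


# Simulated data (hypothetical): a continuous feature and a constant one.
rng = np.random.default_rng(0)
n = 500
cell_count = rng.integers(50, 500, size=n).astype(float)
continuous = 0.5 * cell_count + rng.normal(0, 5, size=n)  # many unique values
constant = np.full(n, 65535.0)                             # a single unique value
features = np.column_stack([continuous, constant])

adjusted, eligible = regress_out_if_varied(features, cell_count)
```

The constant column passes through unchanged, so no rounding-level residuals are created for it.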
Visualizing # of unique feature values vs cell count
![unique_vs_cc](https://github.com/broadinstitute/position-effect-correction/assets/1107762/57b053d3-f4bb-4b27-a307-2e7713d699c1)

2.2 Adding cell count back as a feature
After regressing out cell count, we can add cell count back as a separate feature. However, we found that it is later filtered out at the `feature_select` step of the pipeline. The reason is that, as an integer count feature, cell count has a unique values / sample size ratio of ~0.06 (see visualization below), which is below the cutoff value `unique_cut=0.1` that is used as one of the criteria to filter out low-variance features in pycytominer. It turns out that earlier versions of pycytominer had a more relaxed cutoff value of `0.01`, which was later replaced by `0.1`, probably because of a typo (see cytomining/pycytominer#282). To prevent cell count from being removed by this criterion, we use `feature_selection` with `unique_cut=0.01`, as per the original pycytominer default value. This results in a different number of features selected from any subset, so we reran preprocessing for all uncorrected and cc-adjusted subsets.

Cell count unique values / sample size ratio

![unique_size_ratio](https://github.com/broadinstitute/position-effect-correction/assets/1107762/37a4cbca-3906-486e-afc5-1fd49d81b491)

3. Results
3.1 Same well, different ORF
Same well, different ORF plots
![same_well_diff_orf](https://github.com/broadinstitute/position-effect-correction/blob/8ab340f05949c0ac58798dbe896a174defe70d33/3.correct/output/mAP_visualizations/cell_count_adjusted/same_well_diff_pert.png?raw=true)

3.2 Same ORF, different well
Same ORF, different well plots
![same_orf_diff_well](https://github.com/broadinstitute/position-effect-correction/blob/8ab340f05949c0ac58798dbe896a174defe70d33/3.correct/output/mAP_visualizations/cell_count_adjusted/same_pert_diff_well.png?raw=true)

3.3 Same ORF, same well
Same ORF, same well plots
![same_orf_same_well](https://github.com/broadinstitute/position-effect-correction/blob/8ab340f05949c0ac58798dbe896a174defe70d33/3.correct/output/mAP_visualizations/cell_count_adjusted/same_well_same_pert.png?raw=true)

Observations: