Regressing out cell count

alxndrkalinin commented 1 year ago

1. Motivation

Exploratory visual QC (#7) and retrievability metrics (#12) analyses showed that: (1) there are patterns in cell count variation across well positions / plates / batches, and (2) this variation has a relationship with an ability to retrieve ORF replicate, i.e. ORFs with high cell count variability tend to have lower mAP values.

2. Approach

To address that, we explored regressing out cell counts from other features and recalculating the effect of this correction on retrievability metrics. As the first step, we added cell count as a feature by aggregating all of the metadata early in the preprocessing pipeline (d91cbd5). Then, for each feature, we fit a linear model to predict cell count from this feature, and replace actual feature values with residuals from this model.

2.1 Constant and low count features

Because plate effect correction is the first step in the preprocessing pipeline, all features are present in the dataset, including those that have constant values across all samples (e.g. min/max intensity value can be 0/65535). When fitting a linear model using these features, resulting residuals are not exactly zero due to rounding. Instead, they're equal to some small numbers, which can correlate well with cell count, producing the effect opposite to desired.

Effects of regressing out cell count on a constant feature

| Before | After | | ------------- | ------------- | | ![before](https://github.com/broadinstitute/position-effect-correction/assets/1107762/d0c4420e-da7b-4c31-8ebd-7d5720462975) | ![after](https://github.com/broadinstitute/position-effect-correction/assets/1107762/3f9e76d9-8bf5-45e4-8abf-ec8f9c2f4a20) |

We visualized the number of unique values per feature vs correlation to cell count to confirm that no features with less than a few hundred unique values have high correlations with cell count. Based on this result, we only regress out cell count from features that have more than 100 unique values. One idea we did not explore is whether it'd help to not regress cell count from features that are not highly correlated with cell count in the first place.

Visualizing # of unique feature values vs cell count

![unique_vs_cc](https://github.com/broadinstitute/position-effect-correction/assets/1107762/57b053d3-f4bb-4b27-a307-2e7713d699c1)

2.2 Adding cell count back as a feature

After regressing out cell count, we can add cell count as a separate feature. However, we found out that it is later filtered out at the feature_select step of the pipeline. The reason for that is that as a integer count feature, cell type has a unique values / sample size ratio ~0.06 (see visualization below), which is below the cutoff value unique_cut=0.1 that is used as one of the criteria to filter out low variance features in pycytominer. Turns out, earlier versions of pycytominer had a more relaxed cutoff value of 0.01, which later was replaced by 0.1, probably because of a typo (see cytomining/pycytominer#282). To prevent cell count being remove by this criterion, we use feature_selection with unique_cut=0.01, as per original pycytominer default value. This results in a different number of features selected from any subset, so we reran preprocessing for all uncorrected and cc-adjusted subsets.

Cell count unique values / sample size ratio

![unique_size_ratio](https://github.com/broadinstitute/position-effect-correction/assets/1107762/37a4cbca-3906-486e-afc5-1fd49d81b491)

3. Results

3.1 Same well, different ORF

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same well, diff ORF	raw->subset	0.0636	0.139 (51/368)
same well, diff ORF	raw->subset->cc adjust	0.0583	0.0217 (8/368)
same well, diff ORF	raw->subset->well correct	0.114	0.25 (92/368)
same well, diff ORF	raw->subset->cc adjust->well correct	0.379	0.723 (266/368)

Same well, different ORF plots

![same_well_diff_orf](https://github.com/broadinstitute/position-effect-correction/blob/8ab340f05949c0ac58798dbe896a174defe70d33/3.correct/output/mAP_visualizations/cell_count_adjusted/same_well_diff_pert.png?raw=true)

3.2 Same ORF, different well

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same ORF, diff well	raw->subset	0.00974	0.0 (0/37)
same ORF, diff well	raw->subset->cc adjust	0.0202	0.027 (1/37)
same ORF, diff well	raw->subset->well correct	0.0166	0.027 (1/37)
same ORF, diff well	raw->subset->cc adjust->well correct	0.00834	0.0 (0/37)

Same ORF, different well plots

![same_orf_diff_well](https://github.com/broadinstitute/position-effect-correction/blob/8ab340f05949c0ac58798dbe896a174defe70d33/3.correct/output/mAP_visualizations/cell_count_adjusted/same_pert_diff_well.png?raw=true)

3.2 Same ORF, same well

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same ORF, same well	raw->subset	0.195	0.903 (3297/3653)
same ORF, same well	raw->subset->cc adjust	0.0856	0.417 (1524/3653)
same ORF, same well	raw->subset->well correct	0.286	0.93 (3397/3653)
same ORF, same well	raw->subset->cc adjust->well correct	0.538	0.989 (3612/3653)

Same ORF, same well plots

![same_orf_diff_well](https://github.com/broadinstitute/position-effect-correction/blob/8ab340f05949c0ac58798dbe896a174defe70d33/3.correct/output/mAP_visualizations/cell_count_adjusted/same_well_same_pert.png?raw=true)

Observations:

cc adjustment corrects for plate effects better than well mean correction
a combination of both actually makes things WORSE (perhaps, due to an overcorrection)

alxndrkalinin commented 1 year ago

Update 5/31: results and visualizations were re-generated after removing some data after QC and making sure index isn't used as a feature when calculating metrics (see #16).

Overall, the conclusions hold: cell count adjustment seems to help, while subtracting well mean on top of that can hurt the replicate retrieval performance.

cc @shntnu @AnneCarpenter

AnneCarpenter commented 1 year ago

🎉 Thanks for getting that done lightning fast! Can we visualize what the pattern looks like for the mean in each well position? I’m curious if it’s the left right pattern.

It’s surprising to me that this correction worsens outcomes! -- Sent from my mobile phone

alxndrkalinin commented 1 year ago

Here are visualizations of mean values across the whole ORF dataset, for Cells_AreaShape_Area:

Raw profiles	Cell count adjusted

Well position corrected	CC adjust + well pos correct

for Cells_Intensity_MeanIntensity_AGP:

Raw profiles	Cell count adjusted

Well position corrected	CC adjust + well pos correct

Additionally, there are visualizations for Image_Threshold_SumOfEntropies_CellsIncludingEdges, which originally had high correlation with cell count:

before adjustment: 0.719
after adjustment: 1.77e-14

Raw profiles	Cell count adjusted

Well position corrected	CC adjust + well pos correct

AnneCarpenter commented 1 year ago

Questions from Arnaud

I am still wondering the impact of the regression part on classification/clustering algorithms. Did u quantify that with some ad-hoc generated data.
Also I am surprised that this step isn’t done after normalization (any reasons?).
For the correlation threshold to use to select features that will go through this process, u are choosing 0.4?

For the first Q: we've indeed been using something downstream for evaluation: ability to retrieve the same ORF reagent in a different well position (and usually different batch) - which we want to be high. And, ability to retrieve a different ORF in the same well position - which we want to be low. So far our results look bad and confusing and we are trying to figure out what is going on! perhaps @alxndrkalinin can address your other two Qs.

shntnu commented 1 year ago

Also I am surprised that this step isn’t done after normalization (any reasons?).

We argued that regression should be done as early in the processing pipeline to capture relationships between variables. Adjusting for plate-to-plate variation before regression does not seem desirable (not thought this through too deeply)

AnneCarpenter commented 1 year ago

On Alex's list is to expand the evaluation from the 37 duplicated ORF reagents (because who knows, perhaps there is something systematically weird about these reagents) to same-gene ORF reagents and even to same-GO-term reagents. It occurred to me that perhaps we ought to be using MOA matching in Target2 plates in each batch as another evaluation? It's compounds rather than ORFs but many technical variations/artifacts should have affected them the same (except the initial steps of virus production, etc)

alxndrkalinin commented 1 year ago

Update 6/6

Normalization

We have previously said that RobustMAD normalizes per-plate. I just checked and it does not seem to be the case. Is that something we want to reconsider or is it not super important? (cc @niranjchandrasekaran @johnarevalo who're also using it)
Per-position mean feature visualizations

After we discussed normalization, I updated plate visualizations of feature means above using normalized data. I also made additional per-position visualizations.

We discussed with @shntnu that cell count adjustment seems to mostly reduce top-down effect, but not the left-right one.

Cell Counts - Raw profiles	Cell Counts - Well position corrected

Since features are comparable when normalized, I also made plot for all features averaged per well position:

All features - Raw profiles	All features - Cell count adjusted

All features - Position corrected	All features - CC adjusted and position corrected

I also made similar visualization per feature type:

Cell features

| Cell features - Raw profiles | Cell features - Cell count adjusted | | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/raw_profiles_Cells_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/cc_adj_profiles_Cells_Average_Feature_mean_per_well.png?raw=true) | | Cell features - Position corrected | Cell features - CC adjusted and position corrected| | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/well_mean_subtracted_Cells_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/cc_adj_well_mean_subtracted_Cells_Average_Feature_mean_per_well.png?raw=true) |

Nuclei features

| Nuclei features - Raw profiles | Nuclei features - Cell count adjusted | | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/raw_profiles_Nuclei_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/cc_adj_profiles_Nuclei_Average_Feature_mean_per_well.png?raw=true) | | Nuclei features - Position corrected | Nuclei features - CC adjusted and position corrected| | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/well_mean_subtracted_Nuclei_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/cc_adj_well_mean_subtracted_Nuclei_Average_Feature_mean_per_well.png?raw=true) |

Cytoplasm features

| Cytoplasm features - Raw profiles | Cytoplasm features - Cell count adjusted | | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/raw_profiles_Cytoplasm_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/cc_adj_profiles_Cytoplasm_Average_Feature_mean_per_well.png?raw=true) | | Cytoplasm features - Position corrected | Cytoplasm features - CC adjusted and position corrected| | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/well_mean_subtracted_Cytoplasm_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/cc_adj_well_mean_subtracted_Cytoplasm_Average_Feature_mean_per_well.png?raw=true) |

Image features

| Image features - Raw profiles | Image features - Cell count adjusted | | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/raw_profiles_Image_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/cc_adj_profiles_Image_Average_Feature_mean_per_well.png?raw=true) | | Image features - Position corrected | Image features - CC adjusted and position corrected| | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/well_mean_subtracted_Image_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/plate_visualizations/cc_adj_well_mean_subtracted_Image_Average_Feature_mean_per_well.png?raw=true) |

Retrieving same gene in different wells (instead of ORFs)

Given that there aren't many more unique gene symbols vs ORFs, plots and metrics for Same Well – Different ORF & Same Well – Same ORF remain the same as above.

Replacing ORFs with gene symbols, increases sample size, but produces metric contradictory to our previous conclusions

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same gene, diff well	raw->subset	0.0415	0.273 (72/264)
same gene, diff well	raw->subset->cc adjust	0.0262	0.121 (32/264)
same gene, diff well	raw->subset->well correct	0.0367	0.284 (75/264)
same gene, diff well	raw->subset->cc adjust->well correct	0.0278	0.152 (40/264)

Plots

![same_gene_diff_well](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/mAP_visualizations/gene_cell_count_adjusted/same_gene_diff_well.png?raw=true)

Removing Image features (per convo w/ @anne) improves retrieval

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same gene, diff well	raw->subset	0.0475	0.333 (88/264)
same gene, diff well	raw->subset->cc adjust	0.0311	0.155 (41/264)
same gene, diff well	raw->subset->well correct	0.0432	0.33 (87/264)
same gene, diff well	raw->subset->cc adjust->well correct	0.0304	0.152 (40/264)

Plots

![same_orf_diff_well](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/mAP_visualizations/no_image_gene_cell_count_adjusted/same_gene_diff_well.png?raw=true)

Preliminary results with PCA haven't showed any improvement in retrieval

PCA results

| Setting | Data | mmAP | Fraction retrieved (_p_<0.05) | | ------------- | ------------- | ------------- | ------------- | | same gene, diff well | raw->subset | 0.0338 | 0.227 (60/264) | | same gene, diff well | raw->subset->cc adjust | 0.015 | 0.00379 (1/264) | | same gene, diff well | raw->subset->well correct | 0.0187 | 0.0606 (16/264) | | same gene, diff well | raw->subset->cc adjust->well correct | 0.0189 | 0.0833 (22/264) | ![same_orf_diff_well](https://github.com/alxndrkalinin/position-effect-correction/blob/bb98ae5e7a167c727b0161c2344afd9dfd024885/3.correct/output/mAP_visualizations/pca_gene_cell_count_adjusted/same_gene_diff_well.png?raw=true)

niranjchandrasekaran commented 1 year ago

We have previously said that RobustMAD normalizes per-plate. I just checked and it does not seem to be the case.

@alxndrkalinin Do you mean the code doesn't separate the profiles by plates? If so, it is true that pycytominer does not normalize by plate in the typical profiling workflow. We separate the profiles by plate before doing RobustMAD normalization.

alxndrkalinin commented 1 year ago

Update 6/12

1. Normalization

I changed pre-processing to perform normalization per-plate and recalculated all downstream results. Note, that doing so resulted in some Inf values after RobustMAD because some Cytoplasm features have MAD=0 and our default epsilon=0.0. For plate visualizations, I removed those features, which should probably be done in preprocessing, so that's a TODO item.
2. Per-position mean feature visualizations

I updated plate visualizations of feature means using per-plate normalized data.

Cell Count

| Cell Counts - Raw profiles | Cell Counts - Well position corrected | | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/raw_profiles_Cells_Count_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/well_mean_subtracted_Cells_Count_mean_per_well.png?raw=true) |

All features averaged

| All features - Raw profiles | All features - Cell count adjusted | | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/raw_profiles_All_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/cc_adj_profiles_All_Average_Feature_mean_per_well.png?raw=true) | | All features - Position corrected | All features - CC adjusted and position corrected| | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/well_mean_subtracted_All_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/cc_adj_well_mean_subtracted_All_Average_Feature_mean_per_well.png?raw=true) |

Cell features

| Cell features - Raw profiles | Cell features - Cell count adjusted | | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/raw_profiles_Cells_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/cc_adj_profiles_Cells_Average_Feature_mean_per_well.png?raw=true) | | Cell features - Position corrected | Cell features - CC adjusted and position corrected| | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/well_mean_subtracted_Cells_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/cc_adj_well_mean_subtracted_Cells_Average_Feature_mean_per_well.png?raw=true) |

Nuclei features

| Nuclei features - Raw profiles | Nuclei features - Cell count adjusted | | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/raw_profiles_Nuclei_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/cc_adj_profiles_Nuclei_Average_Feature_mean_per_well.png?raw=true) | | Nuclei features - Position corrected | Nuclei features - CC adjusted and position corrected| | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/well_mean_subtracted_Nuclei_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/cc_adj_well_mean_subtracted_Nuclei_Average_Feature_mean_per_well.png?raw=true) |

Cytoplasm features

| Cytoplasm features - Raw profiles | Cytoplasm features - Cell count adjusted | | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/raw_profiles_Cytoplasm_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/cc_adj_profiles_Cytoplasm_Average_Feature_mean_per_well.png?raw=true) | | Cytoplasm features - Position corrected | Cytoplasm features - CC adjusted and position corrected| | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/well_mean_subtracted_Cytoplasm_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/cc_adj_profiles_Cytoplasm_Average_Feature_mean_per_well.png?raw=true) |

Image features

| Image features - Raw profiles | Image features - Cell count adjusted | | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/raw_profiles_Image_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/cc_adj_profiles_Image_Average_Feature_mean_per_well.png?raw=true) | | Image features - Position corrected | Image features - CC adjusted and position corrected| | ------------- | ------------- | | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/well_mean_subtracted_Image_Average_Feature_mean_per_well.png?raw=true) | ![](https://github.com/alxndrkalinin/position-effect-correction/blob/a766f894fb75d1a483922fdaca944eddc22cef4d/3.correct/output/plate_visualizations/cc_adj_well_mean_subtracted_Image_Average_Feature_mean_per_well.png?raw=true) |

3. Retrieving ORFs/genes with pre-plate normalization

WDN–whole dataset normalization PPN-IM–pre-plate normalization with image features PPN-NI–per-plate normalization without image features FR–fraction retrieved

3.1 Same well, different ORF/gene (lower is better, N=368)

Setting	Data	mmAP WDN	FR WDN (p<0.05)	mmAP PPN-IM	FR PPN-IM (p<0.05)	mmAP PPN-NI	FR PPN-NI (p<0.05)
same well, diff ORF	raw->subset	0.0636	0.139	0.148	0.899	0.164	0.938
same well, diff ORF	raw->subset->cc adjust	0.0583	0.0217	0.0919	0.543	0.0974	0.625
same well, diff ORF	raw->subset->well correct	0.114	0.25	0.241	0.864	0.184	0.69
same well, diff ORF	raw->subset->cc adjust->well correct	0.379	0.723	0.391	0.959	0.362	0.886

3.2v1 Same ORF, different well (higher is better, N=37)

Setting	Data	mmAP WDN	Fraction WDN (p<0.05)	mmAP PPN-IM	Fraction PPN-IM (p<0.05)
same ORF, different well	raw->subset	0.00974	0.0 (0/37)	0.0341	0.0811 (3/37)
same ORF, different well	raw->subset->cc adjust	0.0202	0.027 (1/37)	0.0314	0.0541 (2/37)
same ORF, different well	raw->subset->well correct	0.0166	0.027 (1/37)	0.0505	0.108 (4/37)
same ORF, different well	raw->subset->cc adjust->well correct	0.00834	0.0 (0/37)	0.0274	0.027 (1/37)

3.2v2 Same gene, different well (higher is better, N=264)

Setting	Data	mmAP WDN	FR WDN (p<0.05)	mmAP PPN-IM	FR PPN-IM (p<0.05)	mmAP PPN-NI	FR PPN-NI (p<0.05)
same gene, different well	raw->subset	0.0415	0.273	0.027	0.102	0.0351	0.121
same gene, different well	raw->subset->cc adjust	0.0262	0.121	0.0233	0.053	0.0257	0.0682
same gene, different well	raw->subset->well correct	0.0367	0.284	0.027	0.0606	0.0307	0.072
same gene, different well	raw->subset->cc adjust->well correct	0.0278	0.152	0.0225	0.0758	0.0225	0.0758

3.3 Same ORF/gene, same well (expected to drop, but not too much, N=3653)

Setting	Data	mmAP WDN	Fraction WDN (p<0.05)	mmAP PPN-IM	Fraction PPN-IM (p<0.05)	mmAP PPN-NI	Fraction PPN-NI (p<0.05)
same ORF, same well	raw->subset	0.195	0.903	0.211	0.778	0.232	0.79
same ORF, same well	raw->subset->cc adjust	0.0856	0.417	0.0863	0.346	0.102	0.405
same ORF, same well	raw->subset->well correct	0.286	0.93	0.307	0.857	0.268	0.794
same ORF, same well	raw->subset->cc adjust->well correct	0.538	0.989	0.412	0.929	0.396	0.903

3. Misc

Other experiments included PCA with and w/o image features and Cosine kernel PCA, but did not show improvement over the results above.

alxndrkalinin commented 1 year ago

In the 3.2v1 Same ORF, different well (higher is better, N=37) scenario, all 37 ORFs have 5x same-well replicates in one batch and 5x same-well replicates in another.

See an example

``` JCP2022_900041 Metadata_Batch Metadata_Plate Metadata_Well 6670 2021_06_14_Batch6 BR00123945 F08 8794 2021_06_14_Batch6 BR00124766 F08 9148 2021_06_14_Batch6 BR00124767 F08 9502 2021_06_14_Batch6 BR00124768 F08 9856 2021_06_14_Batch6 BR00124769 F08 10921 2021_07_12_Batch8 BR00124787 H13 11281 2021_07_12_Batch8 BR00124788 H13 12703 2021_07_12_Batch8 BR00125619 H13 13062 2021_07_12_Batch8 BR00125620 H13 13422 2021_07_12_Batch8 BR00125621 H13 ```

Whereas in the 3.2v2 Same gene, different well (higher is better, N=264) scenario, 33% of genes (89/264) come from a single batch and 1% (3/264) are from 3 batches:

# unique batches	# unique gene symbols	mmAP	Fraction retrieved (p<0.05)
2	172	0.0254	0.038 (10 / 264)
1	89	0.0300	0.06 (16 / 264)
3	3	0.0256	0.004 (1 / 264)

An example of a gene with all replicates in the same batch

``` ADGRG7 Metadata_JCP2022 Metadata_Batch Metadata_Plate Metadata_Well 17443 JCP2022_904157 2021_08_09_Batch11 BR00126538 H01 17539 JCP2022_908625 2021_08_09_Batch11 BR00126538 P01 17800 JCP2022_904157 2021_08_09_Batch11 BR00126539 H01 17896 JCP2022_908625 2021_08_09_Batch11 BR00126539 P01 18157 JCP2022_904157 2021_08_09_Batch11 BR00126540 H01 18253 JCP2022_908625 2021_08_09_Batch11 BR00126540 P01 18514 JCP2022_904157 2021_08_09_Batch11 BR00126541 H01 18610 JCP2022_908625 2021_08_09_Batch11 BR00126541 P01 18871 JCP2022_904157 2021_08_09_Batch11 BR00126542 H01 18967 JCP2022_908625 2021_08_09_Batch11 BR00126542 P01 ```

An example of a gene with 3 different batches and 4 well positions

``` ATM Metadata_JCP2022 Metadata_Batch Metadata_Plate Metadata_Well 1942 JCP2022_909964 2021_05_31_Batch2 BR00121552 D23 2210 JCP2022_909964 2021_05_31_Batch2 BR00121553 D23 2478 JCP2022_909964 2021_05_31_Batch2 BR00121554 D23 2746 JCP2022_909964 2021_05_31_Batch2 BR00121555 D23 3014 JCP2022_909964 2021_05_31_Batch2 BR00121556 D23 5109 JCP2022_909963 2021_05_17_Batch4 BR00123785 K01 5464 JCP2022_909963 2021_05_17_Batch4 BR00123786 K01 5819 JCP2022_909963 2021_05_17_Batch4 BR00123787 K01 6174 JCP2022_909963 2021_05_17_Batch4 BR00123790 K01 6529 JCP2022_909963 2021_05_17_Batch4 BR00123791 K01 6705 JCP2022_909963 2021_06_14_Batch6 BR00123945 D18 6706 JCP2022_909964 2021_06_14_Batch6 BR00123945 N22 8829 JCP2022_909963 2021_06_14_Batch6 BR00124766 D18 8830 JCP2022_909964 2021_06_14_Batch6 BR00124766 N22 9183 JCP2022_909963 2021_06_14_Batch6 BR00124767 D18 9184 JCP2022_909964 2021_06_14_Batch6 BR00124767 N22 9537 JCP2022_909963 2021_06_14_Batch6 BR00124768 D18 9538 JCP2022_909964 2021_06_14_Batch6 BR00124768 N22 9891 JCP2022_909963 2021_06_14_Batch6 BR00124769 D18 9892 JCP2022_909964 2021_06_14_Batch6 BR00124769 N22 ```

shntnu commented 1 year ago

(DRAFT)

Testing DINO features here: https://github.com/jump-cellpainting/morphmap/issues/91#issuecomment-1595712135

broadinstitute / position-effect-correction