shntnu opened 1 year ago
As a pilot test, I downloaded single cell profiles from one plate (B1A1R1-P1T1) and performed classification using XGBoostClassifier (n=100).
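For illustration, a minimal sketch of what one of these per-well-pair classifications might look like (the file path and column names are hypothetical; n=100 is assumed to mean the number of boosting rounds, and the 70/30 split with random_state=1 is stated further down the thread):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

# Hypothetical input: single-cell profiles for one plate, with a Metadata_Well
# column plus numeric CellProfiler feature columns.
profiles = pd.read_parquet("B1A1R1-P1T1.parquet")

# One binary task: does a cell come from well E05 or E06 (a replicate-control pair)?
well_a, well_b = "E05", "E06"
pair = profiles[profiles["Metadata_Well"].isin([well_a, well_b])]

feature_cols = [c for c in pair.columns if not c.startswith("Metadata_")]
X = pair[feature_cols].to_numpy()
y = (pair["Metadata_Well"] == well_b).astype(int).to_numpy()

# 70/30 split with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

clf = XGBClassifier(n_estimators=100)
clf.fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```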
That was fast @Zitong-Chen-16 !
Some quick notes for now; more to follow
List of wells (for controls)
```r
control %>%
  separate(Well_Pair, c("Well1", "Well2"), sep = "_") %>%
  select(Treatment, Well1, Well2) %>%
  pivot_longer(-Treatment, values_to = "Well") %>%
  select(Treatment, Well) %>%
  distinct() %>%
  group_by(Treatment) %>%
  arrange(Well) %>%
  summarise(Well = str_c(Well, collapse = ",")) %>%
  knitr::kable()
```
Treatment | Well |
---|---|
516 - TC | E05,E06,F05,F06 |
527 ALK R1275Q | I09,I10,J09,J10 |
527 ALK wt | G07,G08,H07,H08 |
527 MAPK9 | K13,K14,L13,L14 |
527 PRKACB | E19,E20,F19,F20 |
527 RHEB | I15,I16,J15,J16 |
527 SLIRP | G17,G18,H17,H18 |
PTK2B | K11,K12,L11,L12 |
Boxplots
```r
control %>%
  ggplot(aes(Treatment, `F1 Score`)) +
  geom_boxplot() +
  geom_dotplot(binaxis = "y", binwidth = 0.002, fill = "red", color = "red") +
  theme_bw()
```
@Zitong-Chen-16 What was the train/test split like? (Also, can you create a PR with your analysis notebook? Messy code is totally ok for now)
> @Zitong-Chen-16 What was the train/test split like?
The test size is 0.3, and the random state is set to 1.
> (Also, can you create a PR with your analysis notebook? Messy code is totally ok for now)
I just created a PR!
> The test size is 0.3, and the random state is set to 1.
Perfect
@Zitong-Chen-16 just to make sure there isn't some trivial issue with the setup, can you shuffle the labels and report the values?
I am reasonably sure it is not a trivial issue, so just thinking ahead:
We'd want to know what features are driving the differences. Are they the same across all pairs of control wells? Across the reference-variant pairs?
> @Zitong-Chen-16 just to make sure there isn't some trivial issue with the setup, can you shuffle the labels and report the values?
Here are the results after shuffling labels. The distribution looks fairly normal and centered at 0.5 (random guessing).
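For reference, the shuffled-label control can be produced by permuting the well labels before the split; a sketch (not the exact notebook code), reusing the variables from the earlier sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

def pair_f1(X, y, shuffle=False, seed=1):
    """Train/test F1 for one well pair; optionally shuffle labels as a sanity check."""
    if shuffle:
        # Permuting y breaks any real feature-label association, so F1 should
        # center around 0.5 if the setup has no trivial leakage.
        y = np.random.default_rng(seed).permutation(y)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    clf = XGBClassifier(n_estimators=100).fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te))
```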
Exciting to be trying this approach! Shantanu listed the 8 "null controls" as Sam called them, but could someone elaborate on what those controls are intended to be/do? They look like a bunch of different genes (mostly WT/reference) so I'm not sure why we would expect them to all look the same, which is what I assume we had hoped/intended here?
I'm also curious what the prefixes are (516, 527).
> could someone elaborate on what those controls are intended to be/do?
MAPK9, RHEB, SLIRP, & PRKACB are the negative controls (labelled "NC" under the "node_type" column) which should show no morphological phenotypes. Since there is no variant/reference pair here, perhaps we want to ignore the EGFP (protein) channel here?
ALK & ALK R1275Q are positive controls for the protein channel (labelled "PC" under the "node_type" column) where variant R1275Q should mislocalize to the ER.
PTK2B is the positive control for morphological changes suggested by Anne in a previous GH issue.
516 - TC is the transfection/selection control (labelled "TC" under the "node_type" column) and should have no cells remaining after selection.
Plate 4 of this screen should also contain genes labelled "CC" under the "node_type" column for candidate controls. It would be great if those could be analyzed as well. Since PTK2B was never tested in our system before, we included alternative controls to check out.
> I'm also curious what the prefixes are (516, 527).
516 and 527 are the Taipale lab's numbering system for which vector these expression ORFs are in. They got carried over while merging our tables and will likely be excluded in the future.
Thank you for the explanation Chloe!
> They look like a bunch of different genes (mostly WT/reference) so I'm not sure why we would expect them to all look the same, which is what I assume we had hoped/intended here?
To answer Anne's question: the null control in this experiment is training classifiers to distinguish the well locations of the same control treatment on the same plate. In other words, we train one classifier per Control1-Well1 vs. Control1-Well2 pair, so we are not comparing across different control treatments. Ideally, if there is no strong plate layout effect, the F1 score of this classification task should be quite low, and we can use it as the null baseline. In reality, the scores are pretty high, which is concerning because it indicates a strong plate layout effect.
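For concreteness: the 8 control treatments listed above each have 4 replicate wells, giving 6 unordered well pairs per treatment and 8 × 6 = 48 null classifiers in total. A sketch of how the null tasks could be enumerated (well lists copied from the table above):

```python
from itertools import combinations

control_wells = {
    "516 - TC":       ["E05", "E06", "F05", "F06"],
    "527 ALK R1275Q": ["I09", "I10", "J09", "J10"],
    "527 ALK wt":     ["G07", "G08", "H07", "H08"],
    "527 MAPK9":      ["K13", "K14", "L13", "L14"],
    "527 PRKACB":     ["E19", "E20", "F19", "F20"],
    "527 RHEB":       ["I15", "I16", "J15", "J16"],
    "527 SLIRP":      ["G17", "G18", "H17", "H18"],
    "PTK2B":          ["K11", "K12", "L11", "L12"],
}

# One classifier per (treatment, well pair): same treatment, different well location.
null_tasks = [
    (treatment, w1, w2)
    for treatment, wells in control_wells.items()
    for w1, w2 in combinations(wells, 2)
]
assert len(null_tasks) == 48
```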
To understand what features were most important during classification and whether they were consistent across control and experimental pairs, I saved the weight of each feature after training each model and counted the number of times each feature appeared as the Top 1 most important feature for a model. Below are the results.
> Across all pairs of control wells?
There are 48 pairs of control wells, meaning we trained 48 models.
I counted the number of times (the "Count" column) each feature appeared as the most important feature for a model.
32 different features appeared as the most important feature across models (ctrl_feat_count.csv).
The top 10 features with the highest frequency are presented below.
Feature | Count |
---|---|
Cytoplasm_Intensity_MeanIntensity_AGP | 7 |
Cytoplasm_Correlation_Costes_AGP_DNA | 4 |
Cytoplasm_Intensity_LowerQuartileIntensity_AGP | 3 |
Cytoplasm_Intensity_MeanIntensityEdge_AGP | 2 |
Cells_Intensity_MeanIntensity_AGP | 2 |
Cells_Intensity_UpperQuartileIntensity_AGP | 2 |
Cytoplasm_Intensity_MedianIntensity_AGP | 2 |
Cytoplasm_Texture_DifferenceEntropy_AGP_10_03_256 | 1 |
Cytoplasm_Correlation_Costes_GFP_Mito | 1 |
Cytoplasm_Correlation_Costes_AGP_Mito | 1 |
> Across the reference-variant pairs?
- There are 240 reference-variant pairs, meaning we trained 240 models.
- 103 different features appeared as the most important feature across models (var_feat_count.csv).
Top 10 features:
Feature | Count |
---|---|
Cytoplasm_Intensity_MeanIntensity_AGP | 14 |
Cytoplasm_Intensity_MADIntensity_AGP | 12 |
Cytoplasm_Correlation_Costes_AGP_DNA | 12 |
Nuclei_Intensity_LowerQuartileIntensity_Mito | 11 |
Cytoplasm_Correlation_Costes_AGP_GFP | 10 |
Cytoplasm_Intensity_MedianIntensity_AGP | 8 |
Cytoplasm_Intensity_MinIntensityEdge_Mito | 8 |
Cells_Intensity_MADIntensity_AGP | 7 |
Cytoplasm_Intensity_UpperQuartileIntensity_AGP | 6 |
Cytoplasm_Correlation_Costes_GFP_Mito | 5 |
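For reference, a sketch of the counting step used to build these tables (assuming the fitted classifier and its feature list are kept for each well pair; names are placeholders):

```python
import numpy as np
import pandas as pd

def top1_feature(clf, feature_names):
    """Name of the single most important feature of a fitted XGBoost model."""
    return feature_names[int(np.argmax(clf.feature_importances_))]

# models: list of (fitted XGBClassifier, list of feature names), one entry per trained pair
def count_top1(models):
    counts = pd.Series([top1_feature(clf, feats) for clf, feats in models]).value_counts()
    return counts.rename_axis("Feature").reset_index(name="Count")
```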
Thoughts: Many of the most important features are measurements of AGP intensity (which I believe is the actin channel?). I wonder if this has something to do with the artifacts discussed in earlier emails about the first batch (B1A1R1), where Erin said we saw "a LOT of little spots in (at least) the Actin channel. It's possible these are dye aggregates but could also be bacteria."
I actually have a question for @AnneCarpenter and @shntnu: Should we separate the protein and non-protein channels when doing classification? We hadn't discussed this before the experiment but maybe that was obvious to everyone else except me... I understand that protein and non-protein features allow us to draw different conclusions about changes in morphology (i.e. protein mislocalization/stability vs. morphological change) but I wasn't sure if that was the goal or if we just wanted to state that the variants are different from the reference.
A related question: with the current experimental settings, we are not really exploiting the information we know a priori about the negative and positive controls. We are merely using the fact that there are replicates on the same plate to control for the plate layout effect. I am listing the alternative experimental settings here just to help myself clarify and to make sure I understood the purpose of the different controls correctly:
We can also use PTK2B (classifying against the negative controls?) as the positive control for non-protein channel features.
I can't think of why this approach would be better than just comparing the features between control replicates, other than maybe augmenting the size of the null baseline, but I'm putting it here anyway.
Re Q1: yes, please do separate the protein and non-protein channels when doing classification. Of course, one might have better sensitivity when combining all available data (so it's not a bad idea), but we are quite interested in knowing how each variant impacts protein function, and clearly defining whether it's protein localization vs. the rest of cell morphology is a big piece of that. I think if a given variant is not significant for either protein localization or cell morphology but IS significant when combining all channels, we would worry that we are scraping the bottom of the barrel and in the range of statistical noise, so it's ok to "lose" those variants we would otherwise be able to get.
Re Q2: I agree with your 3 bullets (and yes, PTK2B would be against the neg controls); these are all reasonable controls to run. I'm not sure I understand your overall question: I think you're saying we are NOT planning to do any of those 3 bullets and we should reconsider. If so, I agree! Or perhaps better stated, I may need a reminder of what we are doing instead; you said "just comparing the features between control replicates" but I'm not sure what that means.
Q3: I believe these neg controls are thought to NOT impact the non-protein channels. I suspect they are tagged proteins though, so there will be fluorescence associated with the protein channel. Without variant pairs, they can't be pos or neg controls for having a change in localization due to the variant. Hopefully Chloe can confirm the tag status.
> 3. I also have a question about the negative controls for @renochlo (sorry if this is basic biology): for the negative controls, is it true that we expect them to show no significant changes in cell morphology (indicated through non-protein channel features)
If low replicate correlation from cpjump1 indicates no significant morphology changes, then yes. PRKACB, RB1, MAPK9, and RHEB were recommended by Marzieh as negative controls in this GH issue. However, we had issues growing RB1: we observed large, inconsistent deletions within the gene during cloning and therefore replaced it with SLIRP (I am not sure why we chose SLIRP now; it seems like we should have chosen an ORF with a lower correlation. Please let us know if this poses a problem and we will replace it).
> and have no signals in the protein channel (this is the part I am confused about)?
There should be a signal in the protein channel, as the proteins are still tagged with GFP; however, this channel is irrelevant for the negative controls and should be ignored. It just serves to replicate the conditions/vector that all other treatments received. The comparison for the protein channel would be reference vs. variant. (We considered dropping the GFP tag for the negative controls; however, that caused errors during image acquisition, and it would also mean cells are receiving a different construct.)
Thank you for the thorough explanations @AnneCarpenter and @renochlo!
> Re Q1: yes please do separate the protein and non-protein channels when doing classification {...}
That makes a lot of sense! I am working on that now.
> Re Q2: {...} I'm not sure I understand your overall question - I think you're saying we are NOT planning to do any of those 3 bullets and we should reconsider - if so, I agree!
Sorry about the confusion and yes, I thought we were not testing those controls in this manner so thank you for clarifying.
> Or perhaps better stated, I may need a reminder of what we are doing instead - you said "just comparing the features between control replicates" but I'm not sure what that means.
IIUC, in our last meeting, we decided to set the null task as classifying the well location of the same control treatment. In other words, we were only comparing the replicates of the same control treatments (on the same plate) but not comparing across different control treatments. This is mainly to account for the plate layout effect, with the null hypothesis that the classifier is able to distinguish between reference vs. variants because of well location rather than phenotype. So if we show that the classifiers cannot distinguish the same treatments on different wells, we can then reject the null hypothesis. I think this is still a good null baseline to have regardless of what the specific controls are since we should expect the same treatments to have the same phenotypes on both protein and non-protein channels.
The bullet points outlined above would then mean we run the following additional experiments:
Previously, the features used for classification were neither normalized nor feature-selected. Here I used mad_robustize to normalize the features and then performed plate-level feature selection (number of features = 684). Then I ran the same classification pipeline as before.
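For reference, a sketch of that preprocessing with pycytominer (the feature-selection operations listed are an assumed typical set, not necessarily the exact settings used):

```python
from pycytominer import normalize, feature_select

# Per-plate mad_robustize normalization of the single-cell profiles
normalized = normalize(
    profiles,            # single-cell profiles for one plate
    features="infer",    # use all CellProfiler feature columns
    method="mad_robustize",
)

# Plate-level feature selection (assumed operation list)
selected = feature_select(
    normalized,
    features="infer",
    operation=["variance_threshold", "correlation_threshold",
               "drop_na_columns", "blocklist"],
)
```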
The general observation is that performing feature selection drags down the performance of both the null controls and the reference vs. variant experiments, but we still don't see a separation between the distributions of the two groups (see Macro-averaged F1 score).
Additionally, the negative controls seem to have a very strong plate layout effect judging from the high F1 score in both protein and non-protein classifications (see Boxplot). I investigated which features are the most important for differentiating the negative controls, summarized in the tables:
What stands out to me is again how AGP seems to dominate the classifications. I wonder if this has something to do with the (suspected) actin aggregates we observed in Batch 1. Here I include an image of the actin channel from one of the TC wells. I am not very familiar with how these images should look so it would be great if someone could help me double check:
I also noticed something a bit weird that maybe @ErinWeisbart could help me with. In the transfection control (TC) wells, there should not be any cells remaining, but CellProfiler still identified some of the cell debris as cells, and those are included in our experiments as well. For instance, wells E05 and E06 from plate 1 have 35 and 32 profiles, respectively. I am wondering what the best practice is for dealing with those: should I just exclude them, or is there any normalization I can do at this point to use them as a baseline of dead cells to filter out other potentially dead cells in non-TC wells? I don't know if the fact that there are profiles in TC wells indicates that there could be other untransfected cells in other wells that are messing up the population distribution.
> Here I include an image of the actin channel from one of the TC wells. I am not very familiar with how these images should look so it would be great if someone could help me double check:
I only skimmed the most recent posts, but I would strongly suspect the aggregates in the Actin channel are throwing things off. There should be no punctae in that channel and there are very bright punctae that are stronger than the actual channel signal in the image that you shared. One tip for knowing that the staining isn't as it should be - you can see the same pattern across the background and cells alike. The background should never have patterns in it (depending on the signal:noise you might see a light haze but it should never have discernible structure).
I would suggest removing all Actin features and seeing what performance is.
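A minimal way to run that check on the already-extracted profiles is to drop every feature whose name references the AGP channel before retraining (sketch; `profiles` is the single-cell feature table from the earlier sketch):

```python
# Drop every column whose name references the AGP (actin/Golgi/plasma membrane)
# channel, then rerun the same classification pipeline on what remains.
agp_cols = [c for c in profiles.columns if "AGP" in c]
profiles_no_agp = profiles.drop(columns=agp_cols)
```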
> CellProfiler did still identify some of the cell debris as cells and those are included in our experiments as well. For instance, Well E5 and E6 from plate 1 have 35 and 32 profiles respectively. I am wondering what is the best practice for dealing with those - should I just exclude them or is there any normalization I can do at this point to use those as the baseline of dead cells to filter out other potentially dead cells in non-TC wells?
There are a number of different (non-exclusive) ways you can handle the cell debris:
1. Ignore it (frankly, this is what we usually do, though without some actual looking I can't say whether the level of dead-cell identification is higher in this dataset than in other datasets).
2. Rerun analysis with segmentation parameters that do a better job of filtering out dead cells (e.g., stricter minimum cell object size limits) and/or adjust the parameters on batches going forward.
3. Build some sort of classifier to filter out the cell debris from the single-cell data in all wells. You could probably do this pretty quickly and easily in CellProfiler Analyst. We have tutorials. You could start by doing this on a single plate and see how much it improves the signal, to know whether it's worth doing across the batch.
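As a rough post-hoc version of option 2, one could also filter the already-extracted single-cell profiles on object size; the column name below is standard CellProfiler output, but the threshold is a placeholder:

```python
# Post-hoc debris filter: drop objects whose segmented cell area (in pixels)
# falls below a minimum size. Tune the threshold by inspecting the TC wells.
MIN_CELL_AREA = 1000  # placeholder value

filtered = profiles[profiles["Cells_AreaShape_Area"] >= MIN_CELL_AREA]
print(f"kept {len(filtered)} of {len(profiles)} objects")
```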
> There should be no punctae in that channel and there are very bright punctae that are stronger than the actual channel signal in https://github.com/broadinstitute/2021_09_01_VarChAMP/issues/12#issuecomment-1699907797.
Thank you for checking @ErinWeisbart, and thanks for the suggestions on cell debris as well. I will tag @renochlo here so she is aware of the issue. Chloe, the image I shared is from the actin channel of a transfection control well in B1A1R1, and we are concerned that the actin aggregates are interfering with the downstream analysis. I am not sure if this is still something we need to worry about if we do indeed replace the actin channel with the new transfection control FP, as discussed in yesterday's meeting, but I just want to keep you posted.
I see, those are very bright. We did track down these artefacts in the AGP channel to contamination from our automated plate filler during staining. We've updated our protocol to include extra decontamination steps (specified here) and allocated separate equipment for VarChAMP staining use only, to prevent future cross-contamination. We hope this will not be a problem in the future.
We are currently re-running these alleles (B03A01-A02) with longer selection (as there shouldn't have been cells remaining in 516 - TC) and better staining.
Awesome, thank you for the update, Chloe!
To find out whether the strong plate layout effect I observed is specific to the VarChAMP dataset or due to the supervised method, I am running this single-cell classification pipeline on one Target 2 plate (BR00121425) from the JUMP ORF dataset.
Results: 95th percentile of the null experiment = 0.9658994032395568; 1087 out of 2016 compound pairs pass this threshold (i.e., have a higher F1 score).
Shantanu also suggested comparing the supervised vs. unsupervised methods on the DMSO control wells by comparing the F1 classification score to the cosine distance (1 - cosine similarity) between each pair of wells. The goal is to see whether there is a consistent pattern between the two heatmaps; if so, the classifiers are likely picking up on actual differences between the profiles rather than benefiting from data leakage, etc. Below are the two heatmaps. We do observe some correspondence: well pairs with high classification scores also have high cosine distances, which is intuitive.
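For reference, the cosine-distance matrix can be computed directly from well-level profiles, e.g. (sketch; `well_profiles` is assumed to be a wells x features DataFrame of the DMSO wells):

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# 1 - cosine similarity between every pair of DMSO well profiles
cosine_dist = pd.DataFrame(
    1 - cosine_similarity(well_profiles.to_numpy()),
    index=well_profiles.index,
    columns=well_profiles.index,
)
```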
Conclusion: The null control for the JUMP Target 2 plate (DMSO at different well locations) shows a somewhat different distribution from the experimental group, with a median at around 0.87, though the scores are not centered around 0.5 as in the shuffled experiment. About half of the compound pairs pass the 95% threshold. Note that compared to the VarChAMP experiment, we have many more pairs of null controls on one plate in this experiment (2016 in JUMP vs. 48 in VarChAMP).
Looking forward to digging in on this! Note that columns 1 and 24 are the edge columns, and A and P are edge rows. Some of the red stripes are 24s, which may be because of plate layout effects (i.e., readily distinguished from other compounds because of their location on the plate rather than their identity), so that makes sense (even if we'd prefer it not to happen!). I don't have a good explanation for C09 and D09; maybe just noise/artifacts affecting those two wells, or maybe they have a dramatic, unique phenotype.
I do wonder if we ought to throw out rows A/P and columns 1/24 (I wonder if DMSO is never in these locations, so our null never samples those kinds of edge effects but the real samples do).
All that is sort of a side note - I look forward to talking overall about these results!
Upon some investigation, I realized that our previous comparison between the null and ref-var experiments was unfair. Specifically, the null experiments are performed within each plate, whereas the reference vs. variant experiments are performed across four replicate plates at the same time. To make the comparison fair, I rewrote the pipeline so that we only perform in-plate classification for both null and ref-var. The resulting histogram is included below. Here, we finally observe a separation between the null and experimental distributions, but the median of the null is still clearly above 0.5.
To address this issue, we decided to try a plate-stratified approach for both null and ref-var, in which we train using data from 3 replicate plates and test on 1 held-out plate. So our F1 scores are based on a k-fold cross-validation where the folds are the plates (i.e., leave-one-plate-out). For example, to train a classifier for MAPK9-Plate1T1-A01 vs. MAPK9-Plate1T1-B02, we train a model to predict A01 vs. B02 using data from MAPK9-Plate1T2, MAPK9-Plate1T3, and MAPK9-Plate1T4 before testing on MAPK9-Plate1T1. The same applies to ref-var. With this approach, we were able to obtain a null distribution centered at 0.5.
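For reference, a sketch of this held-out-plate scheme using scikit-learn's LeaveOneGroupOut (variable names are placeholders; the real implementation lives in the pipeline):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

# X: single-cell features for one well pair, pooled across the 4 replicate plates
# y: 0/1 well label per cell; groups: plate identifier (e.g. "P1T1" .. "P1T4") per cell
def held_out_plate_f1(X, y, groups):
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        clf = XGBClassifier(n_estimators=100)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))
    return np.mean(scores)  # average F1 over the 4 held-out plates
```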
We therefore conclude that using a stratified approach is likely most suitable for our problem. In an internal discussion, @shntnu gave an intuitive explanation for why we are getting 0.5 centered null with this approach.
The variants identified as hits with this approach likely have a high false positive rate due to several reasons.
Now that the Snakemake pipeline is ready and Jess has confirmed using Batch 6 data that the plate layout effect cannot explain the shift in distribution we previously observed, we now feel confident with processing the 1% variants using the single-cell classification approach we have developed. For the 1% variants, we have two biological replicates, Batch 7 and Batch 8, and four technical replicates within each batch (e.g. P1T1 - P1T4 are four technical replicates with the same platemap). Below are the results from Batch 7, and I am finishing up the analysis with Batch 8.
Since the last update, I have been working on reimplementing the pipeline using Snakemake and optimizing it by trying different preprocessing steps. The pipeline, starting from per-plate single-cell .sqlite files and ending at final classification results, is now complete and can be run with a single command-line instruction. Link: 9.downstream_analysis_snakmake (will rename after clean-up).
The current pipeline includes the following preprocessing steps:
- Conversion of per-plate single-cell .sqlite files to .parquet using CytoTable (sketched below)
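A sketch of that conversion step (the preset name is an assumption based on CytoTable's CellProfiler SQLite presets; paths are hypothetical):

```python
from cytotable import convert

# Convert one plate's CellProfiler SQLite output to parquet for downstream steps
convert(
    source_path="B1A1R1-P1T1.sqlite",            # hypothetical path
    dest_path="B1A1R1-P1T1.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_sqlite_pycytominer",    # assumed preset name
)
```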
Thanks, @Zitong-Chen-16 !
I'm curious about the CON-CON classifiers that gave good results:
In the two batch 6 scenarios (ALK WT-VAR, MAPK9-LPAR1), there is no second peak, just a single Gaussian-looking peak slightly to the right of 0.5. Of course, 2 scenarios isn't much, but I'm wondering if the high-performing CON-CON classifiers have any kind of obvious pattern (i.e., all the same control, very different cell viability across plates, etc.).
@jessica-ewald great observation! I was curious about this too. I highly suspect this is caused by cell count. From a previous analysis, many of the good results in the null are caused by an extremely unbalanced class size; in other words, one well has hundreds of cells while the other has only a few. To account for this, I added a step in the pipeline to filter out wells with fewer than 8 cells (the max number from transfection-control wells), and also added a parameter in the classifier to account for class imbalance, but it's quite possible that some low-cell-count wells remain in the null experiment. I am planning to look into this more and will report back.
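For reference, a sketch of those two guards (minimum-cell-count filter and imbalance weighting); `profiles` and `y_train` are placeholders from the earlier sketches:

```python
from xgboost import XGBClassifier

# Drop wells with too few cells before building well-pair tasks
MIN_CELLS = 8  # threshold from the thread: max cell count seen in transfection-control wells
well_counts = profiles.groupby("Metadata_Well").size()
keep_wells = well_counts[well_counts >= MIN_CELLS].index
profiles = profiles[profiles["Metadata_Well"].isin(keep_wells)]

# Weight the positive class by the neg/pos ratio to counter remaining imbalance
n_pos = int(y_train.sum())
n_neg = int(len(y_train) - n_pos)
clf = XGBClassifier(n_estimators=100, scale_pos_weight=n_neg / max(n_pos, 1))
```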
Exciting to see these results rolling out! some comments...
It makes a lot of sense (to me) that protein channel effects are more commonly seen than morphology effects. I would just guess they’d be more common (I think this variant set is somewhat random and not enriched for known localization defects, but even so I would imagine localization will be more common than more general impact on morphology).
I think it will be great to look at images from different regions of the single- vs well-level scatterplot to see what is the story and if it all makes sense (vs weird technical effects).
For context, see https://github.com/broadinstitute/2021_09_01_VarChAMP/issues/5#issuecomment-1642753159 and the comments that follow