shntnu opened 1 year ago
As a pilot test, I downloaded single cell profiles from one plate (B1A1R1-P1T1) and performed classification using XGBoostClassifier (n=100).
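For illustration, a minimal sketch of what one of these per-well-pair classifications might look like (the file path and column names are hypothetical; n=100 is assumed to mean the number of boosting rounds, and the 70/30 split with random_state=1 is stated further down the thread):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

# Hypothetical input: single-cell profiles for one plate, with a Metadata_Well
# column plus numeric CellProfiler feature columns.
profiles = pd.read_parquet("B1A1R1-P1T1.parquet")

# One binary task: does a cell come from well E05 or E06 (a replicate-control pair)?
well_a, well_b = "E05", "E06"
pair = profiles[profiles["Metadata_Well"].isin([well_a, well_b])]

feature_cols = [c for c in pair.columns if not c.startswith("Metadata_")]
X = pair[feature_cols].to_numpy()
y = (pair["Metadata_Well"] == well_b).astype(int).to_numpy()

# 70/30 split with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

clf = XGBClassifier(n_estimators=100)
clf.fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```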
That was fast @Zitong-Chen-16 !
Some quick notes for now; more to follow
List of wells (for controls)
```r
control %>%
  separate(Well_Pair, c("Well1", "Well2"), sep = "_") %>%
  select(Treatment, Well1, Well2) %>%
  pivot_longer(-Treatment, values_to = "Well") %>%
  select(Treatment, Well) %>%
  distinct() %>%
  group_by(Treatment) %>%
  arrange(Well) %>%
  summarise(Well = str_c(Well, collapse = ",")) %>%
  knitr::kable()
```
Treatment | Well |
---|---|
516 - TC | E05,E06,F05,F06 |
527 ALK R1275Q | I09,I10,J09,J10 |
527 ALK wt | G07,G08,H07,H08 |
527 MAPK9 | K13,K14,L13,L14 |
527 PRKACB | E19,E20,F19,F20 |
527 RHEB | I15,I16,J15,J16 |
527 SLIRP | G17,G18,H17,H18 |
PTK2B | K11,K12,L11,L12 |
Boxplots
```r
control %>%
  ggplot(aes(Treatment, `F1 Score`)) +
  geom_boxplot() +
  geom_dotplot(binaxis = "y", binwidth = 0.002, fill = "red", color = "red") +
  theme_bw()
```
@Zitong-Chen-16 What was the train/test split like? (Also, can you create a PR with your analysis notebook? Messy code is totally ok for now)
> @Zitong-Chen-16 What was the train/test split like?
The test size is 0.3, and the random state is set to 1.
> (Also, can you create a PR with your analysis notebook? Messy code is totally ok for now)
I just created a PR!
> The test size is 0.3, and the random state is set to 1.
Perfect
@Zitong-Chen-16 just to make sure there isn't some trivial issue with the setup, can you shuffle the labels and report the values?
I am reasonably sure it is not a trivial issue, so just thinking ahead:
We'd want to know what features are driving the differences. Are they the same across all pairs of control wells? Across the reference-variant pairs?
> @Zitong-Chen-16 just to make sure there isn't some trivial issue with the setup, can you shuffle the labels and report the values?
Here are the results after shuffling labels. The distribution looks fairly normal and centered at 0.5 (random guessing).
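For reference, the shuffled-label control can be produced by permuting the well labels before the split; a sketch (not the exact notebook code), reusing the variables from the earlier sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

def pair_f1(X, y, shuffle=False, seed=1):
    """Train/test F1 for one well pair; optionally shuffle labels as a sanity check."""
    if shuffle:
        # Permuting y breaks any real feature-label association, so F1 should
        # center around 0.5 if the setup has no trivial leakage.
        y = np.random.default_rng(seed).permutation(y)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    clf = XGBClassifier(n_estimators=100).fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te))
```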
Exciting to be trying this approach! Shantanu listed the 8 "null controls" as Sam called them, but could someone elaborate on what those controls are intended to be/do? They look like a bunch of different genes (mostly WT/reference) so I'm not sure why we would expect them to all look the same, which is what I assume we had hoped/intended here?
I'm also curious what the prefixes are (516, 527).
> could someone elaborate on what those controls are intended to be/do?
MAPK9, RHEB, SLIRP, & PRKACB are the negative controls (labelled "NC" under the "node_type" column) which should show no morphological phenotypes. Since there is no variant/reference pair here, perhaps we want to ignore the EGFP (protein) channel here?
ALK & ALK R1275Q are positive controls for the protein channel (labelled "PC" under the "node_type" column) where variant R1275Q should mislocalize to the ER.
PTK2B is the positive control for morphological changes suggested by Anne in a previous GH issue.
516 - TC is the transfection/selection control (labelled "TC" under the "node_type" column) and should have no cells remaining after selection.
Plate 4 of this screen should also contain genes labelled "CC" under the "node_type" column for candidate controls. It would be great if those could be analyzed as well. Since PTK2B was never tested in our system before, we included alternative controls to check out.
> I'm also curious what the prefixes are (516, 527).
516 and 527 are the Taipale lab's numbering system for which vector these expression ORFs are in. They got carried over while merging our tables and will likely be excluded in the future.
Thank you for the explanation Chloe!
> They look like a bunch of different genes (mostly WT/reference) so I'm not sure why we would expect them to all look the same, which is what I assume we had hoped/intended here?
To answer Anne's question: the null control in this experiment is training classifiers to distinguish the well locations of the same control treatment on the same plate. In other words, we train one classifier per Control1-Well1 vs. Control1-Well2 pair, so we are not comparing across different control treatments. Ideally, if there is no strong plate layout effect, the F1 score of this classification task should be quite low, and we can use it as the null baseline. In reality, the scores are pretty high, which is concerning because it indicates a strong plate layout effect.
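For concreteness: the 8 control treatments listed above each have 4 replicate wells, giving 6 unordered well pairs per treatment and 8 × 6 = 48 null classifiers in total. A sketch of how the null tasks could be enumerated (well lists copied from the table above):

```python
from itertools import combinations

control_wells = {
    "516 - TC":       ["E05", "E06", "F05", "F06"],
    "527 ALK R1275Q": ["I09", "I10", "J09", "J10"],
    "527 ALK wt":     ["G07", "G08", "H07", "H08"],
    "527 MAPK9":      ["K13", "K14", "L13", "L14"],
    "527 PRKACB":     ["E19", "E20", "F19", "F20"],
    "527 RHEB":       ["I15", "I16", "J15", "J16"],
    "527 SLIRP":      ["G17", "G18", "H17", "H18"],
    "PTK2B":          ["K11", "K12", "L11", "L12"],
}

# One classifier per (treatment, well pair): same treatment, different well location.
null_tasks = [
    (treatment, w1, w2)
    for treatment, wells in control_wells.items()
    for w1, w2 in combinations(wells, 2)
]
assert len(null_tasks) == 48
```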
To understand what features were most important during classification and whether they were consistent across control and experimental pairs, I saved the weight of each feature after training each model and counted the number of times each feature appeared as the Top 1 most important feature for a model. Below are the results.
> Across all pairs of control wells?
There are 48 pairs of control wells, meaning we trained 48 models.
I counted the number of times (the "Count" column) each feature appeared as the most important feature for a model.
32 different features appeared as the most important feature across models (ctrl_feat_count.csv).
The top 10 features with the highest frequency are presented below.
Feature | Count |
---|---|
Cytoplasm_Intensity_MeanIntensity_AGP | 7 |
Cytoplasm_Correlation_Costes_AGP_DNA | 4 |
Cytoplasm_Intensity_LowerQuartileIntensity_AGP | 3 |
Cytoplasm_Intensity_MeanIntensityEdge_AGP | 2 |
Cells_Intensity_MeanIntensity_AGP | 2 |
Cells_Intensity_UpperQuartileIntensity_AGP | 2 |
Cytoplasm_Intensity_MedianIntensity_AGP | 2 |
Cytoplasm_Texture_DifferenceEntropy_AGP_10_03_256 | 1 |
Cytoplasm_Correlation_Costes_GFP_Mito | 1 |
Cytoplasm_Correlation_Costes_AGP_Mito | 1 |
> Across the reference-variant pairs?
- There are 240 reference-variant pairs, meaning we trained 240 models.
- 103 different features appeared as the most important feature across models (var_feat_count.csv).
Top 10 features:
Feature | Count |
---|---|
Cytoplasm_Intensity_MeanIntensity_AGP | 14 |
Cytoplasm_Intensity_MADIntensity_AGP | 12 |
Cytoplasm_Correlation_Costes_AGP_DNA | 12 |
Nuclei_Intensity_LowerQuartileIntensity_Mito | 11 |
Cytoplasm_Correlation_Costes_AGP_GFP | 10 |
Cytoplasm_Intensity_MedianIntensity_AGP | 8 |
Cytoplasm_Intensity_MinIntensityEdge_Mito | 8 |
Cells_Intensity_MADIntensity_AGP | 7 |
Cytoplasm_Intensity_UpperQuartileIntensity_AGP | 6 |
Cytoplasm_Correlation_Costes_GFP_Mito | 5 |
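For reference, a sketch of the counting step used to build these tables (assuming the fitted classifier and its feature list are kept for each well pair; names are placeholders):

```python
import numpy as np
import pandas as pd

def top1_feature(clf, feature_names):
    """Name of the single most important feature of a fitted XGBoost model."""
    return feature_names[int(np.argmax(clf.feature_importances_))]

# models: list of (fitted XGBClassifier, list of feature names), one entry per trained pair
def count_top1(models):
    counts = pd.Series([top1_feature(clf, feats) for clf, feats in models]).value_counts()
    return counts.rename_axis("Feature").reset_index(name="Count")
```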
Thoughts: Many of the most important features are measurements of AGP intensity (which I believe is the actin channel?). I wonder if this has something to do with the artifacts discussed in earlier emails about the first batch (B1A1R1), where Erin said we saw "a LOT of little spots in (at least) the Actin channel. It's possible these are dye aggregates but could also be bacteria."
I actually have a question for @AnneCarpenter and @shntnu: Should we separate the protein and non-protein channels when doing classification? We hadn't discussed this before the experiment but maybe that was obvious to everyone else except me... I understand that protein and non-protein features allow us to draw different conclusions about changes in morphology (i.e. protein mislocalization/stability vs. morphological change) but I wasn't sure if that was the goal or if we just wanted to state that the variants are different from the reference.
A related question: with the current experimental settings, we are not really exploiting the information we know a priori about the negative and positive controls. We are merely using the fact that there are replicates on the same plate to control for the plate layout effect. I am listing the alternative experimental settings here just to help myself clarify and to make sure I understood the purpose of the different controls correctly:
We can also use PTK2B (classifying against the negative controls?) as the positive control for non-protein channel features.
I can't think of why this approach would be better than just comparing the features between control replicates, other than maybe augmenting the size of the null baseline, but I'm putting it here anyway.
Re Q1: yes, please do separate the protein and non-protein channels when doing classification. Of course, one might have better sensitivity when combining all available data (so it's not a bad idea), but we are quite interested in knowing how each variant impacts protein function, and clearly defining whether it's protein localization vs. the rest of cell morphology is a big piece of that. I think if a given variant is not significant for either protein localization or cell morphology but IS significant when combining all channels, we would worry that we are scraping the bottom of the barrel and in the range of statistical noise, so it's ok to "lose" those variants we would otherwise be able to get.
Re Q2: I agree with your 3 bullets (and yes, PTK2B would be against the neg controls); these are all reasonable controls to run. I'm not sure I understand your overall question: I think you're saying we are NOT planning to do any of those 3 bullets and we should reconsider. If so, I agree! Or perhaps better stated, I may need a reminder of what we are doing instead; you said "just comparing the features between control replicates" but I'm not sure what that means.
Q3: I believe these neg controls are thought to NOT impact the non-protein channels. I suspect they are tagged proteins though, so there will be fluorescence associated with the protein channel. Without variant pairs, they can't be pos or neg controls for having a change in localization due to the variant. Hopefully Chloe can confirm the tag status.
> 3. I also have a question about the negative controls for @renochlo (sorry if this is basic biology): for the negative controls, is it true that we expect them to show no significant changes in cell morphology (indicated through non-protein channel features)
If low replicate correlation from cpjump1 indicates no significant morphology changes, then yes. PRKACB, RB1, MAPK9, and RHEB were recommended by Marzieh as negative controls in this GH issue. However, we had issues growing RB1: we observed large, inconsistent deletions within the gene during cloning and therefore replaced it with SLIRP (I am not sure why we chose SLIRP now; it seems like we should have chosen an ORF with a lower correlation. Please let us know if this poses a problem and we will replace it).
> and have no signals in the protein channel (this is the part I am confused about)?
There should be a signal in the protein channel, as the proteins are still tagged with GFP; however, this channel is irrelevant for the negative controls and should be ignored. It just serves to replicate the conditions/vector that all other treatments received. The comparison for the protein channel would be reference vs. variant. (We considered dropping the GFP tag for the negative controls; however, that caused errors during image acquisition, and it would also mean cells are receiving a different construct.)
Thank you for the thorough explanations @AnneCarpenter and @renochlo!
> Re Q1: yes please do separate the protein and non-protein channels when doing classification {...}
That makes a lot of sense! I am working on that now.
> Re Q2: {...} I'm not sure I understand your overall question - I think you're saying we are NOT planning to do any of those 3 bullets and we should reconsider - if so, I agree!
Sorry about the confusion and yes, I thought we were not testing those controls in this manner so thank you for clarifying.
> Or perhaps better stated, I may need a reminder of what we are doing instead - you said "just comparing the features between control replicates" but I'm not sure what that means.
IIUC, in our last meeting, we decided to set the null task as classifying the well location of the same control treatment. In other words, we were only comparing the replicates of the same control treatments (on the same plate) but not comparing across different control treatments. This is mainly to account for the plate layout effect, with the null hypothesis that the classifier is able to distinguish between reference vs. variants because of well location rather than phenotype. So if we show that the classifiers cannot distinguish the same treatments on different wells, we can then reject the null hypothesis. I think this is still a good null baseline to have regardless of what the specific controls are since we should expect the same treatments to have the same phenotypes on both protein and non-protein channels.
The bullet points outlined above would then mean we run the following additional experiments:
Previously, the features used for classification were neither normalized nor feature-selected. Here I used mad_robustize to normalize the features and then performed plate-level feature selection (number of features = 684). Then I ran the same classification pipeline as before.
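For reference, a sketch of that preprocessing with pycytominer (the feature-selection operations listed are an assumed typical set, not necessarily the exact settings used):

```python
from pycytominer import normalize, feature_select

# Per-plate mad_robustize normalization of the single-cell profiles
normalized = normalize(
    profiles,            # single-cell profiles for one plate
    features="infer",    # use all CellProfiler feature columns
    method="mad_robustize",
)

# Plate-level feature selection (assumed operation list)
selected = feature_select(
    normalized,
    features="infer",
    operation=["variance_threshold", "correlation_threshold",
               "drop_na_columns", "blocklist"],
)
```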
The general observation is that performing feature selection drags down the performance of both the null controls and the reference vs. variant experiments, but we still don't see a separation between the distributions of the two groups (see Macro-averaged F1 score).
Additionally, the negative controls seem to have a very strong plate layout effect judging from the high F1 score in both protein and non-protein classifications (see Boxplot). I investigated which features are the most important for differentiating the negative controls, summarized in the tables:
What stands out to me is again how AGP seems to dominate the classifications. I wonder if this has something to do with the (suspected) actin aggregates we observed in Batch 1. Here I include an image of the actin channel from one of the TC wells. I am not very familiar with how these images should look so it would be great if someone could help me double check:
I also noticed something a bit weird that maybe @ErinWeisbart could help me with. In the transfection control (TC) wells, there should not be any cells remaining, but CellProfiler still identified some of the cell debris as cells, and those are included in our experiments as well. For instance, wells E05 and E06 from plate 1 have 35 and 32 profiles, respectively. I am wondering what the best practice is for dealing with those: should I just exclude them, or is there any normalization I can do at this point to use them as a baseline of dead cells to filter out other potentially dead cells in non-TC wells? I don't know if the fact that there are profiles in TC wells indicates that there could be other untransfected cells in other wells that are messing up the population distribution.
> Here I include an image of the actin channel from one of the TC wells. I am not very familiar with how these images should look so it would be great if someone could help me double check:
I only skimmed the most recent posts, but I would strongly suspect the aggregates in the Actin channel are throwing things off. There should be no punctae in that channel and there are very bright punctae that are stronger than the actual channel signal in the image that you shared. One tip for knowing that the staining isn't as it should be - you can see the same pattern across the background and cells alike. The background should never have patterns in it (depending on the signal:noise you might see a light haze but it should never have discernible structure).
I would suggest removing all Actin features and seeing what performance is.
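A minimal way to run that check on the already-extracted profiles is to drop every feature whose name references the AGP channel before retraining (sketch; `profiles` is the single-cell feature table from the earlier sketch):

```python
# Drop every column whose name references the AGP (actin/Golgi/plasma membrane)
# channel, then rerun the same classification pipeline on what remains.
agp_cols = [c for c in profiles.columns if "AGP" in c]
profiles_no_agp = profiles.drop(columns=agp_cols)
```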
> CellProfiler did still identify some of the cell debris as cells and those are included in our experiments as well. For instance, Well E5 and E6 from plate 1 have 35 and 32 profiles respectively. I am wondering what is the best practice for dealing with those - should I just exclude them or is there any normalization I can do at this point to use those as the baseline of dead cells to filter out other potentially dead cells in non-TC wells?
There are a number of different (non-exclusive) ways you can handle the cell debris:
1. Ignore it (frankly, this is what we usually do, though without some actual looking I can't say whether the level of dead-cell identification is higher in this dataset than in other datasets).
2. Rerun analysis with segmentation parameters that do a better job of filtering out dead cells (e.g., stricter minimum cell object size limits) and/or adjust the parameters on batches going forward.
3. Build some sort of classifier to filter out the cell debris from the single-cell data in all wells. You could probably do this pretty quickly and easily in CellProfiler Analyst. We have tutorials. You could start by doing this on a single plate and see how much it improves the signal, to know whether it's worth doing across the batch.
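As a rough post-hoc version of option 2, one could also filter the already-extracted single-cell profiles on object size; the column name below is standard CellProfiler output, but the threshold is a placeholder:

```python
# Post-hoc debris filter: drop objects whose segmented cell area (in pixels)
# falls below a minimum size. Tune the threshold by inspecting the TC wells.
MIN_CELL_AREA = 1000  # placeholder value

filtered = profiles[profiles["Cells_AreaShape_Area"] >= MIN_CELL_AREA]
print(f"kept {len(filtered)} of {len(profiles)} objects")
```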
> There should be no punctae in that channel and there are very bright punctae that are stronger than the actual channel signal in https://github.com/broadinstitute/2021_09_01_VarChAMP/issues/12#issuecomment-1699907797.
Thank you for checking @ErinWeisbart, and thanks for the suggestions on cell debris as well. I will tag @renochlo here so she is aware of the issue. Chloe, the image I shared is from the actin channel of a transfection control well in B1A1R1, and we are concerned that the actin aggregates are interfering with the downstream analysis. I am not sure if this is still something we need to worry about if we do indeed replace the actin channel with the new transfection control FP, as discussed in yesterday's meeting, but I just want to keep you posted.
I see, those are very bright. We did track down these artefacts in the AGP channel to contamination from our automated plate filler during staining. We've updated our protocol to include extra decontamination steps (specified here) and allocated separate equipment for VarChAMP staining use only, to prevent future cross-contamination. We hope this will not be a problem in the future.
We are currently re-running these alleles (B03A01-A02) with longer selection (as there shouldn't have been cells remaining in 516 - TC) and better staining.
Awesome, thank you for the update, Chloe!
To find out whether the strong plate layout effect I observed is specific to the VarChAMP dataset or due to the supervised method, I am running this single-cell classification pipeline on one Target 2 plate (BR00121425) from the JUMP ORF dataset.
Results: 95th percentile of the null experiment = 0.9658994032395568; 1087 out of 2016 compound pairs pass this threshold (i.e., have a higher F1 score).
Shantanu also suggested comparing the supervised vs. unsupervised methods on the DMSO control wells by comparing the F1 classification score to the cosine distance (1 - cosine similarity) between each pair of wells. The goal is to see whether there is a consistent pattern between the two heatmaps; if so, the classifiers are likely picking up on actual differences between the profiles rather than benefiting from data leakage, etc. Below are the two heatmaps. We do observe some correspondence: well pairs with high classification scores also have high cosine distances, which is intuitive.
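For reference, the cosine-distance matrix can be computed directly from well-level profiles, e.g. (sketch; `well_profiles` is assumed to be a wells x features DataFrame of the DMSO wells):

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# 1 - cosine similarity between every pair of DMSO well profiles
cosine_dist = pd.DataFrame(
    1 - cosine_similarity(well_profiles.to_numpy()),
    index=well_profiles.index,
    columns=well_profiles.index,
)
```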
Conclusion: The null control for the JUMP Target 2 plate (DMSO at different well locations) shows a somewhat different distribution from the experimental group, with a median at around 0.87, though the scores are not centered around 0.5 as in the shuffled experiment. About half of the compound pairs pass the 95% threshold. Note that compared to the VarChAMP experiment, we have many more pairs of null controls on one plate in this experiment (2016 in JUMP vs. 48 in VarChAMP).
Looking forward to digging in on this! Note that columns 1 and 24 are the edge columns, and A and P are edge rows. Some of the red stripes are 24s, which may be because of plate layout effects (i.e., readily distinguished from other compounds because of their location on the plate rather than their identity), so that makes sense (even if we'd prefer it not to happen!). I don't have a good explanation for C09 and D09; maybe just noise/artifacts affecting those two wells, or maybe they have a dramatic, unique phenotype.
I do wonder if we ought to throw out rows A/P and columns 1/24 (I wonder if DMSO is never in these locations, so our null never samples those kinds of edge effects but the real samples do).
All that is sort of a side note - I look forward to talking overall about these results!
Upon some investigation, I realized that our previous comparison between the null and ref-var experiments was unfair. Specifically, the null experiments are performed within each plate, whereas the reference vs. variant experiments are performed across four replicate plates at the same time. To make the comparison fair, I rewrote the pipeline so that we only perform in-plate classification for both null and ref-var. The resulting histogram is included below. Here, we finally observe a separation between the null and experimental distributions, but the median of the null is still clearly above 0.5.
To address this issue, we decided to try a plate-stratified approach for both null and ref-var, in which we train using data from 3 replicate plates and test on 1 held-out plate. So our F1 scores are based on a k-fold cross-validation where the folds are the plates (i.e., leave-one-plate-out). For example, to train a classifier for MAPK9-Plate1T1-A01 vs. MAPK9-Plate1T1-B02, we train a model to predict A01 vs. B02 using data from MAPK9-Plate1T2, MAPK9-Plate1T3, and MAPK9-Plate1T4 before testing on MAPK9-Plate1T1. The same applies to ref-var. With this approach, we were able to obtain a null distribution centered at 0.5.
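For reference, a sketch of this held-out-plate scheme using scikit-learn's LeaveOneGroupOut (variable names are placeholders; the real implementation lives in the pipeline):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

# X: single-cell features for one well pair, pooled across the 4 replicate plates
# y: 0/1 well label per cell; groups: plate identifier (e.g. "P1T1" .. "P1T4") per cell
def held_out_plate_f1(X, y, groups):
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        clf = XGBClassifier(n_estimators=100)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))
    return np.mean(scores)  # average F1 over the 4 held-out plates
```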
We therefore conclude that using a stratified approach is likely most suitable for our problem. In an internal discussion, @shntnu gave an intuitive explanation for why we are getting 0.5 centered null with this approach.
The variants identified as hits with this approach likely have a high false positive rate due to several reasons.
Now that the Snakemake pipeline is ready and Jess has confirmed using Batch 6 data that the plate layout effect cannot explain the shift in distribution we previously observed, we now feel confident with processing the 1% variants using the single-cell classification approach we have developed. For the 1% variants, we have two biological replicates, Batch 7 and Batch 8, and four technical replicates within each batch (e.g. P1T1 - P1T4 are four technical replicates with the same platemap). Below are the results from Batch 7, and I am finishing up the analysis with Batch 8.
Since the last update, I have been working on reimplementing the pipeline using Snakemake and optimizing it by trying different preprocessing steps. The pipeline, starting from per-plate single-cell .sqlite files and ending at final classification results, is now complete and can be run with a single command-line instruction. Link: 9.downstream_analysis_snakmake (will rename after clean-up).
The current pipeline includes the following preprocessing steps:
- Conversion of per-plate single-cell .sqlite files to .parquet using CytoTable (sketched below)
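A sketch of that conversion step (the preset name is an assumption based on CytoTable's CellProfiler SQLite presets; paths are hypothetical):

```python
from cytotable import convert

# Convert one plate's CellProfiler SQLite output to parquet for downstream steps
convert(
    source_path="B1A1R1-P1T1.sqlite",            # hypothetical path
    dest_path="B1A1R1-P1T1.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_sqlite_pycytominer",    # assumed preset name
)
```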
Thanks, @Zitong-Chen-16 !
I'm curious about the CON-CON classifiers that gave good results:
In the two batch 6 scenarios (ALK WT-VAR, MAPK9-LPAR1), there is no second peak, just a single Gaussian-looking peak slightly to the right of 0.5. Of course, 2 scenarios isn't much, but I'm wondering if the high-performing CON-CON classifiers have any kind of obvious pattern (i.e., all the same control, very different cell viability across plates, etc.).
@jessica-ewald great observation! I was curious about this too. I highly suspect this is caused by cell count. From a previous analysis, many of the good results in the null are caused by an extremely unbalanced class size; in other words, one well has hundreds of cells while the other has only a few. To account for this, I added a step in the pipeline to filter out wells with fewer than 8 cells (the max number from transfection-control wells), and also added a parameter in the classifier to account for class imbalance, but it's quite possible that some low-cell-count wells remain in the null experiment. I am planning to look into this more and will report back.
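For reference, a sketch of those two guards (minimum-cell-count filter and imbalance weighting); `profiles` and `y_train` are placeholders from the earlier sketches:

```python
from xgboost import XGBClassifier

# Drop wells with too few cells before building well-pair tasks
MIN_CELLS = 8  # threshold from the thread: max cell count seen in transfection-control wells
well_counts = profiles.groupby("Metadata_Well").size()
keep_wells = well_counts[well_counts >= MIN_CELLS].index
profiles = profiles[profiles["Metadata_Well"].isin(keep_wells)]

# Weight the positive class by the neg/pos ratio to counter remaining imbalance
n_pos = int(y_train.sum())
n_neg = int(len(y_train) - n_pos)
clf = XGBClassifier(n_estimators=100, scale_pos_weight=n_neg / max(n_pos, 1))
```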
Exciting to see these results rolling out! some comments...
It makes a lot of sense (to me) that protein channel effects are more commonly seen than morphology effects. I would just guess they’d be more common (I think this variant set is somewhat random and not enriched for known localization defects, but even so I would imagine localization will be more common than more general impact on morphology).
I think it will be great to look at images from different regions of the single- vs well-level scatterplot to see what is the story and if it all makes sense (vs weird technical effects).
For context, see https://github.com/broadinstitute/2021_09_01_VarChAMP/issues/5#issuecomment-1642753159 and the comments that follow