What controls should be replaced from the 1% data?

jessica-ewald commented 2 months ago

Background

There are 4 types of controls in the VarChAMP data: transfection control, positive control (morphology), positive control (localization), and negative control (morphology). We are revisiting the positive and negative controls because: 1) some variants are not what we thought they were, and 2) controls were chosen based on analyzing JUMP data and we want to confirm that they are actually poscons/negcons based on VarChAMP data.

Positive localization control

Chloe asked if we have any suggestions from the 1% data. I would suggest taking one of the pathogenic variants that was sequenced and has both a high AUROC and a high-confidence sequencing score (for both the WT and the variant).

Positive and negative morphology controls

Chloe asked if I can confirm that SLIRP is an adequate negcon, and to select one of the candidate positive controls to replace PTK2B, which came back as something else when sequenced. This is where we have issues. In the current state of the analysis and data, no REF-VAR pair is distinguishable from the null. This is because when we construct the control-control null, we get the complete range of AUROC values: from 0 all the way to 0.99 (maybe even 1.0). Thus, no morphology profile comes back as a hit. We suspect this is caused by the confocal z-plane issue, where random replicates imaged at quite different z-planes throw off the whole analysis. This holds even if we throw away AGP, the channel that suffers from this the most.

What to do

1. Select a new positive localization control. I need to organize results for the 1% paper anyways, so I will suggest a few pairs when I'm doing that.
2. For morphology poscons, maybe look by eye at all of the wt-var morphology classifiers that had high AUROC values (>0.8) and choose several with visually distinguishable morphology?
3. For morphology negcons, I'm afraid we can't do this right now. I'm not confident in our ability to pick out negcons by eye, since we know that many times we find significant associations in CellPainting datasets that reproduce known biology, yet are not visually distinguishable. I think we should revisit this with the next batch of data, after switching from confocal to widefield.
4. Continue to characterize the wt-var morphology issues. Perhaps it can be partially solved by trying to filter out wells with extreme z-plane values. I don't know what features we would use for unbiased, automated filtering, but I will think more about this.

AnneCarpenter commented 2 months ago

1. Great!
2. Great!
3. Are there enough replicates of any individual samples in category 1 or 2 to use them as morphology negcons? Morphology should not change from one replicate to another of the same sample. Of course, no morphology negcon will work with the null being so variable, but once that settles down, this could be a solution.
4. There are a number of approaches to filter out blurry images (if I'm right that this is how they look?)
- Use a pre-trained deep learning model. We never adopted it for mainstream use and I don't know why; presumably because we rarely have focus issues: Yang SJ, Berndl M, Michael Ando D, Barch M, Narayanaswamy A, Christiansen E, Hoyer S, Roat C, Hung J, Rueden CT, Shankar A, Finkbeiner S, Nelson P (2018). Assessing microscope image focus quality with deep learning. BMC Bioinformatics 19(1):28962. PMID: 29540156. PMCID: PMC5853029 https://bmcbioinformatics.biomedcentral.com/counter/pdf/10.1186/s12859-018-2087-4.pdf
- Use individual metrics that indicate blurriness (usually these will be image-level metrics rather than cell-level metrics). These papers should tell you which are good. The first one also (maybe exclusively?) talks about training a classifier to recognize blurry images on a per-dataset basis, but that seems overkill: (a) Bray MA, Carpenter AE (2018). Quality Control for High-Throughput Imaging Experiments Using Machine Learning in CellProfiler. Methods Mol Biol 1683:89-112. PMID: 29082489. PMCID: PMC6112602 https://carpenter-singh-lab.broadinstitute.org/files/anne/files/111_Bray_MiMB_2018.pdf (b) Bray M-A, Fraser AN, Hasaka TP, Carpenter AE (2012). Workflow and metrics for image quality control in large-scale high-content screens. Journal of Biomolecular Screening 17(2):135–143. PMID: 21956170. PMCID: PMC3593271 https://carpenter-singh-lab.broadinstitute.org/files/anne/files/56-Bray_JBiomolScreen_2011.pdf
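
To make the "individual metric" option concrete, here is a minimal sketch of one image-level sharpness score. It uses the variance of the Laplacian, a generic sharpness proxy chosen for illustration; it is not necessarily one of the metrics the papers above recommend, and the function name and test images are hypothetical.

```python
import numpy as np

def laplacian_variance(image):
    """Image-level sharpness proxy: variance of a 4-neighbour Laplacian."""
    img = np.asarray(image, dtype=float)
    # Discrete Laplacian on interior pixels (up + down + left + right - 4 * centre).
    lap = (img[:-2, 1:-1] + img[2:, 1:-1]
           + img[1:-1, :-2] + img[1:-1, 2:]
           - 4.0 * img[1:-1, 1:-1])
    return float(lap.var())

# Synthetic illustration only: a noisy "sharp" image and a 3x3 box-blurred copy.
rng = np.random.default_rng(0)
sharp = rng.random((64, 64))
padded = np.pad(sharp, 1, mode="edge")
blurred = sum(padded[i:i + 64, j:j + 64] for i in range(3) for j in range(3)) / 9.0

sharp_score = laplacian_variance(sharp)
blurred_score = laplacian_variance(blurred)
```

A lower score suggests a blurrier image, so images could be flagged when their score is low relative to other images in the same channel. Any cutoff would need to be chosen per dataset.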

jessica-ewald commented 2 months ago

For 3, there are 4 replicates of every control, so we've already been doing pretty robust ctrl-ctrl comparisons to generate our null distribution. I think this is perfectly fine to continue with! In this case, we could use any allele as a negcon (so long as it is repeated), so we may not need specifically labelled 'negcons'. Having a decent number of repeated controls is important. Maybe it's a good idea to include another mislocalization poscon wt-var pair instead of the current negcons?

For 4, there are images of AGP here: https://github.com/broadinstitute/2021_09_01_VarChAMP/issues/23#issue-2388901414. It's not really blurry so much as when the z-plane is close to the bottom of the cell, there is much higher intensity because it's where the actin is attaching to the plate (Beth's explanation). A simple filter could compute the median AGP intensity and flag outliers (w.r.t. the replicates).
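
A minimal sketch of that filter, assuming a per-well table with a hypothetical `allele` column and a hypothetical `median_agp_intensity` column. It scores each well against the replicates of the same allele using a robust z-score based on the median absolute deviation (MAD); the 3.5 cutoff is a common rule of thumb, not something taken from our data.

```python
import pandas as pd

def flag_intensity_outliers(wells, group_col="allele",
                            value_col="median_agp_intensity", threshold=3.5):
    """Flag wells whose intensity deviates strongly from replicates of the same allele."""
    out = wells.copy()
    # Median of each replicate group, broadcast back to every well.
    median = out.groupby(group_col)[value_col].transform("median")
    abs_dev = (out[value_col] - median).abs()
    # Median absolute deviation (MAD) within each replicate group.
    mad = abs_dev.groupby(out[group_col]).transform("median")
    # 0.6745 rescales the MAD so the score is comparable to a z-score under normality.
    # A MAD of zero gives NaN, which is never flagged.
    out["robust_z"] = 0.6745 * (out[value_col] - median) / mad.replace(0, float("nan"))
    out["is_outlier"] = out["robust_z"].abs() > threshold
    return out

# Hypothetical values for illustration only: 4 replicates of 2 controls.
demo = pd.DataFrame({
    "allele": ["ctrl_A"] * 4 + ["ctrl_B"] * 4,
    "median_agp_intensity": [100.0, 102.0, 98.0, 250.0, 50.0, 51.0, 49.0, 50.0],
})
flagged = flag_intensity_outliers(demo)
print(flagged[flagged["is_outlier"]])
```

With only 4 replicates per group the MAD is itself noisy, so this is better suited to flagging wells for a visual check than to automatic exclusion.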

AnneCarpenter commented 2 months ago

Great, for 3, makes sense! It would be nice to keep negcons in the hopes that some day we can detect them. For 4, I see. It really seems there's no choice but to reduce the technical noise introduced by the auto-focusing (in future data collections). Filtering out 'bad' images can only get us so far!

jessica-ewald commented 2 months ago

Here are 3 options for new mislocalization positive controls. I selected WT-VAR pairs that are clearly distinguishable, have similar intensity/protein abundance, and where the cells have a high count and look healthy for both WT and VAR.

1. HPRT1 His204Asp [images: WT, VAR]
2. GMPPB Asp27His [images: WT, VAR]
3. RAB33B Lys46Gln [images: WT, VAR]

jessica-ewald commented 2 months ago

Possible morphology positive controls

To recap, choosing these quantitatively is difficult because the 95th percentile of the ctrl-ctrl morphology AUROC null is 0.99, so only 0 or 1 wt-var pairs make it past the threshold. We thought we still might be able to choose some by visually examining images for the wt-var pairs that gave the highest AUROC values.
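
The thresholding logic described here can be sketched as follows. The AUROC values and pair names are hypothetical, and `call_hits` is an illustrative helper rather than the function used in our pipeline; the point is only to show how a null that spans the whole AUROC range pushes the 95th-percentile cutoff so high that almost nothing passes.

```python
import numpy as np

def call_hits(null_aurocs, pair_aurocs, percentile=95.0):
    """Return the null-derived threshold and the wt-var pairs that exceed it."""
    threshold = float(np.percentile(null_aurocs, percentile))
    hits = {pair: auroc for pair, auroc in pair_aurocs.items() if auroc > threshold}
    return threshold, hits

# Hypothetical AUROC values, for illustration only.
pair_aurocs = {"pair_A": 0.97, "pair_B": 0.90}

wide_null = np.linspace(0.0, 1.0, 101)   # ctrl-ctrl AUROCs spanning the whole range
tight_null = np.linspace(0.4, 0.6, 101)  # a narrower null centred on 0.5

wide_threshold, wide_hits = call_hits(wide_null, pair_aurocs)
tight_threshold, tight_hits = call_hits(tight_null, pair_aurocs)
```

In this toy example the wide null keeps only one pair, while the narrow null keeps both, even though the wt-var AUROCs are identical in the two cases.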

I plotted images of the top 12 wt-var pairs (mean AUROC > 0.95). After looking at the images, I excluded pairs where there was overall low cell health for both wt and var, cases where the cell count differed greatly between wt and var, and cases where I could not visually distinguish the morphology of wt and var. This left me with two wt-var pairs:

GSS Arg125Cys

KLHL3 Cys164Phe

AnneCarpenter commented 2 months ago

Ooof, those honestly seem pretty subtle to me! I can convince myself there is a difference, but only by looking back and forth between them; a glance at an image would not immediately make it obvious which category it's in.

If someone else feels they are consistently distinguishable and can verbalize how, then great. If not, I wonder if we should go with a backup plan of looking for samples where cell count is fairly normal but something like nucleus eccentricity (a very easy-to-see shape measure) is distinctive? Of course we don't have an infinite number of pairs to test, so maybe there is no pair that differs strongly in that feature; we could try nucleus or cell area too.
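
A rough sketch of that backup plan, assuming a per-well table with hypothetical columns `pair`, `allele_type` (WT or VAR), `cell_count`, and a CellProfiler-style feature column such as `Nuclei_AreaShape_Eccentricity`. It drops pairs whose WT and VAR cell counts differ too much, then ranks the rest by a standardized WT-vs-VAR difference in the chosen feature. The 1.5 count-ratio cutoff is an arbitrary placeholder.

```python
import pandas as pd

def rank_pairs_by_feature(wells, feature="Nuclei_AreaShape_Eccentricity",
                          max_count_ratio=1.5):
    """Rank wt-var pairs by standardized feature difference, keeping similar cell counts."""
    summary = wells.groupby(["pair", "allele_type"]).agg(
        cell_count=("cell_count", "mean"),
        feature_mean=(feature, "mean"),
        feature_std=(feature, "std"),
    ).unstack("allele_type")
    counts = summary["cell_count"]
    means = summary["feature_mean"]
    stds = summary["feature_std"]
    result = pd.DataFrame({
        # Ratio of the larger to the smaller mean cell count within a pair.
        "count_ratio": counts.max(axis=1) / counts.min(axis=1),
        # Difference in means divided by the pooled replicate standard deviation.
        "effect_size": (means["VAR"] - means["WT"])
                       / ((stds["WT"] ** 2 + stds["VAR"] ** 2) / 2) ** 0.5,
    })
    result = result[result["count_ratio"] <= max_count_ratio].copy()
    result["abs_effect"] = result["effect_size"].abs()
    return result.sort_values("abs_effect", ascending=False)

# Hypothetical per-well values for illustration only (2 WT and 2 VAR wells per pair).
demo_wells = pd.DataFrame({
    "pair": ["pair_A"] * 4 + ["pair_B"] * 4 + ["pair_C"] * 4,
    "allele_type": ["WT", "WT", "VAR", "VAR"] * 3,
    "cell_count": [500, 520, 480, 510, 500, 510, 490, 505, 500, 500, 100, 120],
    "Nuclei_AreaShape_Eccentricity": [0.60, 0.62, 0.80, 0.82,
                                      0.60, 0.64, 0.63, 0.61,
                                      0.50, 0.52, 0.90, 0.92],
})
ranked = rank_pairs_by_feature(demo_wells)
print(ranked)
```

Swapping `feature` for a cell-level eccentricity or area column would test the other shape measures mentioned above in the same way.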

jessica-ewald commented 2 months ago

The main difference that I see is that WT cells look rounder, and VAR cells look more spindly. I agree that it's very subtle - was hoping for something much more distinguishable!

I can try targeted features like you suggested.

AnneCarpenter commented 2 months ago

Eccentricity might be even more visible for cells than for nucleus actually so that may be a good route!

AnneCarpenter commented 2 months ago

re poscons, I forgot to post when offline traveling:

Oh wow, well 2. GMPPB Asp27His is just spectacular to look at and seems extremely consistent in both versions (REF and VAR). It will rely on good segmentation to be detected since it’s pretty cytoplasmic in both cases, just has the extra plasma membrane that makes it distinct.

The others are more variable: perhaps useful if you want a milder poscon that is closer to the boundary of what we hope to find? I could go either way on that. If you pick a milder one, maybe 1. HPRT1 His204Asp is a bit better because it’s a bit more consistent well to well.

renochlo commented 2 months ago

Thank you Jess!

We will sequence-verify all your suggested clones! Another suggestion for the localization positive control is the AGXT variants:

AGXT
- AGXT Ala85Asp
- AGXT Asn22Ser
- AGXT Asp201Asn

I like the idea of having a milder poscon closer to our cutoff boundary, could we perhaps drop a negcon to include both a mild and strong localization poscon? How useful do you think this is for analysis?

3 & 4. When you construct the control-control null using the negcons, is that the exact same negcon compared to itself (i.e. RHEB vs. RHEB) or negcons compared to one another (i.e. RHEB vs. SLIRP)? Although the selection of these negcons was based on CPJUMP, I don't think we can assume these are our best negcons/morphology controls. Does the scrambled controls plate help at all? Do you see consistent morphological differences irrespective of well position across constructs there? If so, we may not have selected the best constructs. What happens if you compare the morphological profiles of all WTs against each other?
