CRISPRi validation of rare variant associations

mtegtmey commented 2 years ago

New data has been generated and transferred to @bethac07.

Here is the per-well metadata for the plate: cmQTL_CRISPRi_metadata.xlsx

shntnu commented 2 years ago

@bethac07 asked

Do you want us to use the old pipeline and CP3, or proceed as if this was a new batch?
Likewise, do you want us to use the old R-based workflow downstream, or is it ok to use the recipe and pycytominer? I did check that at the time of your notes, the default aggregation in cytominer_scripts was mean (as it currently is in collate.py), which I would think/hope would be the major difference.

@AnneCarpenter replied

I think the goal here is to spot check a few genes vs a few features, so I think it's fine to use any version of CP that's convenient and I don't think it's important for the profiling to be identical either.

@mtegtmey -- the goal (as stated by Anne) sounds right to me but please confirm

mtegtmey commented 2 years ago

@bethac07 @shntnu I agree with Anne. Any version of CP should be fine since we have strong priors going into these validations.

bethac07 commented 2 years ago

Something I wanted to flag- some of these nuclei look pretty weird/bad, at least compared to what I'm used to looking at (cancer cells) - note the range of brightnesses, that one that's got weird holes in it, etc. (Ignore the bad segmentation for now, that's fixable). It's been literally years though since I did the original assay dev on these, so it's possible I'm misremembering, AND I don't know exactly how these were treated - is there any reason we should expect this? I assume I should try to keep everything?

AnneCarpenter commented 2 years ago

My first thought is that it's physiological and related to differentiation state but I don't really know why that popped up for me, I've no evidence/knowledge! Curious if @mtegtmey remembers anything.

bethac07 commented 2 years ago

Definitely possible Anne! My first thought is it's an incomplete drug selection, so knowing if selection was used here would be super helpful.

mtegtmey commented 2 years ago

Sorry for a late reply, I’ve been travelling much of the day!

It’s odd about the nuclei. I have a few thoughts about what we’re seeing.

To catch Beth up, what we’ve done here is use a constitutive KRAB-dCas9 CRISPRi system to knock down expression of 5 genes in which we identified rare LoF variants associated with cell morphology features. Cells have undergone chemical selection for ~1 week, and knockdown of each gene was validated by qPCR before Cell Painting.

It could be physiological. I suppose, depending on which wells were being used for assay dev, knockdown of these genes could elicit a strong nuclear phenotype (however, the cells under bright field and while growing behave normally). I would be curious if this same phenotype is observed across the other knockdowns or in the control (non targeting sgRNA) wells.
There is some evidence that dCas9 may be toxic to stem cells. Possibly we are observing this here now that we have greater resolution in the data, even though it is not something seen by eye under bright field. Perhaps an abundance of this protein floating around the nucleus has detrimental effects in some population of the cells.

My initial thought is to keep all of them. Since it could be driven by the physiology, I’m hesitant to toss any out.

Perhaps if it’s not too much extra work, looking at wells from 2-3 different conditions could help tell us whether these kinds of things are seen broadly across the plate.


bethac07 commented 2 years ago

I've had a chance to dig into it a bit further now, and thankfully, it seems to just have been a technical artifact from a bad analysis setting. Briefly, after whole-plate illumination correction we nearly always do an "enhancement" on the nuclei channel - it helps sharpen nuclear edges a bit, removes large debris, etc. In this particular case though, it seemed to be removing real signal and leading to this effect (for at least many of the cases). I switched to a "gentler" method of doing background removal and it seems to be performing much better on this particular data! I'm hopeful I can get analysis started today and backends by tomorrow. (cc @shntnu)

bethac07 commented 2 years ago

There are some QC issues with the plate that I'm noticing - there are a few wells just with standard "schmutz" (the bright blue/yellow bits), but additionally something was in the light path for a good chunk of the plate, and unfortunately, not statically, but moving around. It shows below as a red "band", but it's not actually bright, but it's that it seems to block the signal more in the blue/green part of the spectrum, so it's a "hole" we have to fill instead, which is much harder (if it were truly bright we can just block it out, which is my plan for the schmutz). Doing per field background calculation can help a bit with it but not fully. I don't think there's anything to do here but proceed, but FYI.

bethac07 commented 2 years ago

I chatted with the other image analysts, and none of us could come up with a good way to solve this floating debris issue, nor had we come across it before - we've had images where the debris was fixed and static, but never this particular issue. It does seem to be relatively consistent across the sites within a given well, just not across wells on the plate.

Basically, there are going to be a couple of options:

1) Live with it - if you're planning to look at per-well aggregated features, we could consider median-aggregating rather than mean aggregating. If you're looking at single-cell data, consider clipping the ends of the cell distribution (in feature space) for each well.
2) Image again. Since there also seems to be some in-well debris, maybe doing an extra change/wash of PBS (or whatever you're storing in) will help a bit, but more importantly we should hopefully not see this floating debris issue again.

@mtegtmey Is 2 possible and/or plausible? Or are our strong priors strong enough that we feel ok going ahead with the images we currently have? Let me know how you want me to proceed.
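As a rough illustration of option 1, the sketch below aggregates single-cell rows to per-well values with either the mean or the median, and optionally clips each well's distribution at chosen quantiles first. It is not the recipe's or pycytominer's implementation; the column names (`Metadata_Well`, `Cells_Intensity_Example`), the quantile cutoffs, and the toy values are assumptions made up for the example.

```python
import pandas as pd

def aggregate_wells(cells, features, how="median", clip_quantiles=None):
    """Collapse single-cell rows to one row per well.

    clip_quantiles, e.g. (0.05, 0.95), clips each feature within each well
    to those per-well quantiles before aggregating.
    """
    data = cells.copy()
    if clip_quantiles is not None:
        low_q, high_q = clip_quantiles
        per_well = cells.groupby("Metadata_Well")
        for feature in features:
            # Per-well quantiles, broadcast back to one value per cell.
            lower = per_well[feature].transform(lambda s: s.quantile(low_q))
            upper = per_well[feature].transform(lambda s: s.quantile(high_q))
            data[feature] = data[feature].clip(lower=lower, upper=upper)
    return data.groupby("Metadata_Well")[features].agg(how)

# Synthetic single-cell table: well A01 has one artifact-like outlier cell.
cells = pd.DataFrame({
    "Metadata_Well": ["A01"] * 5 + ["B01"] * 5,
    "Cells_Intensity_Example": [1.0, 1.0, 1.0, 1.0, 100.0,
                                2.0, 2.0, 2.0, 2.0, 2.0],
})
features = ["Cells_Intensity_Example"]

mean_profiles = aggregate_wells(cells, features, how="mean")
median_profiles = aggregate_wells(cells, features, how="median")
clipped_mean = aggregate_wells(cells, features, how="mean",
                               clip_quantiles=(0.05, 0.95))
print(mean_profiles, median_profiles, clipped_mean, sep="\n\n")
```

In this toy table the single outlier cell pulls the A01 mean far from the bulk of the cells, while the median is unaffected and the clipped mean sits in between.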

mtegtmey commented 2 years ago

@bethac07

If it’s consistent across wells, we may be ok to proceed using median-aggregated features.

My analysis will be fairly straightforward. I have between 5 and 50 features which I’m interested in comparing between the knockdown and control samples using something like a Mann-Whitney U test.
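For reference, a comparison of this kind could look like the sketch below, which runs SciPy's rank-based Mann-Whitney U test on one feature. The per-well values are synthetic and the group sizes are arbitrary; nothing here is taken from the actual plate.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic per-well values of a single feature (assumed, not real data).
control = rng.normal(loc=0.0, scale=1.0, size=20)
knockdown = rng.normal(loc=1.0, scale=1.0, size=20)

# Two-sided test: are the knockdown wells shifted relative to control?
result = mannwhitneyu(knockdown, control, alternative="two-sided")
print(f"U = {result.statistic:.1f}, p = {result.pvalue:.3g}")
```

With 5 to 50 features tested per gene, some form of multiple-testing correction on the resulting p-values would probably be worth considering.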

If all wells are treated the same we should be ok.

I did just confirm with one of my undergrads that I have a backup plate which could be imaged in the next day or so if I can coordinate with someone at the imaging platform, which could give us new data before Monday.

Let me know what you think!

On Sep 15, 2022, at 4:27 PM, Matthew Tegtmeyer @.***> wrote:

It’s really something I would have to defer to your team on.

We are looking at aggregated per well level features, so if clipping the ends and using median-aggregated scores is possible, then I would say go ahead.

I could generate new data. I’m away for a few weeks and could create a new plate the first week of October.

If the data is suspect, I would much rather delay by two weeks and have better images than to try digging on the computational side. I’m just unsure why this has happened, I looked at many different wells with my imaging parameters and saw nothing strange.


bethac07 commented 2 years ago

Unfortunately, the floaty thing is NOT consistent across wells- it's present in about the top quarter of the plate, and nowhere else. Since your replicates are pretty tightly grouped physically across the plate, that means it will affect some samples in every well and some not at all. If we're median aggregating, it's PROBABLY fine, because it definitely doesn't cover more than 50% of the cells, BUT if the features you're hoping to look at don't involve "100% of cells get 10% higher/lower (x)" but rather "10% of cells get 100% higher/lower (x)", by switching to median aggregation we lose the ability to detect that - does that make sense?
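The trade-off described above can be checked with a small simulation. This is a sketch on synthetic numbers (10,000 simulated cells with an arbitrary baseline of 100), not on the plate data: one scenario shifts every cell by 10%, the other doubles the value in 10% of cells.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 10_000
baseline = rng.normal(loc=100.0, scale=5.0, size=n_cells)

# Scenario A: 100% of cells get 10% higher.
all_shift = baseline * 1.10

# Scenario B: 10% of cells get 100% higher, the rest are unchanged.
subset_shift = baseline.copy()
subset_shift[: n_cells // 10] *= 2.0

def relative_change(values, reference, stat):
    """Fractional change of a summary statistic versus the baseline."""
    return stat(values) / stat(reference) - 1.0

for name, values in [("all cells +10%", all_shift),
                     ("10% of cells +100%", subset_shift)]:
    mean_change = relative_change(values, baseline, np.mean)
    median_change = relative_change(values, baseline, np.median)
    print(f"{name}: mean {mean_change:+.1%}, median {median_change:+.1%}")
```

Both scenarios move the mean by roughly 10%, but only the first moves the median by a comparable amount; in the second the median barely changes, which is the loss of sensitivity being described.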

If you have a backup plate, or even if it's possible to image that same plate again (the behavior of the floaty thing says to me it was some piece of dust that was on the bottom of the plate/fell onto the objective from the air, so as long as whoever is imaging it gives the plate bottom a quick swipe with some lens paper, we shouldn't have the same issue again - like I said, I've never seen it before in 6+ years analyzing plates from that microscope!), I think running the imaging one more time is our best chance of getting maximum-quality data. I don't think the data quality we're going to have even with the data we have now is going to be BAD, I'm just saying what's ideal. In practice, though, we rarely ever have ideal data, so if the decision is "live with it, and maybe consider median aggregating, but just keep an eye out for our phenotypes in that corner" (we can always run it both ways - mean AND median aggregated), I don't think it's ruined by any means - we should just keep in mind that it might be somewhat aberrated as we then look downstream.

mtegtmey commented 2 years ago

Ok - I've booked the Phenix for Monday morning. I'm working on coordinating a handoff to IP folks who can set up and run the imaging for the backup plate and transfer the images to you.

It could also be possible to simply remove any/all impacted wells from the downstream processing. Each condition has 28 replicate wells on the plate (56 for controls), so even chopping 1/4 of the total wells should still leave us with plenty per condition to accomplish our goals in this experiment.

bethac07 commented 2 years ago

It could also be possible to simply remove any/all impacted wells from the downstream processing. Each condition has 28 replicate wells on the plate (56 for controls), so even chopping 1/4 of the total wells should still leave us with plenty per condition to accomplish our goals in this experiment.

Wow, I hadn't realized there were so many - I still think it's worth re-imaging the backup plate (and thank you for arranging that!), because given how much work it takes to get to this point, if we can get cleaner data, we should get cleaner data. But if for some reason you decide not to re-image, that's good to know. (We might also think about using it as an internal test case to see how much an artifact of this kind REALLY messes with our ability to detect phenotypes.)

mtegtmey commented 2 years ago

@bethac07 new images should get transferred to /imaging/analysis/2018_06_05_cmQTL later this afternoon! I imagine everything would be all set for analysis by tomorrow morning. I'll update if there are any issues.

mtegtmey commented 2 years ago

@bethac07 were you able to see if the new image set has the same floating debris issue?

bethac07 commented 2 years ago

It does have some of them still, so I think they must have been inside the wells - I think possibly one or more solutions must not have been filtered fully. The second plate was definitely less bad. I analyzed both plates, just to get a sense of how much if at all this is going to affect the profiles - I should have profiles to Shantanu later today.

mtegtmey commented 2 years ago

Ok, thanks for the update! I wonder what's going on. Did it still tend to be in the upper corner?

bethac07 commented 2 years ago

Yup, it was the same part of the plate. I can post the plate images later today.

bethac07 commented 2 years ago

So the two plates look - pretty similar! Note that I had to remove the Costes features in both batches - I don't know why those are misbehave-y again, but FYI. If you want to play with these in Morpheus yourself, I've uploaded GCT files with Costes features removed. I know overall profile similarity isn't the goal here, but just wanted to graph it.

Original run, sorted by "sample ID"

Rerun, same

Top right plate corner, original run (I can send the full files and they are also on AWS, but since they're ~300MB each I can't attach them here).

Rerun

AnneCarpenter commented 2 years ago

Great! What's the next step/handoff to whom?

mtegtmey commented 2 years ago

I think the clustering by gene target is promising! If the profiles are ready to go, I'm happy to take them over and run the analysis. I'm not sure if they needed some more fine tuning on @shntnu's end.

bethac07 commented 2 years ago

Given that the clustering results seem to show reasonable signal, I don't think there's any harm in going forward with this data; it's not PERFECT data but it is likely well within the range of "can get reproducible results from". I believe your plan was to query particular features, so should be good to go for those; if you want to do more clustering work, the only thing I'd personally recommend is removing the Costes colocalization features, since they seemed to be poorly behaved in both batches here (this is not the first time we've seen this with Costes features). I removed them in the GCT files that I uploaded for playing with in Morpheus, I just didn't want to mess with the underlying profiles themselves. If @shntnu signs off, I think we're good!

AnneCarpenter commented 2 years ago

Who will be the one to check the genes vs the features? @MarziehHaghighi did such an analysis in our ORF data and might be called in (she's to be an author on this cmQTL paper) to do the exact same analysis here. Beth would it be clear to her where the files are? It's a bummer we are meeting at noon today because it would really help to know if this worked out or not to decide next steps for submitting the paper!

AnneCarpenter commented 2 years ago

(in a pinch we could look at the 3 features in Morpheus and just rank-order samples by each of those 3 and see if the hoped-for gene names are at the top/bottom of the list):

Cytoplasm_AreaShape_Zernike_9_3
Cells_RadialDistribution_RadialCV_Mito_1of4
Cytoplasm_Granularity_3_RNA
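Outside Morpheus, the same rank-ordering could be sketched in pandas as below. The table is a toy stand-in: the `Metadata_gene` column name, the placeholder labels `GENE_A` and `GENE_B`, and all values are assumptions for illustration, not results from the plate.

```python
import pandas as pd

# Toy per-well profiles (synthetic values, placeholder gene labels).
profiles = pd.DataFrame({
    "Metadata_gene": ["control", "control", "GENE_A", "GENE_A",
                      "GENE_B", "GENE_B"],
    "Cytoplasm_Granularity_3_RNA": [0.0, 0.2, 1.5, 1.7, -0.9, -1.1],
})

def rank_genes(profiles, feature, label="Metadata_gene"):
    """Median feature value per gene target, sorted from highest to lowest."""
    return profiles.groupby(label)[feature].median().sort_values(ascending=False)

ranking = rank_genes(profiles, "Cytoplasm_Granularity_3_RNA")
print(ranking)
```

If the expected gene sits at the top or bottom of `ranking` for its feature, that matches the quick check proposed here.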

mtegtmey commented 2 years ago

I will have results posted in a few hours! Working on it now.


mtegtmey commented 2 years ago

OK, here are the results of running a simple Welch's two-sample t-test for those features for which we had a rare variant burden association. There are two genes, PRLR and KCNK6, where the specific features weren't present in the most recent run. (I have only checked plate two, so I will see if those features come up in Plate 1.) These are very promising results! ZNF436 has only a very subtle change in this feature, but from the gene expression data, the knockdown efficiency was only about 10%.
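A Welch's t-test of this kind could be sketched as follows, using SciPy's `ttest_ind` with `equal_var=False`. The profiles are synthetic; only the group sizes (56 control wells, 28 wells per target) follow the counts mentioned earlier in the thread, and the column and feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def welch_per_feature(profiles, features, target, control,
                      label="Metadata_gene"):
    """Welch's t-test (unequal variances) of target vs control wells."""
    treated = profiles[profiles[label] == target]
    reference = profiles[profiles[label] == control]
    rows = []
    for feature in features:
        res = ttest_ind(treated[feature], reference[feature], equal_var=False)
        rows.append({
            "feature": feature,
            "delta_mean": treated[feature].mean() - reference[feature].mean(),
            "t": res.statistic,
            "p": res.pvalue,
        })
    return pd.DataFrame(rows)

# Synthetic per-well profiles (assumed values, placeholder names).
rng = np.random.default_rng(0)
profiles = pd.DataFrame({
    "Metadata_gene": ["control"] * 56 + ["GENE_A"] * 28,
    "Feature_1": np.concatenate([rng.normal(0, 1, 56), rng.normal(1, 2, 28)]),
    "Feature_2": rng.normal(0, 1, 84),
})

results = welch_per_feature(profiles, ["Feature_1", "Feature_2"],
                            target="GENE_A", control="control")
print(results)
```

Reporting `delta_mean` alongside the p-value keeps the direction of the effect visible, which matters when comparing against the direction seen in the rare variant association.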

bethac07 commented 2 years ago

There are two genes, PRLR and KCNK6, where the specific features weren't present in the most recent run. (I have only checked plate two, so I will see if those features come up in Plate 1.)

What were the features? Were you looking in the normalized.csv or the feature_selected.csv? Everything measured should be present in the normalized; a couple feature names changed slightly between 3 and 4, so if they aren't present there, LMK and I can try to help you find the matching ones.
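One way to check where a feature went is to compare the wanted names against the column headers of each file and, for anything missing, list columns from the same compartment and measurement family as possible renamed matches. The sketch below uses short made-up column lists in place of the real headers; with the actual files, the headers could be read with something like `pd.read_csv(path, nrows=0).columns`.

```python
def locate_features(wanted, available):
    """Map each wanted feature to an exact match, or to same-family candidates."""
    available = list(available)
    report = {}
    for name in wanted:
        if name in available:
            report[name] = [name]
            continue
        # Fall back to columns sharing "<Compartment>_<Measurement>_".
        compartment, measurement = name.split("_")[:2]
        prefix = f"{compartment}_{measurement}_"
        report[name] = [col for col in available if col.startswith(prefix)]
    return report

wanted = [
    "Cytoplasm_AreaShape_Zernike_9_3",
    "Cells_RadialDistribution_RadialCV_Mito_1of4",
    "Cytoplasm_Granularity_3_RNA",
]

# Toy headers (assumed): feature selection has dropped some columns.
normalized_columns = ["Metadata_Well"] + wanted
feature_selected_columns = ["Metadata_Well", "Cytoplasm_Granularity_3_RNA"]

in_normalized = locate_features(wanted, normalized_columns)
in_selected = locate_features(wanted, feature_selected_columns)
print(in_selected)
```

An empty list for a feature means neither it nor a same-family column survived in that file, which is the expected outcome when feature selection has removed it.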

mtegtmey commented 2 years ago

I was looking at feature_selected.csv. I’ll peek in the other file!


shntnu commented 2 years ago

normalized.csv

@bethac07 Thanks again for taking this off my plate 🙏

Could you clarify what this was normalized to? The controls, presumably?
How did you run the profiling – was it the profiling recipe? If so, which version of the recipe? (this is just for our notes)

mtegtmey commented 2 years ago

OK, I dug the other two feature associations from the normalized.csv data. We're 5/5 it seems on validating some of our top hits!

AnneCarpenter commented 2 years ago

I know I shouldn't be SHOCKED, but that's truly fantastic news!

bethac07 commented 2 years ago

@shntnu I did whole plate normalization, due to the small number of just overall samples (and because that's how previous batches were run, because there weren't negative controls), but I could rerun with normalize_negcon if we wanted. The version of the recipe/template and the config file are already committed to the repo. @mtegtmey Yay!
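To make the two options concrete, the sketch below standardizes features against either all wells on the plate or only the negative-control wells. It is a plain pandas z-score written for illustration, not the recipe's or pycytominer's code, and the column names and values are assumptions.

```python
import numpy as np
import pandas as pd

def standardize(profiles, features, reference=None):
    """Z-score features using all rows, or only rows where `reference` is True."""
    ref = profiles if reference is None else profiles.loc[reference]
    center = ref[features].mean()
    spread = ref[features].std()
    out = profiles.copy()
    out[features] = (profiles[features] - center) / spread
    return out

# Synthetic per-well profiles: 8 control wells, 8 wells of a placeholder target.
rng = np.random.default_rng(0)
profiles = pd.DataFrame({
    "Metadata_gene": ["control"] * 8 + ["GENE_A"] * 8,
    "Cells_Example_Feature": np.concatenate(
        [rng.normal(0.0, 1.0, 8), rng.normal(2.0, 1.0, 8)]),
})
features = ["Cells_Example_Feature"]
is_control = profiles["Metadata_gene"] == "control"

whole_plate = standardize(profiles, features)                     # all wells
negcon = standardize(profiles, features, reference=is_control)   # controls only
print(whole_plate.groupby("Metadata_gene")[features].mean())
print(negcon.groupby("Metadata_gene")[features].mean())
```

With whole-plate normalization the plate as a whole is centered at zero, so controls end up offset whenever most treatments shift a feature in the same direction; with control-based normalization the controls sit at zero and every treatment is expressed relative to them.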

shntnu commented 2 years ago

@mtegtmey

Wow! If it's easy, would it be possible to plot all 5 features x all 5 genes for the CRISPRi data? (and maybe later for the iPSC data)

It will be reassuring to know that we're seeing gene-specific effects here (although the clustering is already reassuring in that sense)

shntnu commented 2 years ago

I did whole plate normalization, due to the small number of just overall samples (and because that's how previous batches were run, because there weren't negative controls), but I could rerun with normalize_negcon if we wanted.

I think that makes sense, Beth, because (IIUC) the genes are not expected to be related in any way (if they were, whole plate would not be a good idea)

The version of the recipe/template and the config file are already committed to the repo.

Thank you!

mtegtmey commented 2 years ago

This may make things less exciting, but it does appear that many of these features change across the various genes. We are reassured that they seem to cluster by gene target in Morpheus, but we should think about these results. I suppose it's possible that knocking down these specific genes could impact each of these individual features.

From the wet-lab perspective, each of the cells were treated identically minus the different gene targets. The control samples are also infected with non-targeting sgRNAs, so they are exposed to the same chemical selection, as well as having free-floating dCas9 in the nucleus (which I'm sure causes some phenotype).

Screen Shot 2022-10-11 at 10 46 14 AM

mtegtmey commented 2 years ago

Do we have any strong feelings or thoughts on this?

Though we do see each perturbation impacting these features, the direction of the association with the change in the feature is the same, which is promising.

WASF2, PRLR, and TSPAN15 are all known to regulate proliferation/cell adhesion to varying degrees. So, I think completely knocking out these genes could very likely impact this specific set of features. But we could think about an alternative way to normalize the data if we feel a little uneasy about this.

AnneCarpenter commented 2 years ago

I don't have a clear picture of what to think. It comes down to how confident we are that the negative control is a reliable/good neg control (and not itself a weird outlier for some reason). I don't think you have any reason to think it isn't a good control. In a perfect world we'd have dozens of other genes in this plot (or other kinds of neg control) to reassure ourselves that each gene of interest is relatively unusual in its feature of interest; but we don't have this kind of data.

I agree, it's possible that these 5 genes are not expected to give random/different phenotypes if they share some biological functions. We already know it's the case that when adhesion/proliferation are impacted then tons of features all change.

It's reassuring that, at least in the two left-most cases, there's at least one sample that doesn't look like the others and instead looks closer to the neg control, which suggests it is not the case that ALL gene knockdowns cause the given phenotype.

Is each dot here a well, btw?

Our analysis up to this point says "changes in this gene cause changes in this phenotype" but we did not explicitly aim to choose examples where "this phenotype" would be super unique relative to all other genes (right?) So I guess it's possible to get a phenotype that's also impacted by lots of other genes (esp if that phenotype is something 'generic' like cell growth... I don't recall whether we felt that these 5 genes gave generic vs unique phenotypes in general/by eye?)

(I have no actual conclusion, just thinking out loud).

mtegtmey commented 2 years ago

Is each dot here a well, btw?

Right, I was using per well level data for the analysis.

Our analysis up to this point says "changes in this gene cause changes in this phenotype" but we did not explicitly aim to choose examples where "this phenotype" would be super unique relative to all other genes (right?) So I guess it's possible to get a phenotype that's also impacted by lots of other genes (esp if that phenotype is something 'generic' like cell growth... I don't recall whether we felt that these 5 genes gave generic vs unique phenotypes in general/by eye?)

They were honestly chosen because we didn't have anything else to explore. They were the rare variant burdens with either suggestive or significant associations to features that were the most easily interpretable from a biological sense (the genes themselves).

AnneCarpenter commented 2 years ago

Yep, ok, so it's not necessarily discouraging that many of them impacted similar phenotypes (since we never aimed at choosing any that were distinctive)

mtegtmey commented 2 years ago

I'm pulling together images to see if we can visualize the changes by eye. Just looking for feedback to see if I am approaching this the right way. Below are representative images from a control well and one with cells where I've knocked down TSPAN15. The feature associated with this gene is Cytoplasm_Granularity_3_RNA, so I'm looking at the SYTO stain and for what my brain thinks is 'granularity' in the cell body.

I think on average, cells have more little holes/crevices in the knockdown sample, but I'm sure I am just trying to convince myself that's the case. In the differential test, the TSPAN15 sample has a higher score for this feature relative to the control.

bethac07 commented 2 years ago

I think those holes are bigger than what Granularity_3 would likely be measuring. But one way to check if there's actually a visual difference is to scramble up a bunch of control and treated images and see what your blinded classification accuracy is.

For my money, I would bet that the only 2 potentially visible of the 5 phenotypes are the Eccentricity and the Radial CV.

AnneCarpenter commented 2 years ago

At a fine resolution based on this one pair of images, it does seem TSPAN15 is a bit blurrier in the cytoplasm, there are smaller textures in the control... but not sure if that's the right direction for this granularity metric (TSPAN15 KD is higher than control in the metric)

bethac07 commented 2 years ago

So Granularity 3 will mean "after removing larger features and subsampling (I would have to check the pipeline to recall if by a factor of 2 or 4), and removing the dots that are 1 pixel across or 2 pixels across, a relatively higher fraction of the data that's left is 3 pixels across and is removed when we remove dots 3 pixels across". I don't think a human brain can see that. (Erin has tried looking at some Granularity stuff by eye, unsuccessfully).
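The idea of a granularity spectrum can be illustrated with a simplified sketch: repeatedly remove structures of increasing size with a morphological opening and record what percentage of the total signal disappears at each step. This is not CellProfiler's MeasureGranularity implementation; the subsampling and background-removal steps are omitted, and the square structuring elements of width 3, 5, 7 are assumptions chosen for simplicity.

```python
import numpy as np
from scipy import ndimage as ndi

def granularity_spectrum(image, n_scales=3):
    """Percent of total signal removed at each successive opening scale."""
    image = np.asarray(image, dtype=float)
    total = image.sum()
    previous = image
    spectrum = []
    for scale in range(1, n_scales + 1):
        size = 2 * scale + 1  # 3x3, 5x5, 7x7 squares (assumed sizes)
        opened = ndi.grey_opening(image, size=(size, size))
        # Guard against edge effects so the signal never increases.
        opened = np.minimum(opened, previous)
        spectrum.append(100.0 * (previous.sum() - opened.sum()) / total)
        previous = opened
    return spectrum

# Toy image: one single-pixel dot and one 4x4 blob on a dark background.
image = np.zeros((20, 20))
image[3, 3] = 1.0
image[10:14, 10:14] = 1.0

spectrum = granularity_spectrum(image)
print([round(value, 2) for value in spectrum])  # [5.88, 94.12, 0.0]
```

In the toy image the dot is removed at the first scale and the blob at the second, so each entry of the spectrum reports how much signal lives in structures of roughly that size. A difference at one scale between two conditions is a statement about the relative abundance of texture at that size, which is why it is hard to judge by eye.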

mtegtmey commented 2 years ago

OK, this makes sense. I will follow your advice Beth and see about training a classifier on the images.

I did look at some wells comparing PRLR and controls (feature of interest is Cells_RadialDistribution_RadialCV_Mito_1of4). I do feel like here I can eyeball a difference in where the mitotracker is staining throughout the cell. We expect the PRLR_sgRNA to have a lower score for this feature relative to the Control_sgRNA.

bethac07 commented 2 years ago

I don't think you have to get as fancy as training a classifier, just scramble some images (by hiding the file names or renaming them). This is some code I wrote to randomize data; copy the mapping to an Excel sheet, and then you could just hide the real name in the Excel file, write down your guesses, and then un-hide the real name to see if you're right.
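The code referred to here is not reproduced in the thread. As a stand-in, a blinding step along these lines could be sketched as below: copy the images under neutral names in shuffled order and write the key to a separate CSV. The function name `blind_copy`, the file naming scheme, and the placeholder files in the demo are all hypothetical.

```python
import csv
import random
import shutil
import tempfile
from pathlib import Path

def blind_copy(image_paths, out_dir, key_path, seed=0):
    """Copy images under neutral names in shuffled order; save the key as CSV."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [Path(p) for p in image_paths]
    order = list(range(len(paths)))
    random.Random(seed).shuffle(order)
    mapping = []
    for new_index, old_index in enumerate(order, start=1):
        src = paths[old_index]
        dst = out_dir / f"blinded_{new_index:03d}{src.suffix}"
        shutil.copy2(src, dst)
        mapping.append((dst.name, str(src)))
    # Keep the key outside the blinded folder so it is not seen by accident.
    with open(key_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["blinded_name", "original_path"])
        writer.writerows(mapping)
    return mapping

# Demo on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    originals = []
    for condition in ("control", "knockdown"):
        for i in range(4):
            path = tmp / f"{condition}_{i}.tiff"
            path.write_bytes(b"")
            originals.append(path)
    mapping = blind_copy(originals, tmp / "blinded", tmp / "key.csv")
    blinded_names = sorted(p.name for p in (tmp / "blinded").iterdir())

print(blinded_names)
```

The key CSV can be opened in Excel with the `original_path` column hidden while guesses are written down, then un-hidden for scoring.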

RE: PRLR: That feature means that we expect in the mutant the mitochondrial distribution in the inside 1/4 of the cell (aka, immediately perinuclear) to be more symmetric (in the sense of not all the mitochondria on one side of the nucleus, but evenly distributed on all sides of the nucleus). I guess I COULD see that in the two images you posted, but there's a very good chance it's my brain tricking me, I'm not sure if I could pass a scramble test on it.

mtegtmey commented 2 years ago

I randomly sampled 8 images (4 from a control sgRNA and 4 from PRLR sgRNA). I was able to pick the four images from each condition without knowing their source. Here are the particular images I sampled in case anyone wants to check my sanity of distinguishing them.
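As a rough sense of how surprising a perfect score is, the arithmetic below counts the ways to split 8 images into two groups of 4. It assumes the scorer knows there are exactly 4 images per condition and is otherwise guessing at random, which is an assumption about how the test was run.

```python
from math import comb

n_images = 8
n_control = 4

# Number of distinct ways to pick which 4 of the 8 images are controls.
n_ways = comb(n_images, n_control)

# Only one of those assignments is fully correct.
p_perfect_by_chance = 1 / n_ways

print(n_ways)                         # 70
print(round(p_perfect_by_chance, 4))  # 0.0143
```

Under that assumption a perfect sort happens by chance about 1.4% of the time, so a single perfect score is suggestive, though a larger image set would make the check more convincing.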

AnneCarpenter commented 2 years ago

Matt tells me I got a perfect score too :D but didn't put my answer here so as not to contaminate anyone else who wants to try. FWIW, I was looking primarily at the 'stringiness' vs 'blobbiness' of mito esp in that ring around the nucleus, which I guess makes sense, as it (roughly) corresponds to being evenly distributed vs not as much.

broadinstitute / cmQTL

CRISPRi validation of rare variant associations #71