Supplemental figure exploring the similarity between guides targeting the same gene

Hi all, I have produced several groups of heatmaps on this topic and would appreciate your feedback on how best to address this issue. Ideally, selecting one of the options below would be a great help to me, as I could then move on to other issues.

Here are the guides from the hit gene list with highest signal: A549_profile_heatmap_guide_level_top_50_hits HeLa_DMEM_profile_heatmap_guide_level_top_50_hits HeLa_HPLM_profile_heatmap_guide_level_top_50_hits

Here are the guides with high cell counts (All four guides targeting the gene have around 100 cells): A549_profile_heatmap_guide_level_high_guide_count_hits HeLa_DMEM_profile_heatmap_guide_level_high_guide_count_hits HeLa_HPLM_profile_heatmap_guide_level_high_guide_count_hits

And here are 200 random guides (targeting 50 genes), selected from a list of hit genes common to all 3 datasets: A549_profile_heatmap_guide_level_Random 50 common hits_hits HeLa_DMEM_profile_heatmap_guide_level_Random 50 common hits_hits HeLa_HPLM_profile_heatmap_guide_level_Random 50 common hits_hits

As a reminder that the representation at the gene level is much higher than at the guide level, I am including these distributions, which are going to be part of a separate supplemental figure: A549_cells_gene_distribution HeLa_DMEM_cells_gene_distribution HeLa_HPLM_cells_gene_distribution

@AnneCarpenter @calvinjan @jt-neal @bethac07 @ErinWeisbart @mlozada21

Great to have these views! In the first plot (for A549), it seems most genes/guides look like most others, which made me wonder whether they are mostly (a) showing not much signal, or (b) showing a signal, but most of the 50 genes show the same signal (toxicity, most likely). In the 2nd & 3rd plots (for HeLa cells) we see most genes/guides fall into two classes that anti-correlate with each other. I mean, these plots look better because we see stronger guide similarity, but it’s a bit weird that there are only 2 classes of genes instead of a variety.

In both cases, it would be nice to see where these 50 are positioned in a map of the whole genome. Are they falling within 1-2 clusters of samples? If they are both falling into a big cluster of toxic/essential genes then actually the plots make sense - the samples are readily confused with each other because they are coming from the same 1-2 clusters! So of course the guides do not look self-similar compared to others in the same cluster.
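One way to put a number on "are they falling within 1-2 clusters" without drawing the whole-genome map: compare the mean pairwise correlation among the selected genes against same-sized random gene sets. This is only a minimal sketch on synthetic data; `gene_profiles`, `hit_idx`, and both function names are hypothetical, Pearson correlation is an assumption, and nothing here is taken from the notebook.

```python
import numpy as np


def mean_pairwise_corr(profiles):
    """Mean Pearson correlation over all distinct pairs of rows (genes)."""
    corr = np.corrcoef(profiles)
    upper = np.triu_indices_from(corr, k=1)
    return corr[upper].mean()


def hit_set_cohesion(gene_profiles, hit_idx, n_draws=200, seed=0):
    """Observed cohesion of the hit set, plus a null from random gene sets of equal size."""
    rng = np.random.default_rng(seed)
    observed = mean_pairwise_corr(gene_profiles[hit_idx])
    null = np.array([
        mean_pairwise_corr(
            gene_profiles[rng.choice(len(gene_profiles), size=len(hit_idx), replace=False)]
        )
        for _ in range(n_draws)
    ])
    return observed, null


# Synthetic stand-in for gene-level profiles (rows = genes, columns = features).
# The first 50 "hits" share one strong phenotype, mimicking a single toxicity-like cluster.
rng = np.random.default_rng(1)
n_genes, n_features, n_hits = 500, 40, 50
gene_profiles = rng.normal(size=(n_genes, n_features))
shared_phenotype = 2.0 * rng.normal(size=n_features)
gene_profiles[:n_hits] += shared_phenotype
hit_idx = np.arange(n_hits)

observed, null = hit_set_cohesion(gene_profiles, hit_idx)
print(f"hit set: {observed:.2f}, random sets: {null.mean():.2f} (max {null.max():.2f})")
```

If the real top-50 set sits far above the random-set distribution, that would support the idea that the hits come from the same 1-2 clusters.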

Continuing on, the high-cell-count and random hits plots seem not so bad; some proportion of genes have a signal but there’s a lot of variety in what the profile is so things don’t correlate with each other much. I wish the proportion with a signal was higher for the random hits, since they’ve all been deemed hits.

I wonder: what is the metric used in these plots (some metric of correlation, but which one?), and does @shntnu have any guidance on whether it’s appropriate? I suppose that may explain why things don’t look as we expect/hope.
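To make the metric question concrete, here is a small sketch of how three common choices behave on the same guide-level profiles. It is not the notebook's method (the thread does not say which metric was used); the data are synthetic, and all names are hypothetical. The point it illustrates: Pearson and Spearman ignore a constant offset added to every feature, while uncentered cosine similarity does not, so an offset alone can make everything look alike.

```python
import numpy as np


def pearson_matrix(profiles):
    """Pearson correlation between every pair of rows (one row per guide)."""
    return np.corrcoef(profiles)


def spearman_matrix(profiles):
    """Spearman correlation: Pearson on per-row ranks (assumes no tied values)."""
    ranks = np.argsort(np.argsort(profiles, axis=1), axis=1).astype(float)
    return np.corrcoef(ranks)


def cosine_matrix(profiles):
    """Cosine similarity: like Pearson, but without centering each row."""
    unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    return unit @ unit.T


def mean_off_diagonal(sim):
    """Average similarity over all pairs of different guides."""
    mask = ~np.eye(sim.shape[0], dtype=bool)
    return sim[mask].mean()


# Synthetic stand-in: 5 genes x 4 guides, each guide = gene signal + noise.
rng = np.random.default_rng(0)
n_genes, guides_per_gene, n_features = 5, 4, 30
gene_signal = rng.normal(size=(n_genes, n_features))
profiles = np.repeat(gene_signal, guides_per_gene, axis=0)
profiles = profiles + 0.5 * rng.normal(size=profiles.shape)

# Same profiles with a constant offset on every feature (e.g. un-centered features).
shifted = profiles + 3.0

for name, fn in [("pearson", pearson_matrix),
                 ("spearman", spearman_matrix),
                 ("cosine", cosine_matrix)]:
    before = mean_off_diagonal(fn(profiles))
    after = mean_off_diagonal(fn(shifted))
    print(f"{name:9s} mean off-diagonal: {before:.2f} -> {after:.2f}")
```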

(As a reminder, here are the hypotheses I posted in Slack; your analysis addresses most of them. I think seeing where genes fall in an overall heat map, or a heat map with a random sampling of 1-2000 genes, would be helpful:

- cell count per guide is pretty low (I think I heard a hundred-ish?): not sure what to do about that without repeating the experiment with a much higher cell count
- A549 data quality is not great compared to HeLa: address this by repeating the analysis in HeLa
- our hit calling is not ideal and doesn’t pick 50 genes that have good signals: not sure how to address
- top 50 genes have a dramatic impact on morphology, but most genes give the same phenotype as each other (e.g., a similar type of toxicity): in the whole-genome heatmap, do they cluster with each other?)

@MerajRamezani Is the code for this somewhere? The big red/blue split in the HeLa top 50 graphs feels funny to me; my "something weird" sense is tingling.

@bethac07 In my opinion, this happens because certain groups of perturbations are highly visible, with high scores, since they target the compartments we are labeling in the assay. So by selecting the top scores, we are narrowing the perturbations down to a few groups. I have looked at the gene-level heatmaps, and they are consistent with these heatmaps. Here is where you can find the code: https://github.com/broadinstitute/2022_PERISCOPE/blob/main/Supplemental_5/cp257_dmem_guide_check_heatmap_hits.ipynb

Great, thanks for pushing those! I'll try to take a look in the next couple of days.

> @bethac07 In my opinion this happens because certain groups of perturbations are highly visible with high scores because they target the compartments we are labeling in the assay. So by selecting top scores we are narrowing the perturbations into a few groups.

Right, and I think you can make that argument more in the "high guide count" plots, but in the "top 50" plots, we essentially only have <5 phenotypes, or we wouldn't have such big, such dark red blocks. Those are very, VERY similar. So I'm just worried that the phenotype we have there is "dead/dying", which is perhaps not a super interesting phenotype to show. There is just so little of what we expected (tight dark-red blocks-of-4) and so much of what we would NOT expect (tons of off-diagonal similarity) that I'm suspicious. But code review is always a good idea, whether the results turn out the way we want them to or not!
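The "blocks-of-4 versus off-diagonal similarity" expectation can be summarized in two numbers per heatmap: the mean similarity between guides targeting the same gene, and the mean similarity between guides targeting different genes. The sketch below is an assumption-laden illustration on synthetic data (Pearson correlation, hypothetical names such as `within_between_similarity`), not the code behind the plots.

```python
import numpy as np


def within_between_similarity(corr, gene_labels):
    """Mean similarity for same-gene guide pairs vs. different-gene guide pairs."""
    gene_labels = np.asarray(gene_labels)
    same_gene = gene_labels[:, None] == gene_labels[None, :]
    not_self = ~np.eye(len(gene_labels), dtype=bool)
    within = corr[same_gene & not_self].mean()   # the "blocks-of-4", minus the diagonal
    between = corr[~same_gene].mean()            # everything off the blocks
    return within, between


rng = np.random.default_rng(0)
n_genes, guides_per_gene, n_features = 10, 4, 50
gene_labels = np.repeat(np.arange(n_genes), guides_per_gene)
noise = 0.5 * rng.normal(size=(n_genes * guides_per_gene, n_features))

# Scenario A: every gene has its own phenotype (what we hoped to see).
distinct = rng.normal(size=(n_genes, n_features))
profiles_a = np.repeat(distinct, guides_per_gene, axis=0) + noise

# Scenario B: one strong shared phenotype (e.g. toxicity-like) plus a weak gene-specific part.
shared = 2.0 * rng.normal(size=n_features)
weak_specific = 0.3 * rng.normal(size=(n_genes, n_features))
profiles_b = shared + np.repeat(weak_specific, guides_per_gene, axis=0) + noise

within_a, between_a = within_between_similarity(np.corrcoef(profiles_a), gene_labels)
within_b, between_b = within_between_similarity(np.corrcoef(profiles_b), gene_labels)
print(f"distinct phenotypes: within={within_a:.2f}, between={between_a:.2f}")
print(f"shared phenotype:    within={within_b:.2f}, between={between_b:.2f}")
```

A large within-minus-between gap corresponds to tight blocks on the diagonal; a high between-gene value with a small gap corresponds to the big dark-red blocks described above.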
