Closed AnneCarpenter closed 3 months ago
Email correspondence so far...
Hello, We've communicated before about spreading the word on faculty openings at Duke. I'm excited because I just discovered your papers about INSYN1. I wonder if our recent results might spark a collaboration?
To summarize a lot of work, we knocked down 8,000 genes one by one in U2OS human cells, and then clustered the genes based on having similar morphological impact (using the Cell Painting microscopy assay that labels major organelles).
We found a tight cluster of RAB40B and XLOC_I2_008134 which have the opposite morphological effects of a cluster of PIK3R3 and INSYN1.
I wonder if all of these connections are well-known already?
If not, we would be delighted to work together if you'd like to design an experiment to followup/confirm the connection and add to a paper we are beginning to write up about the large dataset. It always helps in such papers to make a new discovery that can be confirmed (even if it's not a dramatic finding).
All the best, please let me know if you would like to talk further!
Cheers, Anne
Anne E. Carpenter, Ph.D. (she/her) Institute Scientist & Senior Director, Imaging Platform Fellow: Merkin Institute, AIMBE, SLAS, Massachusetts Academy of Sciences, Royal Microscopy Society Broad Institute of Harvard and MIT 415 Main Street, Cambridge MA 02142
phone: (617) 714-7750 anne@broadinstitute.org http://www.broadinstitute.org/~anne
Anne Carpenter anne@broadinstitute.org Fri, Dec 1, 10:22 AM to scott.soderling
P.s. other interactions include: a connection between INSYN1 and RNF41, STYK1 and HOOK2. And, MYT1 is another gene that looks like it has the opposite impact from INSYN1.
just in case those ring any bells or seem worth pursuing!
Scott Soderling, Ph.D. Sat, Dec 2, 9:58 AM (13 days ago) to me
Hi Anne,
Sounds like a REALLY cool project! We actually don’t have any active projects right now on Insyn1 in the lab. These look like totally new links. Our work on Insyn1 was related to its function in neuronal inhibitory synapses and I don’t recall these coming out in the proximity proteomics from neurons. It looks like STYK1 and PIK3R3 may both be related to the regulation of the PI3K pathway, so that is interesting.
We have some reagents in the lab for Insyn1 and would be happy to send you any of them. We could potentially perform BioID in U2OS cells for InSyn1 if that would be helpful for you. We have a new method for doing proximity proteomics using CRISPR engineering of the native gene/protein. Or we would also be happy to send you constructs for InSyn1 BioID.
You may already be thinking about this, but we also have a fantastic new faculty member here from the MIT CSAIL that has developed new computational tools for PPI prediction using language models and protein structure (Rohit Singh). Might be interesting to run your clusters at large scale for PPIs. He would probably be interested in collaborating with you.
Best,
Scott
Anne Carpenter anne@broadinstitute.org Sat, Dec 2, 7:31 PM (13 days ago) to Ph.D.
Ah, too bad it's no longer a focus for you. My lab is computational, so sending reagents here won't help :D
Ideally, we'd test some functional relationships among these proteins - I'm not sure we would expect a direct protein-protein interaction (at the least we ought to check PPI databases for existing data to see if it is already known!). Will put that on my list.
Do you know other labs who still focus on InSyn1 function?
STATUS: waiting to be sure that these results are actually in the freshest version of the data before proceeding.
info provided here now https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/8
Unfortunately,
Some additional facts:
I pinged Soderling today to find another lab that studies this, because it sounds like you are confirming the relationship DOES exist in JUMP ORF data.
Yes, the relationship does exist in JUMP ORF:
JUMP ORF:
JUMP CRISPR:
If we can find someone to do followup experiments, we should check:
@holgerhennig Can your team answer that last Q - is there KG support for these interactions? That will help us know if we should write this up as already-known validation or increase our efforts to pursue followup experiments if it is novel.
@AnneCarpenter We'll look into whether there's KG support for these interactions and get back to you asap
Got a response from Rytis in the meantime: Hi Anne,
No, we did not know about a potential connection between Rab40b and INSYN1 or PIK3R3. Would be happy to chat with you to see whether we can do a couple of experiments to confirm the connections. Let me know what times work for you next week and I can set up a zoom link.
How about other Rab40 isoforms (Rab40b and Rab40c)? Do they cluster as well? From our experience, there is a substantial amount of functional redundancy between Rab40a, Rab40b and Rab40c. Rytis
We will schedule a meeting and, depending on whether Alan or Niranj can attend, I can ask one of them to look up the other Rab40's.
I will meet with Rytis Jan 24th to brainstorm followup experiments.
@tjetkaARD can you please edit your comment above? I think it's a typo because two lines say "RAB40B vs. INSYN1". Given we are only looking at ORFs here, I think my previous query to Niranj about XLOC_I2_008134 is now irrelevant, but if you can confirm XLOC_I2_008134 is not in the ORF data, that would be helpful! He also wanted to see if we have data for Rab40b and Rab40c (ORF or CRISPR?)
It would be great to show him a heatmap of all of these genes (while marking which ones pass our threshold for "has a phenotype"). Probably for ORF data only, but if it's easy to make the CRISPR one we can do that just to be complete, in case Rab40b or c is interesting here.
@afermg Given this finding is in ORF data only, it would be great to use your web tool to look at images and most-similar/anti-similar genes. Please LMK when it's available. You're welcome to join the Jan 24 meeting as well if you like, similar to that FOXO one we had recently.
@AnneCarpenter @holgerhennig For the three genes PIK3R3, INSYN1, RAB40B we do not find "unsupervised" explanations in the KG for any pair of these genes, so any connection between them is new from the KG perspective. Of note, INSYN1 is not in the list of 4850 genes having a signal in ORF. As for XLOC_I2_008134, I cannot find any trace of this gene (or pseudogene?) identity; can you point to its NCBI Gene ID?
I copied that one from an early heatmap (which is likely 'wrong'/outdated), so I guess let's abandon it. I believe someone else tried to find it in our data and the reagent doesn't exist (making it possibly a typo). But anyway, if we are not seeing it as a top-similar gene in ORF or CRISPR, we can abandon it. Thank you for checking!
@AnneCarpenter Answering in order:
Hence, showing correlation plot for ORFs (all replicable):
and correlation plot for CRISPRs (only RAB40B p-value replicable):
Unless Niranj points out that the above file is the incorrect one, all the genes in the ORFs heatmap are replicable.
@AnneCarpenter The ORF data interface is now available on broad.io/orf. The easiest way to search for things is querying genes on the search box, but you can also run queries by editing the gene name in the following URL: https://lite.datasette.io/?install=datasette-json-html&parquet=https://zenodo.org/api/records/10542737/files/orf.parquet/content#/data/content?Gene%2FCompound__exact=RAB40B&_sort=Distance
Just replace "RAB40B" with other genes of interest.
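For convenience, the URL editing can be scripted; a minimal sketch using only the base URL shown above (the `orf_query_url` helper name is mine, not part of the tool):

```python
from urllib.parse import quote

# Base datasette-lite URL from the comment above
BASE = ("https://lite.datasette.io/?install=datasette-json-html"
        "&parquet=https://zenodo.org/api/records/10542737/files/orf.parquet/content")

def orf_query_url(gene: str) -> str:
    """Return a URL that filters the ORF table on one gene, sorted by Distance."""
    return f"{BASE}#/data/content?Gene%2FCompound__exact={quote(gene)}&_sort=Distance"

print(orf_query_url("INSYN1"))
```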
AMAZING! I wish Github allowed gifs to express Kermit-the-frog level excitement https://giphy.com/clips/buzzfeed-buzzfeed-celeb-the-muppets-find-out-which-muppet-they-really-are-zTF0aDwhF239JQzIXw
We talked about shifting distance back to the -1 to +1 similarity metric that we've always used... but now I see values of 1300 for RAB40B, so I'm wondering whether what you're showing is just our correlation metric * 1000?
Yes, actually I started using the metric Alex uses (cosine distance), ranging from 0 to 2, to be consistent with his code. I was testing whether it made a big difference to keep it in 1e3 units, but it doesn't look ideal. I'll set it back to decimals, but it will contain 10 digits (we can't limit them because it comes from the lite.datasette internal code). I'll update it and set a reminder here.
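As a sanity check on the units: cosine distance and the old similarity metric are related by `similarity = 1 - distance`, so if the stored values are cosine distance scaled by 1e3, a value of 1300 would correspond to a similarity of -0.3. A quick sketch (the function name is mine):

```python
import numpy as np

def distance_to_similarity(d, milli_units: bool = False):
    """Convert cosine distance (0..2) back to cosine similarity (-1..+1).
    Set milli_units=True if values were stored scaled by 1e3 (e.g. 1300)."""
    d = np.asarray(d, dtype=float)
    if milli_units:
        d = d / 1000.0
    return 1.0 - d
```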
Hi @tjetkaARD, just to clarify; which correlation metric did you use for the analysis? Thanks!
Maybe even more to the point: where is the file with the similarity metrics that we currently recommend using for this paper?
Here is the link to Niranj's comment pointing to the latest version: https://github.com/jump-cellpainting/morphmap/issues/17#issuecomment-1750808647. That is the one I used. I still have the impression that using correlation (e.g., rank vs cosine distance) will change the ranking order.
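To illustrate why the two metrics can reorder rankings: Pearson correlation is just cosine similarity computed after mean-centering each profile, so they diverge whenever profiles have non-zero means. A small sketch (helper names are mine):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two feature vectors, in [-1, 1]."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b) -> float:
    """Pearson correlation == cosine similarity after mean-centering each vector,
    which is why the two metrics can rank gene pairs differently."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return cosine_similarity(a - a.mean(), b - b.mean())
```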
Alán and I met with Rytis and Andrew Neumann today! To summarize next steps once Tomasz confirms the location of the data being used for all the heatmaps:
Side notes for when we write it up:
@afermg - we do use cosine similarity in all computations, as agreed with Niranj. Regarding cosine vs. correlation - we can do a comparison on a small subset.
@AnneCarpenter - the general cosine-similarity matrix is too big to be stored on git (hence it is ignored in commits). I talked with @niranjchandrasekaran about establishing some place to put it, but I did not get any guidelines. Temporarily, until Monday, you can use the following link https://drive.google.com/drive/folders/1ViuT6pU7PWcjRUVcfb80KN6zXRTdIEwP?usp=sharing
Afterwards, we will clean the whole project and define some stable links.
@tjetkaARD Thanks for the info. Could you just check that the checksum matches? Our filename is transformed_inf_eff_filtered.parquet, with checksum e1ef247bf4725f35ab8e4793e6289600bdc71ff9. Also, do you have the code you used for cosine similarity somewhere public? That would make it much easier to find the source of discordance. Thanks
And in case it is of interest, our implementation is shown on https://github.com/broadinstitute/monorepo/blob/5aaa6073612515e1c7a096efef4708a9c6945ce1/libs/jump_rr/src/calculate.py#L96.
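For whoever runs the checksum comparison above: the 40-hex-digit value looks like SHA-1 (an assumption on my part), so the check could look like this:

```python
import hashlib

# Checksum quoted in the thread for transformed_inf_eff_filtered.parquet
EXPECTED = "e1ef247bf4725f35ab8e4793e6289600bdc71ff9"

def sha1_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a large parquet file need not fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

# sha1_of("transformed_inf_eff_filtered.parquet") == EXPECTED if the files match
```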
small note: I don't think we need to do a comparison between cosine similarity and correlation: we've done so in the past (though I don't recall the outcome!). We should stick with cosine similarity for consistency. @afermg If Qs remain, we may need to wait for Shantanu to be available next week, since Niranj is out.
- @afermg - generate a list of top morph features that distinguish RAB40B/C from INSYN1/PIK3R3 (or whatever the cluster is) and maybe grab images that demonstrate the phenotype if it's obvious/interpretable.
The clusters/correlations were generated using the non-interpretable features. Looking at all features for all pairs of perturbations is too computationally expensive and unlikely to give meaningful results. What I have done to produce broad.io/orf_feature is to select the N (25) most distinct perturbations for every grouped feature set (e.g., I group all Texture_Granularity_8 for each mask, such as Nuclei, and channel, such as DNA). You can see the query that searches for distinctive features for all four genes here. I do not find any outstanding feature for INSYN1/PIK3R3, but I will wait for confirmation that we are using the same dataset before diving deeper into this, or doing the explicit comparison for these four perturbations.
Great!
Oh! I see what you've done now; you are solving the problem: for a given gene, what are the top features that distinguish it from neg controls? And you find 2 features each for RAB40B and C but none for the other two genes, is that right? I think I may be mistaken though!
I think it would be fine to average the two genes within each cluster first, then do a comparison so that final comparison is only pairwise: RAB40B/C averaged profile vs INSYN1/PIK3R3 averaged profile and find those top features.
I understand the dilemma that this may not be a thing that can be in a portal because it's too much to compute in advance. That's ok, we are just doing a few vignettes for the paper for now and can think about what to make available more broadly later.
(also fine to wait until the data settles, although this is ORFs which we are more confident in vs CRISPRs)
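The averaging idea above could look something like this, assuming a pandas DataFrame of per-gene consensus profiles with a `Metadata_Symbol` column (the column name and helper are assumptions, not the actual pipeline):

```python
import pandas as pd

def cluster_mean_difference(profiles: pd.DataFrame,
                            cluster_a=("RAB40B", "RAB40C"),
                            cluster_b=("INSYN1", "PIK3R3")) -> pd.Series:
    """Average each cluster's profiles first, then take the per-feature
    difference so the final comparison is a single pairwise contrast."""
    feats = [c for c in profiles.columns if not c.startswith("Metadata")]
    mean_a = profiles.loc[profiles["Metadata_Symbol"].isin(cluster_a), feats].mean()
    mean_b = profiles.loc[profiles["Metadata_Symbol"].isin(cluster_b), feats].mean()
    # Sort by magnitude so the most discriminating features come first
    return (mean_a - mean_b).sort_values(key=abs, ascending=False)
```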
That is right. My take is that, although the perturbations may be directly comparable, the difference is likely due to batch effects rather than biology. I would expect their distinctive features to be more reliable when compared to the negative control.
I performed the analysis and pushed the resulting difference in features here. That is, median(RAB40*)-median(INS,PIK). The numbers are there, but they do not mean much without comparing them to perturbations that show no phenotype. I'll leave it for now to get input from someone who has found a decent way to find which feature differences are significant.
Hm, I was going to confidently say not to worry because the batch correction was pretty decent for this dataset... but then looked at your features and "total intensity" features being top ranked would certainly support your hypothesis! Unless maybe you are using uncorrected profiles; you had pointed to the Oct 2023 ORF profiles and it's plausible that those are the latest/correct (and corrected!) ones.
I think you're comparing to neg controls on the same plate, and that's why you think it will be less susceptible to batch effects?
Are the units here standard deviations? So, for example, Image_ImageQuality_TotalIntensity_OrigER is 5,000 standard deviations lower for one cluster vs the other?
(and we only want RAB40B and C not A, but I'm sure that's not the only problem here!)
The profiles are corrected up to the point before Harmony, provided by John here. If they were corrected with in-plate negcons a priori, then we should not need to subtract that again. I'm not sure what the correction method was, but we probably need some input/code on how this particular dataset was corrected.
Are the units here standard deviations? So, for example, Image_ImageQuality_TotalIntensity_OrigER is 5,000 standard deviations lower for one cluster vs the other?
I am unsure; I don't have much info on the correction methods beyond them being "cell count adjustment and mean correction". The numbers do feel too high to be standard deviations.
(and we only want RAB40B and C not A, but I'm sure that's not the only problem here!)
Yeah, that's just B and C, I should probably specify somewhere.
Ah, I see - good point; our batch correction makes the feature names non-meaningful, so if we are looking at feature names, by definition they are not batch corrected.
I think we need to call in support from @shntnu to help us figure if there's any way to readily answer the Q "what phenotype does this RAB40B/C cluster have? (esp in relation to PIK3R3/INSYN1)" (in addition to perhaps answering the query we gave Niranj above)
for a given gene, what are the top features that distinguish it from neg controls? And you find 2 features each for RAB40B and C but none for the other two genes, is that right? I think I may be mistaken though!
I don't think that's what @afermg is doing here; he is doing this:
selecting the N (25) most distinct perturbations for every grouped feature set
I think he is doing top and bottom 25
e.g. this query will return 50 results, as will a query for every feature.
This is not the same as "for a given gene, what are the top features that distinguish it from neg controls"
Regarding the quandary about non-interpretable features, we concluded here that we have no choice https://github.com/jump-cellpainting/morphmap/issues/17#issuecomment-1883721429
so we need to stick with the non-Harmony-corrected features for this analysis, if we do need feature names.
It sounds like what we want to do here is:
Do a Mann-Whitney test to find features that are different (then correct for multiple testing)
🤖 code below that would go at the end of compare_features.py
```python
import re

import pandas as pd
import polars as pl
import scipy.stats as stats
from statsmodels.stats.multitest import multipletests

# Feature columns are everything that is not metadata
features = [col for col in clusters.columns if not re.match("^Metadata", col)]

results = []
for feature in features:
    cluster_A = clusters.filter(pl.col("Metadata_Cluster") == "RAB40B/C")[feature]
    cluster_B = clusters.filter(pl.col("Metadata_Cluster") == "INS/PIK")[feature]
    # Welch's t-test (unequal variances); swap in stats.mannwhitneyu if the
    # distributions look non-normal
    t_stat, p_val = stats.ttest_ind(cluster_A, cluster_B, equal_var=False)
    results.append({"Feature": feature, "T-Statistic": t_stat, "P-Value": p_val})

results_df = pd.DataFrame(results).dropna()  # drop features where the test failed

# Benjamini-Hochberg correction for multiple testing
results_df["Corrected P-Value"] = multipletests(results_df["P-Value"], method="fdr_bh")[1]

# Keep features significant at FDR < 0.05
significant_features = results_df[results_df["Corrected P-Value"] < 0.05]
```
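Since the step above names a Mann-Whitney test while the snippet uses Welch's t-test, here is a hedged sketch of the nonparametric version with the same FDR correction (the helper name and dict-of-arrays input shape are mine):

```python
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def mw_significant_features(group_a: dict, group_b: dict, alpha: float = 0.05):
    """group_a / group_b map feature name -> array of replicate values.
    Returns (feature, BH-corrected p-value) pairs passing the FDR threshold."""
    feats = sorted(set(group_a) & set(group_b))
    pvals = [mannwhitneyu(group_a[f], group_b[f], alternative="two-sided").pvalue
             for f in feats]
    corrected = multipletests(pvals, method="fdr_bh")[1]
    return [(f, p) for f, p in zip(feats, corrected) if p < alpha]
```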
Yes, the latter would be ideal "what are the features that most distinguish these two groups of genes".
The other Q - "for a given gene, what are the top features that distinguish it from neg controls" - is another thing we often want (and would ideally be available in a web tool eventually for all genes). It could be a sort-of shortcut to answer the above question in a pinch, but it's messier; for this one vignette, it's better to instead do the Mann-Whitney-style approach you suggest.
I incorporated Shantanu's approach; it yields 1040 features with a significant (corrected) p-value, which you can explore here. I could not find a particularly overrepresented feature set at a glance, and I am unsure how to define which features are important and which are not. Does ranking these features by statistical significance mean anything relevant in this context?
Regarding the question of comparing gene features relative to their negative control: I was under the assumption that one of the correction methods made things relative to their respective negcons, or does this only happen during whitening/sphering? The feature-querying datasette interface works under the assumption that features are corrected based on their corresponding negative controls, and are thus comparable/rankable. Two pairs of groups may have very different features, but unlinking those differences from batch effects while retaining interpretability is, I've found, the tricky step.
---- (adding some questions needed to know which images/tests may be useful for finding biological significance)
Regarding the question "for a given gene, what are the top features that distinguish it from neg controls": to group perturbations individually we usually combine multiple plates/batches. Is a MW test on the multi-plate control and perturbation pairs of sets enough to find significant features? Can we actually know if we are distinguishing plate effect from single-feature significance?
I could have a look at the ~50 (flattened) images + ~150-250(?) controls for each perturbation, but so far that hasn't been a successful approach. I'll think of a different analysis to perform if it yields a more manageable number of features, but let me know if there is any low-hanging-fruit analysis to perform with the current datasets.
Let's chat in checkin!
In the meantime, here are the top 38 features in terms of T statistic (which is reasonably correlated with the p values)
If I had to summarize, this seems like a case where the localization of a lot of stains is changing and thus seems to be a pretty broad phenotype... many of these are the distribution of all organelles relative to the cell edge. But cell size and related measures are not super high on the list so it seems more about shifting localization than major changes in cell shape (other than the nucleus - that has some shape change features prominent).
The images for the other cluster members don't come up in broad.io/orf for RAB40B as top-ranked so I can't look at images directly to affirm these thoughts.
I incorporated Shantanu's approach, it yields 1040 features with a significant (corrected) p-value, you can explore them here.
Please inspect the code closely; it was bot-generated. It looks right to me, but I don't know polars and rarely use statsmodels, so I can't be confident in the code.
I could not find a particularly overrepresented feature set at a glance, I am doubtful as to how to define which features are important and which are not. Does ranking these features by statistical significance mean anything relevant in this context?
Yes, ranking by corrected p-value is reasonable. If you had different numbers of samples per test, it would be less straightforward because there wouldn't be a 1-1 mapping between effect size and p-value, but that's not an issue here.
In regards to the questions of comparing gene features relative to their negative control, I was under the assumption that one of the correction methods made things relative to their respective negcons, or does this only happen during whitening/sphering?
You are correct – the normalized features are relative to the negcons.
However, you would still need to compare group 1 (perturbation replicates) with group 2 (control replicates, from the plates the perturbation replicates come from) to get a list of features that are significantly different between the two groups.
The feature-querying datasette interface works under the assumption that features are corrected based on their corresponding negative controls, and thus comparable/rankable.
This is a valid assumption.
Regarding the question "for a given gene, what are the top features that distinguish it from neg controls": to group perturbations individually we usually combine multiple plates/batches. Is a MW test on the multi-plate control and perturbation pairs of sets enough to find significant features? Can we actually know if we are distinguishing plate effect from single-feature significance?
We are pooling across plates for both perturbations AND controls, so the chances of a type I error (false positive) are low. So if you see a significant feature, it is unlikely to be a plate effect. I have not thought about this too deeply, so please ponder whether anything seems amiss.
As per check-in and my meeting with Shantanu, the approach to this will be to include a p-corrected statistic and group associated sets of features:
Clarification:
then correct the p values from step 5 (fdr_bh)
Step 2, I take it?
Sorry, step 3. Feel free to tag Alex for his input when you've written it up.
I believe this issue is still waiting for three things (though the 2nd one is for ORF only; no CRISPR analysis is needed, at least for now, because those profiles aren't finalized and we're pretty sure there aren't phenotypes there due to isoforms):
- @niranjchandrasekaran - see "could you please check the case of INSYN1..." Q above
- @afermg - check the data source Tomasz points to relative to your broad.io/orf and crispr sites to be sure they're generating the expected rank-ordered similarities; this will tell us what genes to convey to the collaborators (if not the original cluster shown above). Would also be nice to generate the full list of similar/opposite genes for each gene here, esp for RAB40A which is not like the others here.
- @afermg - generate a list of top morph features that distinguish RAB40B/C from INSYN1/PIK3R3 (or whatever the cluster is) and maybe grab images that demonstrate the phenotype if it's obvious/interpretable.
I don't think we are certain that we are using the same ORF dataset; I pointed towards it here. The simplest way to check is for someone on the other side to verify that the checksum of their file is e1ef247bf4725f35ab8e4793e6289600bdc71ff9, as mentioned above.
Here is how these connections look with the newest set of ORF and CRISPR profiles
ORF
|  | INSYN1 | PIK3R3 | RAB40A | RAB40B | RAB40C | XLOC_I2_008134 |
|---|---|---|---|---|---|---|
| INSYN1 | 1 | 0.26 | -0.11 | -0.42 | -0.23 | -0.36 |
| PIK3R3 |  | 1 | 0 | -0.34 | -0.21 | -0.41 |
| RAB40A |  |  | 1 | 0.19 | 0.16 | 0.02 |
| RAB40B |  |  |  | 1 | 0.52 | 0.62 |
| RAB40C |  |  |  |  | 1 | 0.36 |
| XLOC_I2_008134 |  |  |  |  |  | 1 |
The previously seen relationships between the genes are still there, though some connections are weaker.
CRISPR
Most of these connections are absent in CRISPR because the genes are either not present in the dataset or do not have a phenotype. Only the PIK3R3-RAB40B connection is present in CRISPR, and it is weak (cosine similarity: -0.11)
KG
Apart from the connection between RAB40A/B/C, none of the other connections are seen in the KG.
These results are the same as what I said in the above comment: https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/4#issuecomment-2105321510
The heatmap shows the percentile of the cosine similarities (1 → similar, 0 → anti-similar). The text is the maximum of the absolute KG score (`gene_mf__go`, `gene_bp_go`, `gene_pathway`). I set a KG threshold (like we previously had) of 0.4: if a connection has a score lower than this threshold, the connection is considered unknown. The KG scores were downloaded from Google Drive: ORF and CRISPR. The diagonal of the heatmap indicates whether a gene has a phenotype (`False` could also mean the gene is not present in the dataset).
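A minimal sketch of the two transforms described above (the function names are mine; the real pipeline may differ):

```python
import numpy as np

def to_percentiles(sims: np.ndarray) -> np.ndarray:
    """Rank-transform cosine similarities into [0, 1]:
    1 -> most similar, 0 -> most anti-similar."""
    flat = sims.ravel()
    ranks = flat.argsort().argsort()  # rank of each entry among all entries
    return (ranks / (flat.size - 1)).reshape(sims.shape)

def is_known(kg_scores: np.ndarray, threshold: float = 0.4) -> np.ndarray:
    """A pair counts as 'known' when its max absolute KG score reaches the threshold."""
    return np.abs(kg_scores) >= threshold
```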
This seems worth finalizing (i.e., re-creating the clusters based on the nearest neighbors of the genes involved, rather than including genes just because they were in the original clusters with old profiles).
Here are the recreated clusters
This latest re-created cluster is from the ORF data only. Is it worth querying the same genes ["INSYN1", "PIK3R3", "RAB40B", "RAB40C"] in the CRISPR data?
Yes! It will be important to comment on whether the same pattern exists or doesn’t exist in the crispr data, so either way we should take a look at how it’s looking.
Cluster of RAB40B and XLOC_I2_008134 has the opposite phenotype of a cluster of PIK3R3 and INSYN1 - googled most connections and got nothing, so it may be novel. Emailed Scott Soderling (Duke) Dec 1 https://mail.google.com/mail/u/0/#sent/QgrcJHsHnNjJtWszNdwGctjDfRPknnWjVHl