RAB40B has the opposite phenotypes of PIK3R3/INSYN1: exploration for MorphMap paper (ORF)

AnneCarpenter commented 11 months ago

Cluster of RAB40B and XLOC_I2_008134 have the opposite phenotypes of a cluster of PIK3R3 and INSYN1 - googled most connections and got nothing so it may be novel. Emailed Scott Soderling Duke Dec 1 https://mail.google.com/mail/u/0/#sent/QgrcJHsHnNjJtWszNdwGctjDfRPknnWjVHl

only 4 papers for INSYN1: Scott Soderling at Duke (plus a genetics lab) scott.soderling@duke.edu Backups
RAB40B - cancer - 19 papers, R Prekeris is most common and one says “invadosome” which may be related to neuron projections (??)
PIK3R3 - cancer - 206 papers XLOC_I2_008134 - nothing in google search

AnneCarpenter commented 11 months ago

Email correspondence so far...

Hello, We've communicated before about spreading the word on faculty openings at Duke. I'm excited because I just discovered your papers about INSYN1. I wonder if our recent results might spark a collaboration?

To summarize a lot of work, we knocked down 8,000 genes one by one in U2OS human cells, and then clustered the genes based on having similar morphological impact (using the Cell Painting microscopy assay that labels major organelles).

We found a tight cluster of RAB40B and XLOC_I2_008134 which have the opposite morphological effects of a cluster of PIK3R3 and INSYN1.

I wonder if all of these connections are well-known already?

If not, we would be delighted to work together if you'd like to design an experiment to followup/confirm the connection and add to a paper we are beginning to write up about the large dataset. It always helps in such papers to make a new discovery that can be confirmed (even if it's not a dramatic finding).

All the best, please let me know if you would like to talk further!

Cheers, Anne

Anne E. Carpenter, Ph.D. (she/her) Institute Scientist & Senior Director, Imaging Platform Fellow: Merkin Institute, AIMBE, SLAS, Massachusetts Academy of Sciences, Royal Microscopy Society Broad Institute of Harvard and MIT 415 Main Street, Cambridge MA 02142

phone: (617) 714-7750 anne@broadinstitute.org http://www.broadinstitute.org/~anne

Anne Carpenter anne@broadinstitute.org Fri, Dec 1, 10:22 AM to scott.soderling

P.s. other interactions include: a connection between INSYN1 and RNF41, STYK1 and HOOK2. And, MYT1 is another gene that looks like it has the opposite impact from INSYN1.

just in case those ring any bells or seem worth pursuing!

Scott Soderling, Ph.D. Sat, Dec 2, 9:58 AM (13 days ago) to me

Hi Anne,

Sounds like a REALLY cool project! We actually don’t have any active projects right now on Insyn1 in the lab. These look like totally new links. Our work on Insyn1 was related to its function in neuronal inhibitory synapses and I don’t recall these coming out in the proximity proteomics from neurons. It looks like STYK1 and PIK3R3 may both be related to the regulation of the PI3K pathway, so that is interesting.

We have some reagents in the lab for Insyn1 and would be happy to send you any of them. We could potentially perform BioID U2OS cells for InSyn1 if that was helpful for you. We have a new method for doing proximity proteomics using CRISPR engineering of the native gene/protein. Or we would also be happy to send you constructs for InSyn1 BioID.

You may already be thinking about this, but we also have a fantastic new faculty member here from the MIT CSAIL that has developed new computational tools for PPI prediction using language models and protein structure (Rohit Singh). Might be interesting to run your clusters at large scale for PPIs. He would probably be interested in collaborating with you.

Best,

Scott

Anne Carpenter anne@broadinstitute.org Sat, Dec 2, 7:31 PM (13 days ago) to Ph.D.

Ah, too bad it's no longer a focus for you. My lab is computational, so sending reagents here won't help :D

Ideally, we'd test some functional relationships among these proteins - I'm not sure we would expect a direct protein protein interaction (at the least we ought to check in PPI databases for existing data to see if that is already known!). Will put that on my list.

Do you know other labs who still focus on InSyn1 function?

AnneCarpenter commented 11 months ago

STATUS: waiting to be sure that these results are actually in the freshest version of the data before proceeding.

AnneCarpenter commented 11 months ago

info provided here now https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/8

tjetkaARD commented 11 months ago

Unfortunately,

[RAB40B vs. PIK3R3/INSYN1] correlation cannot be found within the CRISPR datasets - it is correlation purely seen only in ORFs.

Some additional facts:

Within Morhpological CRISPR data , RAB40B vs. PIK3R3 has negative correlation of ~ (-10%)
Within DepMap CRISPR data, RAB40B vs. PIK3R3 has also negative correlation of ~ (-16%), both in ~top20/top50 anti-correlated genes of each other

AnneCarpenter commented 10 months ago

I Pinged Soderling today to find another lab who studies this, because it sounds like you are confirming the relationship DOES exist in JUMP ORF data.

tjetkaARD commented 10 months ago

Yes, relationship does exists in JUMP ORF:

JUMP ORF:

RAB40B, PIK3R3 and INSYN1 are fdr replicable
RAB40B vs. INSYN1, anti-correlated, coefficient: -0.62
RAB40B vs. PIK3R3, anti-correlated, coefficient: ~ -0.57
RAB40B, INSYN1, PIK3R3 all 'have the phenotype'. There are two separate ORFs perturbations (each with its own replicates) with symbol PIK3R3; both have replicable phenotype.

JUMP CRISPR:

RAB40B is p-value (less certain) replicable only, PIK3R3 is not replicable in terms of phenotype
INSYN1 is not measured in JUMP-CRISPR
RAB40B vs. PIK3R3, insignificant, coefficient: -0.1

AnneCarpenter commented 10 months ago

If we can find someone to do followup experiments, we should check:

are interactions among these genes already known in protein-protein interaction (PPI) networks? This cluster wasn't found through the Evotec route so we don't know for sure that these connections don't already have strong evidence in data/literature such as PPIs.
but, that is not worth doing until we can find someone who can think about experiments to do to follow this up.
I sent a note to Rytis Prekeris at CU Anschutz and also pinged Soderling.
Can @niranjchandrasekaran confirm that XLOC_I2_008134 is not in this ORF cluster of RAB40B, INSYN1, PIK3R3? (I suspect it was in the CRISPR results that we since have abandoned due to fresh versions of the connections)
Is there KG support for these interactions, or is it relatively unknown?

AnneCarpenter commented 10 months ago

@holgerhennig Can your team answer that last Q - is there KG support for these interactions? That will help us know if we should write this up as already-known validation or increase our efforts to pursue followup experiments if it is novel.

holgerhennig commented 10 months ago

@AnneCarpenter We'll look into it, whether there's KG support for these interactions, and get back to you asap

AnneCarpenter commented 10 months ago

Got a response from Rytis in the meantime: Hi Anne,

No, we did not know about potential connection between Rab40b and INSYN1 or PIK3R3. Would be happy to chat with you to see whether we can do couple experiments to confirm the connections. Let me know what times work for you next week and I can set up zoom link.

How about other Rab40 isoforms (Rab40b and Rab40c)? Do they cluster as well? From our experience, there is substantial amount of functional redundancy between Rab40a, Rab40b and Rab40c. Rytis

We will schedule a meeting and depending whether Alan or Niranj can attend I can ask one of them to look up the other Rab40's.

AnneCarpenter commented 10 months ago

I will meet with Rytis Jan 24th to brainstorm followup experiments.

@tjetkaARD can you please edit your comment above? I think it's a typo because two lines say "RAB40B vs. INSYN1". Given we are only looking at ORFs here I think my previous query to Niranj about XLOC_I2_008134 is now irrelevant but if you can confirm XLOC_I2_008134 is not in the ORF data that would be helpful! He also wanted to see if we have data for Rab40b and Rab40c (ORF or CRISPR?)

It would be great to show him a heatmap of all of these genes (while marking which ones pass our threshold for "has a phenotype"). Probably for ORF data only but if it's easy to make CRISPR we can do that just to be complete, in case Rab40b or c is interesting here.

AnneCarpenter commented 10 months ago

@afermg Given this finding is in ORF data only, it would be great to use your web tool to look at images and most-similar/anti-similar genes. Please LMK when it's available. You're welcome to join the Jan 24 meeting as well if you like, similar to that FOXO one we had recently.

auranic commented 10 months ago

@AnneCarpenter @holgerhennig For the three genes PIK3R3, INSYN1, RAB40B we do not find "unsupervised" explanations in KG for any pair of these genes, so any connection between them is new from KG perspective. Of note, INSYN1 is not in the list of 4850 genes having a signal in ORF. As for XLOC_I2_008134, I can not find any trace of this gene (or pseudogene?) identity, can you point to its NCBI Gene ID?

AnneCarpenter commented 10 months ago

I copied that one from an early heatmap (that likely is 'wrong' / outdated) so I guess let's abandon it. I believe someone else tried to find it in our data and the reagent doesn't exist (making it possibly a typo). But anyway if we are not seeing it as a top-similar gene in ORF or CRISPR we can abandon it. Thank you for checking!

tjetkaARD commented 10 months ago

@AnneCarpenter Answering in order:

I edited the comment above. In fact, the second line was to indicate anticorrelation against PIK3R3 - typo
@auranic XLOC_l2_008134 is in the ORF under the ID: ccsbBroad304_14513 (according to orf meta file) I cannot find any reasonable functional definition of it. I think we should just discard it.
We do have RAB40A/B/C in both datasets. They are all:
- replicable in ORF data
- not replicable in CRISPR data

Hence, showing correlation plot for ORFs (all replicable):

orfs_heatmap_cosine_chosen_RAB_

and correlation plot for CRISPRs (only RAB40B p-value replicable):

crispr_heatmap_cosine_chosen_RAB_

@niranjchandrasekaran - could you please check the case of INSYN1? According to the uploaded file to the github: https://github.com/jump-cellpainting/morphmap/blob/main/05.retrieve-orf-annotations/output/replicate-retrieval-mAP-transformed-inf-eff-filtered.csv.gz it is replicable:

Unless Niranji points that the above file is incorrect one, all the genes in the ORFs heatmap are replicable.

afermg commented 10 months ago

@AnneCarpenter The ORF data interface is now available on broad.io/orf. The easiest way to search for things is querying genes on the search box, but you can also run queries by editing the gene name in the following URL: https://lite.datasette.io/?install=datasette-json-html&parquet=https://zenodo.org/api/records/10542737/files/orf.parquet/content#/data/content?Gene%2FCompound__exact=RAB40B&_sort=Distance

Just replace "RAB40B with other genes of interest.

AnneCarpenter commented 10 months ago

AMAZING! I wish Github allowed gifs to express Kermit-the-frog level excitement https://giphy.com/clips/buzzfeed-buzzfeed-celeb-the-muppets-find-out-which-muppet-they-really-are-zTF0aDwhF239JQzIXw

We talked about shifting distance back to the -1 to +1 similarity metric that we've always used... but now I see values of 1300 for RBA40B so I'm wondering if what you're showing is not just our correlation metric * 1000?

afermg commented 10 months ago

Yes, actually I started using the metric Alex uses (cosine distance), ranging from 0 to 2 to be consistent his code. I was testing if it made a big difference to keep it in 1e3 units, but it doesn'tlook ideal. I'll set it back to decimals, but it will contain 10 digits (we can't limit them because it comes from the lite.datasette internal code. I'll update it and set a reminder here.

afermg commented 10 months ago

Hi @tjetkaARD, just to clarify; which correlation metric did you use for the analysis? Thanks!

AnneCarpenter commented 10 months ago

Maybe even more to the point: where is the file with the similarity metrics that we currently recommend using for this paper?

afermg commented 10 months ago

Here is the link to Niranj's comment pointing to the latest version: https://github.com/jump-cellpainting/morphmap/issues/17#issuecomment-1750808647. That is the one I used. I still have the impression that using correlation (e.g., rank vs cosine distance) will change the ranking order.

AnneCarpenter commented 10 months ago

Alán and I met with Rytis and Andrew Neumann today! To summarize next steps once Tomasz confirms the location of the data being used for all the heatmaps:

@niranjchandrasekaran - see "could you please check the case of INSYN1..." Q above
@afermg - check the data source Tomasz points to relative to your broad.io/orf and crispr sites to be sure they're generating the expected rank ordered similarities; this will tell us what genes to convey to the collaborators (if not the original cluster shown above). Would also be nice to generate the full list of similar/opposite genes for each gene here, esp for RAB40A which is not like the others here.
@afermg - generate a list of top morph features that distinguish RAB40B/C from INSYN1/PIK3R3 (or whatever the cluster is) and maybe grab images that demonstrate the phenotype if it's obvious/interpretable.
then Rytis' lab will check if the morphology of knocking down (using siRNA) is similar/opposite for the genes of interest, vs RAB40A/B/C triple knockout cell line (They usually use triple neg breast cancer cells MDA… )

Side notes for when we write it up:

Rytis points out that our ‘negative’ result (no phenotype for CRISPRing RAB40A/B/C is not surprising bc they have to KD all three of them to see any effect in their hands, given their functional redundancy, that is cool)
He also notes that RAB40A only in higher primates (B&C were around longer in evolution) so it is sensible that B/C show more similarity than to A
it would be more fun if the experimental confirmation were something quite distinct from morphology that got more to the mechanism of the relationship, however, morphology is fairly likely to work because RAB40's are involved in rho pathway that is involved in motility/trafficking so it's a pretty safe bet they will confirm things here. It won't be a simple repeat of our original experiment because it's in a different cell type, and using siRNA instead of ORF. Not the most dramatic, but still a nice confirmation if it pans out.

tjetkaARD commented 10 months ago

@afermg - we do use cosine similarity in all computetions as agreed with Niranji. Regarding the cosine vs. correlation - we can do a comaprison on small subset.

@AnneCarpenter - the general cosine-similarity matrix is too big to be stored on git (Hence it is ignored in commits). I talked with @niranjchandrasekaran to establish some place to put it, but I did not get any guidelines. Temporarily, till Monday, you can use the following link https://drive.google.com/drive/folders/1ViuT6pU7PWcjRUVcfb80KN6zXRTdIEwP?usp=sharing

Afterwards, we will clean the whole project and define some stable links.

afermg commented 10 months ago

@tjetkaARD Thanks for the info. Could you just check that the checksum matches? Our filename is transformed_inf_eff_filtered.parquet, with checksum e1ef247bf4725f35ab8e4793e6289600bdc71ff9. Also, do you have the code you used for cosine similarity somewhere public? That would make it much easier to find the source of discordance. Thanks

afermg commented 10 months ago

And in case it is of interest, our implementation is shown on https://github.com/broadinstitute/monorepo/blob/5aaa6073612515e1c7a096efef4708a9c6945ce1/libs/jump_rr/src/calculate.py#L96.

AnneCarpenter commented 10 months ago

small note: i don't think we need to do a comparison between cosine similarity and correlation: we've done so in the past (tho I don't recall the outcome!) We should stick with the cosine similarity for consistency. @afermg If Qs remain, we may need to wait for Shantanu to be available next week, since Niranj is out.

afermg commented 10 months ago

@afermg - generate a list of top morph features that distinguish RAB40B/C from INSYN1/PIK3R3 (or whatever the cluster is) and maybe grab images that demonstrate the phenotype if it's obvious/interpretable.

The clusters/correlations were generated using the non-interpretable features. Looking at all features for all pairs of perturbations is too computationally expensive and uncertain to give meaningful results. What I have done to produce broad.io/orf_feature is selecting the N (25) most distinct perturbations for every grouped feature set (e.g., I group all Texture_Granularity_8 for each mask, such as Nuclei, and channel, such as DNA). You can see the query that searches for distinctive features for all four genes here. I do not find any outstanding feature for INSYN1/PIK3R3, but I will wait for confirmation that we are using the same dataset before diving deeper into this, or doing the explicit comparison for these four perturbations.

AnneCarpenter commented 10 months ago

Great!

Oh! I see what you've done now, you are solving the problem: for a given gene, what are the top features that distinguish it from neg controls? And you find 2 features ecah for RAB40B and C but none for the other two genes, is that right? I think I may be mistaken though!

I think it would be fine to average the two genes within each cluster first, then do a comparison so that final comparison is only pairwise: RAB40B/C averaged profile vs INSYN1/PIK3R3 averaged profile and find those top features.

I understand the dilemma that this may not be a thing that can be in a portal because it's too much to compute in advance. That's ok, we are just doing a few vignettes for the paper for now and can think about what to make available more broadly later.

(also fine to wait until the data settles, although this is ORFs which we are more confident in vs CRISPRs)

afermg commented 10 months ago

That is right. My take is that, although the perturbations may be directly comparable, it is likely that the difference is due to batch effects, as opposed to biology. I would expect their distinctive features to be more reliable when compared to the negative control.

I performed the analysis and pushed the resulting difference in features here. That is, median(RAB40*)-median(INS,PIK). The numbers are there, but they do not mean much without comparing them to perturbations that show no phenotype. I'll leave it for now to get input from someone who has found a decent way to find which feature differences are significant.

AnneCarpenter commented 10 months ago

Hm, I was going to confidently say not to worry because the batch correction was pretty decent for this dataset... but then looked at your features and "total intensity" features being top ranked would certainly support your hypothesis! Unless maybe you are using uncorrected profiles; you had pointed to the Oct 2023 ORF profiles and it's plausible that those are the latest/correct (and corrected!) ones.

I think you're comparing to neg controls on the same plate ,and that's why you think it will be less batch-corruptible?

Are the units here St dev's? So like Image_ImageQuality_TotalIntensity_OrigER is 5,000 stdev's lower for one cluster vs the other?

(and we only want RAB40B and C not A, but I'm sure that's not the only problem here!)

afermg commented 10 months ago

The profiles are corrected until the point before Harmony, provided by John here. If they were corrected with in-plate negcons a priori then we should not need to subtract that again. I'm not sure what was the correction method, but we probably need some input/code on how this particular dataset was corrected.

Are the units here St dev's? So like Image_ImageQuality_TotalIntensity_OrigER is 5,000 stdev's lower for one cluster vs the other?

I am unsure, I don't have much info on the correction methods, beyond them being "cell count adjustment and mean correction". The numbers do feel too high to be standard deviations.

(and we only want RAB40B and C not A, but I'm sure that's not the only problem here!)

Yeah, that's just B and C, I should probably specify somewhere.

AnneCarpenter commented 10 months ago

Ah, I see - good point; our batch correction makes the feature names non-meaningful, so if we are looking at feature names, by definition they are not batch corrected.

I think we need to call in support from @shntnu to help us figure if there's any way to readily answer the Q "what phenotype does this RAB40B/C cluster have? (esp in relation to PIK3R3/INSYN1)" (in addition to perhaps answering the query we gave Niranj above)

shntnu commented 10 months ago

for a given gene, what are the top features that distinguish it from neg controls? And you find 2 features ecah for RAB40B and C but none for the other two genes, is that right? I think I may be mistaken though!

I don't think that's what @afermg is doing here; he is doing this:

selecting the N (25) most distinct perturbations for every grouped feature set

I think he is doing top and bottom 25

e.g. this query will return 50 results, as will a query for every feature.

This is not the same as "for a given gene, what are the top features that distinguish it from neg controls"

Regarding the quandary about non-interpretable features, we concluded here that we have no choice https://github.com/jump-cellpainting/morphmap/issues/17#issuecomment-1883721429

so we need to stick with the non-harmony-corrected features for this analysis, if we do need features names.

It sounds like what we want to do here is:

Set 1: all replicates of all perturbations in group 1
Set 2: all replicates of all perturbations in group 2

Do a Mann-Whitney test to find features that are different (then correct for multiple testing)

🤖 code below that would go at the end of compare_features.py

import scipy.stats as stats

import re

features = [col for col in clusters.columns if not re.match("^Metadata.*$", col)]

results = []

for feature in features:
    cluster_A = clusters.filter(pl.col("Metadata_Cluster") == "RAB40B/C")[feature]
    cluster_B = clusters.filter(pl.col("Metadata_Cluster") == "INS/PIK")[feature]

    # Use t-test or Mann-Whitney U test based on data distribution
    t_stat, p_val = stats.ttest_ind(cluster_A, cluster_B, equal_var=False)

    results.append({"Feature": feature, "T-Statistic": t_stat, "P-Value": p_val})

import pandas as pd

results_df = pd.DataFrame(results)

# drop rows with NaN
results_df = results_df.dropna()

# Apply multiple testing correction
from statsmodels.stats.multitest import multipletests

corrected_pvals = multipletests(results_df["P-Value"], method="fdr_bh")[1]
results_df["Corrected P-Value"] = corrected_pvals

# Filter significant features
significant_features = results_df[results_df["Corrected P-Value"] < 0.05]

AnneCarpenter commented 10 months ago

Yes, the latter would be ideal "what are the features that most distinguish these two groups of genes".

The other Q: "for a given gene, what are the top features that distinguish it from neg controls" is another thing we often want (and would be ideal to be available in a web tool eventually for all genes). It could be a sort-of shortcut to answer the above question in a pinch but it's messier and for this one vignette, it's ideal to instead do the mann whitney style approach you suggest.

afermg commented 10 months ago

I incorporated Shantanu's approach, it yields 1040 features with a significant (corrected) p-value, you can explore them here. I could not find a particularly overrepresented feature set at a glance, I am doubtful as to how to define which features are important and which are not. Does ranking these features by statistical significance mean anything relevant in this context?

In regards to the questions of comparing gene features relative to their negative control, I was under the assumption that one of the correction methods made things relative to their respective negcons, or does this only happen during whitening/sphering? The feature-querying datasette interface works under the assumption that features are corrected based on their corresponding negative controls, and thus comparable/rankable. Two pair of groups may have very different features, but unlinking these from batch effects while retaining interpretability is the tricky step I've figured.

---- (adding some questions necessary to know which images/tests may useful for finding bio significance)

Regarding the question "for a given gene, what are the top features that distinguish it from neg controls": to group perturbations individually we usually combine multiple plates/batches. Is a MW test on the multi-plate control and perturbation pairs of sets enough to find significant features? Can we actually know if we are distinguishing plate effect from single-feature significance?

I could have a look at the ~50 (flattened) images + ~150-250(?) controls for each perturbation, but so far it hasn't been a successful approach. I'll think of a different analysis to perform if it yields a more manageable number of features, but let me know if there is any low-hanging fruit analysis to perform with the current datasets.

AnneCarpenter commented 10 months ago

Let's chat in checkin!

AnneCarpenter commented 10 months ago

In the meantime, here are the top 38 features in terms of T statistic (which is reasonably correlated with the p values)

If I had to summarize, this seems like a case where the localization of a lot of stains is changing and thus seems to be a pretty broad phenotype... many of these are the distribution of all organelles relative to the cell edge. But cell size and related measures are not super high on the list so it seems more about shifting localization than major changes in cell shape (other than the nucleus - that has some shape change features prominent).

The images for the other cluster members don't come up in broad.io/orf for RAB40B as top-ranked so I can't look at images directly to affirm these thoughts.

shntnu commented 10 months ago

I incorporated Shantanu's approach, it yields 1040 features with a significant (corrected) p-value, you can explore them here.

Please inspect the code closely; it was bot-generated. Looks right to me but I don't know polar and rarely use statsmodels so I can't be confident of the code.

I could not find a particularly overrepresented feature set at a glance, I am doubtful as to how to define which features are important and which are not. Does ranking these features by statistical significance mean anything relevant in this context?

Yes ranking by corrected p-value is reasonable. If you had different number of samples per test, it would be less straightforward because there wouldn't be a 1-1 between effect size and p-value, but that's not an issue here.

In regards to the questions of comparing gene features relative to their negative control, I was under the assumption that one of the correction methods made things relative to their respective negcons, or does this only happen during whitening/sphering?

You are correct – the normalized features are relative to the negcons.

However you would still need to compare group 1 (perturbation replicates) with group 2 (control replicates, from the plates that the perturbation replicates come from) to get a list of features that are signficantly different between the two groups.

The feature-querying datasette interface works under the assumption that features are corrected based on their corresponding negative controls, and thus comparable/rankable.

This is a valid assumption.

Regarding the question "for a given gene, what are the top features that distinguish it from neg controls": to group perturbations individually we usually combine multiple plates/batches. Is a MW test on the multi-plate control and perturbation pairs of sets enough to find significant features? Can we actually know if we are distinguishing plate effect from single-feature significance?

We are pooling across plates for both, perturbations AND controls, so the chances of a type I error (false positive) is low. So if you see a significant feature, it is unlikely it is a plate effect. I have not thought about this too deeply so please ponder if it seems amiss.

afermg commented 10 months ago

As per check-in and my meeting with Shantanu, the approach to this will be to include a p-corrected statistic and group associated sets of features:

ORDER OF CORRECTION:
- C_G(G_F(FDR(P_g(F))) Meaning
- Calculate the p value of all occurrences of feature F (P_g)
- then correct the p value using the False Discovery Rate (FDR)
- Group features based on their hierarchy (G_F)
- Combine the corrected p values of these groups (C_G)

shntnu commented 10 months ago

Clarification:

Calculate the p value of all features
then adjust the p value to account for multiple testing https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html (fdr_bh)
Group features based on their hierarchy and compute combined p-value per group using https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.combine_pvalues.html (fisher); this will give you a p-value per group
then correct the p values from step 5 (fdr_bh)

afermg commented 10 months ago

then correct the p values from step 5 (fdr_bh)

Step 2 I take it?

shntnu commented 10 months ago

Sorry, step 3. Feel free to tag Alex for his input when you've written it up.

AnneCarpenter commented 9 months ago

I believe this issue is still waiting for three things (though the 2nd one is for ORF only, no crispr analysis needed at least for now because those profiles aren't finalized and we're pretty sure there aren't phenotypes there due to isoforms):

@niranjchandrasekaran - see "could you please check the case of INSYN1..." Q above @afermg - check the data source Tomasz points to relative to your broad.io/orf and crispr sites to be sure they're generating the expected rank ordered similarities; this will tell us what genes to convey to the collaborators (if not the original cluster shown above). Would also be nice to generate the full list of similar/opposite genes for each gene here, esp for RAB40A which is not like the others here. @afermg - generate a list of top morph features that distinguish RAB40B/C from INSYN1/PIK3R3 (or whatever the cluster is) and maybe grab images that demonstrate the phenotype if it's obvious/interpretable.

afermg commented 9 months ago

I don't think we are certain that we are using the same ORF dataset, I pointed towards it here. The simplest way to do so is for someone on the other side to check if the checksum of their file is e1ef247bf4725f35ab8e4793e6289600bdc71ff9, as mentioned above.

niranjchandrasekaran commented 6 months ago

connections

Here how these connections look like with the newest set of ORF and CRISPR profiles

ORF

	INSYN1	PIK3R3	RAB40A	RAB40B	RAB40C	XLOC_l2_008134
INSYN1	1	0.26	-0.11	-0.42	-0.23	-0.36
PIK3R3		1	0	-0.34	-0.21	-0.41
RAB40A			1	0.19	0.16	0.02
RAB40B				1	0.52	0.62
RAB40C					1	0.36
XLOC_l2_008134						1

The previously seen relationships between the genes are still there, though some connections are weaker.

CRISPR

Most of these connections are absent in CRISPR because they are either not present in the dataset or do not have a phenotype. Only PIK3R3-RAB40B connection is present in CRISPR but the connection is weak (cosine similarity: -0.11)

KG Apart from the connection between RAB40A/B/C, all the other connections are not seen in the KG.

niranjchandrasekaran commented 6 months ago

Notebook

These results are the same as what I said in the above comment: https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/4#issuecomment-2105321510

The heatmap shows the percentile of the cosine similarities (1 → similar, 0 → anti-similar). The text is the maximum of the absolute KG score (gene_mf__go, gene_bp_go, gene_pathway). I set a KG threshold (like we previously had) of 0.4. If connections have a score lesser than this threshold, then the connection is considered to be unknown. The KG scores were downloaded from Google Drive: ORF and CRISPR. The diagonal of the heatmap indicates whether a gene has a phenotype (False could also mean the gene is not present in the dataset).

ORF

ORF-connections-INSYN1-PIK3R3-RAB40A-RAB40B-RAB40C-XLOC_l2_008134

CRISPR

CRISPR-connections-INSYN1-PIK3R3-RAB40A-RAB40B-RAB40C-XLOC_l2_008134

AnneCarpenter commented 5 months ago

This seems worth finalizing (i.e. re-creating the clusters based on what are the nearest neighbors of the genes involved rather than including genes just because they were in the original clusters with old profiles.)

niranjchandrasekaran commented 4 months ago

Notebook

Here are the recreated clusters

ORF-connections-HOXC8-INSYN1-NRBP1-PIK3R3-RAB40B-RAB40C-ZFP36L1

jessica-ewald commented 4 months ago

This latest re-created cluster is from the ORF data only. Is it worth querying the same genes ["INSYN1", "PIK3R3","RAB40B", "RAB40C"] in the CRISPR data?

AnneCarpenter commented 4 months ago

Yes! It will be important to comment on whether the same pattern exists or doesn’t exist in the crispr data, so either way we should take a look at how it’s looking.

broadinstitute / 2023_12_JUMP_data_only_vignettes

RAB40B has the opposite phenotypes of PIK3R3/INSYN1: exploration for MorphMap paper (ORF) #4

ORF

CRISPR