Find gene connections [ORF+CRISPR both] to pursue: exploration for MorphMap paper

AnneCarpenter commented 9 months ago

Here is the updated list they provided Dec 15 2023.

We are generally interested to find gene pairs where:

there is a strong Cell Painting ORF pos or neg correlation (column E: ORF_similarity_abs)
there is NOT a strong knowledge graph connection (there are 4 KG-only models, columns F-I)

We already found many connections between SLC and OR gene families and will pursue those (https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/6) but we would like more.

Here is the email thread in Anne's email "Broad/Evotec collaboration on MorphMap & knowledge graphs" https://mail.google.com/mail/u/0/#inbox/FMfcgzGtwDFhnZvdwPWGMLLbdpgNZBSM with the Excel file

Here are the meeting notes: https://docs.google.com/document/d/1iIwJ1V5ig8KtTD7P0vV-GH16f-AvprqU/edit

MorphMap_gene_gene_scoring_data.xlsx

tjetkaARD commented 8 months ago

In addition to the above data, I am attaching the above Excel file with an additional sheet that also includes CRISPR similarity: edited - see file in link https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/7#issuecomment-1901252123

There are several columns added, primarily:

CRISPR_similarity: the value of cosine similarity between CP profiles
crispr_status: specifying the status of the gene pair in the dataset (replicable / not replicable / not present)
coexpression in RNA-seq data
correlation from pooled CRISPR KOs (DepMap)
strength in STRINGdb knowledge graph

All added columns have a short explanation above the column name.
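For readers unfamiliar with the CRISPR_similarity column, here is a minimal sketch of cosine similarity between two profiles. It is an illustration only, not the code used to build the sheet; the function name `cosine_similarity` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of features")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return float("nan")  # undefined when a profile is all zeros
    return dot / (norm_a * norm_b)


# Toy profiles (hypothetical values, not real Cell Painting features)
profile_a = [1.0, 2.0, -1.0]
profile_b = [-1.0, -2.0, 1.0]

print(cosine_similarity(profile_a, profile_a))  # identical direction
print(cosine_similarity(profile_a, profile_b))  # opposite direction
```

The value ranges from -1 (opposite profiles) to 1 (same direction), which is why the sign matters when comparing ORF and CRISPR results.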

First look insights (pairs with significant correlation both in ORF and CRISPR and without strong KG evidence):

Edited: see the top pairs in comment : https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/7#issuecomment-1901252123.

AnneCarpenter commented 8 months ago

Awesome! For your filter for "strong similarity in both CRISPR and ORF" - did you require that they are the same direction, or did you take absolute value? I see all of these pairs are positive correlations for both (except one pair is neg for both) so I wondered if your filtering would have allowed a strong neg in one and pos in the other to come through?

It would be great to see a heatmap of the correlations among this set of ~15 genes for CRISPR and another heatmap for ORF because it appears they are actually mostly falling into a few blobs rather than 15 very independent relationships.

tjetkaARD commented 8 months ago

> Awesome! For your filter for "strong similarity in both CRISPR and ORF" - did you require that they are the same direction, or did you take absolute value? I see all of these pairs are positive correlations for both (except one pair is neg for both) so I wondered if your filtering would have allowed a strong neg in one and pos in the other to come through?

In fact, I did allow for any direction of relationship. Specifically, I took: edited: see methodology in https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/7#issuecomment-1901252123

and afterwards filtered against the knowledge graphs (requiring the average of the [CC,MF,PT,BP] KG scores to be below 0.4). It seems that most of the pairs with inconsistent direction between CRISPR and ORF (~30-40% of all pairs in the table) are filtered out by the KG condition.
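A minimal sketch of this kind of filter, under stated assumptions: the similarity cutoff (`sim_cutoff`) is a placeholder because the exact selection rule lives in the linked comment, and the row layout, key names, and values below are hypothetical rather than taken from the Excel file. Only the KG condition (average of CC, MF, PT, BP below 0.4) and the use of absolute values come from the description above.

```python
from statistics import mean

KG_MODELS = ("CC", "MF", "PT", "BP")


def select_pairs(rows, sim_cutoff, kg_cutoff=0.4):
    """Keep pairs that are strong in ORF and CRISPR (any sign) and weak in the KG."""
    selected = []
    for row in rows:
        strong_orf = abs(row["orf"]) >= sim_cutoff
        strong_crispr = abs(row["crispr"]) >= sim_cutoff
        kg_mean = mean(row["kg"][model] for model in KG_MODELS)
        if strong_orf and strong_crispr and kg_mean < kg_cutoff:
            # Record whether ORF and CRISPR agree in sign; it is not used as a filter.
            same_direction = (row["orf"] > 0) == (row["crispr"] > 0)
            selected.append(
                {"pair": row["pair"], "kg_mean": kg_mean, "same_direction": same_direction}
            )
    return selected


# Hypothetical gene pairs and scores, for illustration only
pairs = [
    {"pair": ("A", "B"), "orf": 0.62, "crispr": 0.55,
     "kg": {"CC": 0.1, "MF": 0.2, "PT": 0.3, "BP": 0.2}},
    {"pair": ("C", "D"), "orf": 0.70, "crispr": -0.60,
     "kg": {"CC": 0.1, "MF": 0.1, "PT": 0.2, "BP": 0.2}},
    {"pair": ("E", "F"), "orf": -0.65, "crispr": 0.58,
     "kg": {"CC": 0.9, "MF": 0.8, "PT": 0.7, "BP": 0.6}},
    {"pair": ("G", "H"), "orf": 0.20, "crispr": 0.70,
     "kg": {"CC": 0.1, "MF": 0.1, "PT": 0.1, "BP": 0.1}},
]

hits = select_pairs(pairs, sim_cutoff=0.5)
for hit in hits:
    print(hit["pair"], round(hit["kg_mean"], 2), hit["same_direction"])
```

Keeping `same_direction` as a column rather than a filter makes it easy to count how many opposite-sign pairs survive the KG condition.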

If we were to take only the top 100 pairs with respect to absolute correlations, we would get the following results (only those with inconsistent direction are shown; the others are similar): edited: see results in https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/7#issuecomment-1901252123

To be precise: I needed to edit the previous post and table and add two rows because I had omitted the filter on the MF-based KG score. So there is now one pair with inconsistent directions in the previous procedure.

Heatmaps - in progress.

tjetkaARD commented 8 months ago

Heatmaps corresponding to the table:

Edited: see heatmaps in https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/7#issuecomment-1901252123

I see only two repeating clusters:

GPR176, TSC22D1, DPAGT1, CHRM4
ISOC2, ECH1, UQCRFS1, BCAT2, SARS2

AnneCarpenter commented 8 months ago

Anne will examine these two plots and choose gene pairs to follow up experimentally with collaborators. @tjetkaARD will create exactly these plots but without the constraint that pairs be correlated in BOTH ORF and CRISPR. He will start a new thread with those and, at the least, those will be in the paper. Anne may also identify candidates for vignettes among those.

tjetkaARD commented 8 months ago

I have updated the above plots.

Unfortunately, I do not have the full KG data for all pairs - only for the top ones - as in the original Excel files. So the annotations are sparse. Alternatively, I can plot the average KG score instead of letters.

AnneCarpenter commented 8 months ago

Ah, ok, I will ask Evotec if they can provide that, although maybe we only need this for our own exploration and it isn't necessary for the paper and what we have is enough for exploration. I will think about this when I dive into looking at these connections. Thanks!

AnneCarpenter commented 8 months ago

(I've asked - and BTW it would be even better to show the actual value (average of KG columns) on the heatmap so we have a sense of the strength of the scores.)

cyrenaique commented 8 months ago

Sorry for asking, but is this Evotec KG different from the stringDB PPI data? If not, I have some code to get values from a list of genes... just in case it's needed.

AnneCarpenter commented 8 months ago

Thanks for offering! But indeed the Evotec KG is very different; it combines many sources of info (including PPI but also others).

AnneCarpenter commented 8 months ago

From Andrey Zinovyev of Evotec:

Hi Anne,

Thank you for this information, very exciting to see the progress along several lines!

Here is a folder with some materials that I hope can address most of your requests https://drive.google.com/drive/folders/1kKqx5B9VJGq47yN03P7z8CikHOWyWMwg?usp=sharing

It contains :

All scores from KG models (orf_scores_merged.zip file) merged with QC filtered ORF scores that Niranj sent to us on Monday. This merging does not contain the CRISPR-derived scores, but we can add them as well as other columns : however, I do not seem to have access to this github https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/7 . Just in case my nickname is ‘auranic’
Top filtered links with large ORF scores and small KG scores (toplinks_unexplained.xls file), according to the unsupervised ‘Biological process’ model. For example, some of the CYP* connections I saw in the heatmaps in your PDF are indeed there.
Powerpoint presentation with some analysis results, including the analysis of SLC/OR pairs and genome-wide (but restricted to the QC filtered genes) scatterplots (ORF sim vs KG score). Also, pay attention to the network figures at the end, where we highlight some of the clusters of “unexplained links” (including SLC/OR-related ones but also others with, for example, CYP* genes as hubs).

Please note that we decided to change the functional scoring of KG relations from the L2 percentile-based to Pearson correlation, it appears to be more interpretable in the end but does not strongly affect the gene pair selection. Also of note, so far there is no confidential data used in this work, all is based on publicly available knowledge graph analysis.

If any explanations are needed, we will be happy to connect via email or a call. Best regards,

Andrei

tjetkaARD commented 8 months ago

@AnneCarpenter I will take care of it and the full plots today - sorry, the last two days were crazy busy.

@cyrenaique regarding stringDb - in fact, I have already merged it into the Excel file shared in the comment https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/7#issuecomment-1879032101 (last column). But the Evotec KG is much more comprehensive/sensitive.

Edit: in progress, trying to figure out inconsistencies with the previous list together with Niranj.

AnneCarpenter commented 8 months ago

Yes - I can elaborate on the Evotec knowledge graph: They take existing annotated sources (biological processes, pathways, molecular functions) as ground truth to train the graph (which is based on lots of underlying data sources) to properly predict those connections.

cyrenaique commented 8 months ago

Thanks Anne for the clarification. https://pubmed.ncbi.nlm.nih.gov/36370105/ It seems that stringdb also updated their way of computing/predicting scores in 01/2023, interesting...

AnneCarpenter commented 7 months ago

The above connections were filtered as being strong in both ORF and CRISPR. For ORF-only or CRISPR-only connections, we move to new issues: #11 for ORFs and soon a new one for CRISPRs when @tjetkaARD is ready.

I think we should pursue the two clusters that @tjetkaARD noted above - these have strong (+/-) correlation in both ORF and CRISPR but are not (completely) strongly connected in the KG, so I am making new issues for these:

GPR176, TSC22D1, DPAGT1, CHRM4: #15
ISOC2, ECH1, UQCRFS1, BCAT2, SARS2: #16

(this issue can be closed as soon as @tjetkaARD makes the new issue for CRISPR-only connections)

tjetkaARD commented 7 months ago

@AnneCarpenter

Unfortunately, we need another iteration for this issue. There have been two relevant changes affecting the final output:

The Knowledge Graph methodology changed
In the Excel file, shared here: https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/7#issue-2044375145, the ORFs are not filtered according to their replicability (only correlation strength)

Fortunately, it does not impact the qualitative conclusions much (see the last section). However, in order to clean up everything and avoid any confusion, I will edit all the above comments to link to the confirmed and most recent results below.

Methodology

Replicability:
- ORFs: only q-value replicable genes are included (based on https://github.com/jump-cellpainting/morphmap/blob/24839193460b9107e09bbf0480e50ee9faef4698/05.retrieve-orf-annotations/output/replicate-retrieval-mAP-transformed-inf-eff-filtered.csv.gz)
- CRISPRs: I will present results separately for both q-value and p-value replicable genes (due to very small intersection between ORFs and CRISPRs)
Knowledge Graph filter: based on file shared in: https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/7#issuecomment-1884777462, gene pairs are included if:
- average of (gene_mf, gene_bp, gene_pathway) is below 0.5 AND max of (gene_mf, gene_bp, gene_pathway) is below 0.8
Choice of top pairs:
- Top 8 pairs according to sum of absolute orf/crispr similarities AND
- Top 8 pairs according to sum of scaled absolute orf/crispr similarities AND
- Top 4 pairs from each quadrant (to allow associations of different signs) according to sum of absolute orf/crispr similarities
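A rough sketch of the KG filter and the top-pair selection described above. Several things here are assumptions: the "AND" between the selection rules is read as a union, the row layout and values are hypothetical, and the "scaled" ranking is left out because the scaling is not specified in this comment. The key names `gene_mf`, `gene_bp`, and `gene_pathway` follow the names used above.

```python
def passes_kg_filter(row, mean_cutoff=0.5, max_cutoff=0.8):
    """KG filter: mean of the three KG scores below 0.5 AND max below 0.8."""
    scores = [row["gene_mf"], row["gene_bp"], row["gene_pathway"]]
    return sum(scores) / len(scores) < mean_cutoff and max(scores) < max_cutoff


def strength(row):
    """Sum of absolute ORF and CRISPR similarities."""
    return abs(row["orf"]) + abs(row["crispr"])


def quadrant(row):
    """Sign pattern of (ORF, CRISPR), so opposite-sign associations get their own bucket."""
    return (row["orf"] >= 0, row["crispr"] >= 0)


def top_pairs(rows, k_overall=8, k_quadrant=4):
    """Union of the overall top-k pairs and the top-k pairs within each quadrant."""
    ranked = sorted((r for r in rows if passes_kg_filter(r)), key=strength, reverse=True)
    chosen = {r["pair"] for r in ranked[:k_overall]}
    per_quadrant = {}
    for r in ranked:  # ranked is already sorted, so the first k per quadrant are its top k
        q = quadrant(r)
        if per_quadrant.get(q, 0) < k_quadrant:
            per_quadrant[q] = per_quadrant.get(q, 0) + 1
            chosen.add(r["pair"])
    return chosen


# Hypothetical rows, for illustration only
rows = [
    {"pair": ("G1", "G2"), "orf": 0.90, "crispr": 0.80, "gene_mf": 0.1, "gene_bp": 0.2, "gene_pathway": 0.3},
    {"pair": ("G3", "G4"), "orf": 0.70, "crispr": -0.60, "gene_mf": 0.2, "gene_bp": 0.2, "gene_pathway": 0.2},
    {"pair": ("G5", "G6"), "orf": 0.95, "crispr": 0.90, "gene_mf": 0.1, "gene_bp": 0.1, "gene_pathway": 0.9},
    {"pair": ("G7", "G8"), "orf": 0.50, "crispr": 0.40, "gene_mf": 0.3, "gene_bp": 0.3, "gene_pathway": 0.3},
    {"pair": ("G9", "G10"), "orf": 0.90, "crispr": 0.90, "gene_mf": 0.6, "gene_bp": 0.6, "gene_pathway": 0.6},
]

print(sorted(top_pairs(rows, k_overall=1, k_quadrant=1)))
```

The per-quadrant pass is what lets a strong opposite-sign pair through even when it would not rank in the overall top list.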

Summary

In terms of intersected ORF/CRISPR replicable genes:

Q-value replicable ORFs vs. Q-value replicable CRISPR: 284 common genes
Q-value replicable ORFs vs. P-value replicable CRISPR: 673 common genes

Plots of CRISPR vs. ORF similarities & Distribution of KG mean score (Q-value replicable ORFs vs. Q-value replicable CRISPR)

The above procedure gives:

q-val replicable CRISPRs: 23 gene pairs; 30 unique genes
p-val replicable CRISPRs: 28 gene pairs; 37 unique genes

Plot of CRISPR vs. ORF similarities, with the top unknown pairs (strongest signal in both ORF and CRISPR) annotated: [figure: scatter_similarity_orf_crispr_q_replicable_annot]

Source files:

Merged orf, crispr, kg file (q-value replicable ORF and q-value replicable CRISPR): orf_crispr_pairs_q_replicable.csv

Heatmaps

The values in the square indicate the average KG score

ORF similarities: [figure: orf_heatmap_cosine_Unknown CRISPR ORF Top 4_labels]

CRISPR similarities: [figure: crispr_heatmap_cosine_Unknown CRISPR ORF Top 4_labels]
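A minimal sketch of how the inputs for such an annotated heatmap could be assembled: one grid of similarities for the colors and one grid of text labels holding the average KG score. The function name, the gene names, and all values are hypothetical, and marking pairs without a KG score with "?" is an assumption about how missing values are shown.

```python
def build_heatmap_tables(genes, similarity, kg_mean):
    """Return (values, labels): nested lists indexed in the order of `genes`."""

    def lookup(table, a, b):
        # Gene pairs are unordered, so check both orientations of the key.
        return table.get((a, b), table.get((b, a)))

    values, labels = [], []
    for a in genes:
        value_row, label_row = [], []
        for b in genes:
            if a == b:
                value_row.append(1.0)  # cosine similarity of a profile with itself
                label_row.append("")
                continue
            value_row.append(lookup(similarity, a, b))
            kg = lookup(kg_mean, a, b)
            label_row.append("?" if kg is None else f"{kg:.2f}")
        values.append(value_row)
        labels.append(label_row)
    return values, labels


# Hypothetical genes and scores, for illustration only
genes = ["GENE_A", "GENE_B", "GENE_C"]
similarity = {("GENE_A", "GENE_B"): 0.6, ("GENE_B", "GENE_C"): -0.4, ("GENE_A", "GENE_C"): 0.3}
kg_mean = {("GENE_B", "GENE_A"): 0.25}

values, labels = build_heatmap_tables(genes, similarity, kg_mean)
for gene, label_row in zip(genes, labels):
    print(gene, label_row)
```

The two grids can then be handed to a plotting library; with seaborn, for example, `heatmap(values, annot=labels, fmt="")` accepts a grid of strings as annotations.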

Conclusions

Despite updated methodology and updated computations, similar clusters are identified, possibly with slightly different specific genes highlighted. Specifically:
- Cluster "GPR176, TSC22D1, DPAGT1, CHRM4": TSC22D1 and DPAGT1 are still in the top results; CHRM4 and GPR176 are included if p-value replicable CRISPRs are considered;
- Cluster "ISOC2, ECH1, UQCRFS1, BCAT2, SARS2": SARS2 and ECH1 are still in the top results; UQCRFS1 is included if p-value replicable CRISPRs are considered; ISOC2 is not replicable in ORFs; the KG scores for BCAT2's interactions with the other genes have increased significantly.

AnneCarpenter commented 7 months ago

Thanks for all this analysis! I think it will help to discuss the methodology and rationale when we are together.

I want to summarize that I think all 3 of these are interesting:

clusters/anti-correlations in ORF data only
clusters/anti-correlations in CRISPR data only
clusters/anti-correlations in both

In each case, we don't want to pay attention to genes that do not 'have a phenotype' (i.e., are not replicable).

In each case, we will want some examples that are well-known (high KG) and some that are novel (low KG), but emphasizing the latter for now because they are harder to find and will take time to follow up with biology experiments.

So you think we should pause work on #15 #16 #17 until after we meet?

jessica-ewald commented 7 months ago

Following this! I have started compiling information for the previously defined gene clusters, and from scanning the updated info it looks like some of it will still be useful, but I'll wait for confirmation before continuing.

tjetkaARD commented 7 months ago

@AnneCarpenter - accounting for the comments I have added to each specific issue https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/15 https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/16 https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/17

I think it is safe to proceed.

jessica-ewald commented 7 months ago

I'm afraid that I've gotten quite confused! I'll try and summarize what I do and don't understand.

There are often two paired heat maps with the same genes, one showing pairwise correlation in the ORF data and one showing pairwise correlation in the CRISPR data. The value in each cell corresponds to the strength of the KG connections between those two genes, with a "?" if that connection is not present in the KG. I assume that the color of each heat map cell corresponds to the correlation strength and direction (+/-) based on either the ORF or CRISPR morphological data. I'm unsure of:
- Which color corresponds to positive correlations?
- How was the list of genes in the heat maps chosen - were they the genes with the greatest disparity between the magnitude of morphological and KG similarity in the ORF data, the CRISPR data, or some combination of the two?
I'm unclear where exactly the three lists of genes (#15 ; #16 ; #17) are coming from. There are six different heat maps in the relevant issues (here, here, here, here, here, and here). I'm unsure of:
- whether all the heatmaps are still valid, or if some should be deleted because they are based on the previous KG / data that was not filtered for replicability
- which of the five heatmap posts each cluster comes from, and whether I should be looking at the ORF or CRISPR heatmaps (or both)
- sometimes I can find a cluster of genes in a heatmap that seems to correspond to one of the lists, but then there are other genes in the same cluster that are not included in the lists. Were the clusters filtered to remove connections that were explained by the KG? For example, the only heatmap that I can find with both POLRID and SPATA25 is here, but there are three other genes in that cluster.

Just want to clarify all of this before diving into databases/literature. Thanks in advance!

AnneCarpenter commented 7 months ago

Blue is positive, red is negative correlation. See #20 for our basic protocol to make the heatmaps.

I believe #15 #16 #17 have all come from this issue (where a signal was seen in ORF+CRISPR data, both) but it is possible that in some cases after revising our analysis one or the other result 'fell apart' (this may also happen with the chromosome arm correction currently happening for CRISPR data). This will all become verifiable when we have the power to make our own heatmaps and filter genes into them as we like, Tomasz is working on that. It will also allow expanding which genes are in a cluster (to include knowledge graph-positive gene pairs, which should provide helpful context.) Tagging @jessica-ewald and @Zitong-Chen-16

AnneCarpenter commented 7 months ago

Probably the actual task is finished in this issue (find clusters interesting in both ORF + CRISPR datasets) because we expect the clusters we already found to remain.

But leaving it open and assigning @zahrahanifehlou so that when the chromosome arm corrections are done, Tomasz can make the 'final' versions of these heatmaps.

zahrahanifehlou commented 7 months ago

The chromosome arm corrections are done (notebook). I also calculated the replicated genes and their similarities in the original CRISPR profile and the corrected profile. You can find them at this link

AnneCarpenter commented 7 months ago

(though please see my note on https://github.com/jump-cellpainting/morphmap/issues/162 before proceeding to use them)
