YAP1 connections: exploration for MorphMap paper (ORF but need to check CRISPR)

AnneCarpenter commented 8 months ago

From the file in issue #7 MorphMap_gene_gene_scoring_data_with_CRISPR.xlsx ... we see several tabs of YAP1 connections. Anne should explore these.

AnneCarpenter commented 7 months ago

I am realizing @auranic that I need more orientation to the YAP tabs in the document linked above, before I can dig in to find some biology here. Do you mind writing out the description of what you've done in those tabs (defining the column names, etc) so I can get started?

It might be quicker to chat in real time, but then we will need a description for the paper anyway so I think it's ideal if you can write it. (Also note there is a Sheet1 and Feuiil5 tab at the end; I am guessing I should ignore those.)

auranic commented 7 months ago

Hi Anne,

I can formally describe what we have in this table and specifically about the YAP connections, but I am afraid the table is not up to date. We reworked it in several respects, and a later (but not the latest) version was sent to you last week, taking into account our exchanges. The changes are:
1) We focused only on gene-gene connections between the 4850 genes that passed the quality check in the ORF dataset.
2) We reworked the scoring of the functional similarity between genes computed from the KG, taking into account the reliability of predictions.
3) We understood that the connection between YAP1 and NFKB that we tried to "explain" with the KG is not that interesting to follow up since it is already 'known'.
4) We included gene-gene similarities for 1058 genes from the CRISPR analysis.

So what I would suggest is to send you yet another update of the table collecting all scores for 4850 genes from ORF and 1058 genes from CRISPR + accompanying text in the format that could be incorporated into the manuscript + some suggestions of the figure panels to quantify/illustrate the table. Would it work for you?

AnneCarpenter commented 7 months ago

Yes, absolutely that would be great overall. I can look up the YAP1 connections therein and we can use it as a "this validates what is already known" (because we focused on YAP1/NFKB in our prior paper) so it's a nice non-novel vignette.

It is definitely time to start making material to put into the paper so it is very exciting to be finalizing everything!

AnneCarpenter commented 7 months ago

I'm assigning this one to @auranic - please assign it back to me when you've got the new version ready for me to look at.

auranic commented 7 months ago

@AnneCarpenter I am a bit confused here. In principle, everything is there in the global table of KG functional scores for all gene pairs that I passed to @tjetkaARD .

Can you point to me the place in Results where documenting that our KG GNN model can predict a connection between YAP1 and NFKB pathway would be appropriate? Then I can write a small paragraph on this and mark it for your revision. Would it work like this? Anyway, I will report on this here a bit later.

AnneCarpenter commented 7 months ago

Sorry to not be clear: if someone can give me a rank ordered list of most-similar and least-similar genes to YAP1 (in ORF and CRISPR), together with KG scores so I know what is new, I may be able to work with collaborators to do a small experiment to confirm any novel connections. This would go as a new section towards the end as a vignette about a discovery from the data. It's quite possible you've provided all the info needed but as a biologist I need an excel file or plain text list to look at :D Does that make it all clear?

auranic commented 7 months ago

My apologies, this was my misunderstanding. Then let me clarify the situation with YAP1:

1) The Excel file with the strongest ORF links with YAP1 is here: https://docs.google.com/spreadsheets/d/1fRETHuCjEUBqkh-UiH2f7WlzZbTmPRVD/edit?usp=sharing&ouid=118212717540670809962&rtpof=true&sd=true
2) In the CRISPR dataset I do not see the YAP1 gene; it seems to be filtered out.
3) For ORF data, we do NOT see a strong connection to the NFKB pathway. The strongest is the link to the IKBKB gene, with ORF similarity score=0.294.
4) In the top ORF links we see connections of YAP1 to 9 actin binding proteins (enrichment pval=10^-5, CORO2A, CORO2B, CNN1, CNN2, WASL...) and 10 protein serine/threonine kinases (pval=10^-7, PRKCE, PRKD1, MAP2K6, PAK5, STK17A, GRK2, etc.).
5) From KG scoring, the connection of YAP1 to protein kinases looks "explained" while the connection to actin binding is "not explained".
6) Our GNN model for predicting the pathway of a gene (not using ORF data in any way) predicts "TRAF6 mediated NF-kB activation" for YAP1 as one of the top links.
7) Of possible interest is the negative ORF correlation of YAP1 with FOXP2 (-0.389) and FOXP3 (-0.517), well "explained" from the KG (score=0.957). This can be an indirect connection to inflammation and NFKB through RELA.

I let you judge if this justifies YAP1 for a "vignette" and sorry for the confusion, these observations were presented not systematically and probably led to a wrong conclusion that ORF data confirms known YAP1-NFKB pathway connection (this is not true), explained by Knowledge graph (this is true).
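
For reference, enrichment p-values of the kind quoted in point 4 are commonly obtained from a hypergeometric test. The sketch below is only an assumption about the method (the thread does not say which test was used), and the function name and the example counts in the comment are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, selected, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    population: number of genes in the background (e.g. all ORF genes that passed QC)
    annotated:  number of background genes carrying the annotation (e.g. 'actin binding')
    selected:   number of genes in the list of interest (e.g. top YAP1 links)
    overlap:    number of selected genes that carry the annotation
    """
    max_overlap = min(annotated, selected)
    total = comb(population, selected)
    # Sum the probability of every outcome at least as extreme as the observed one.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical call: 4850 background genes, 164 top links, 9 annotated hits,
# and an assumed (not sourced) 100 annotated genes in the background.
# p = hypergeom_enrichment_pvalue(4850, 100, 164, 9)
```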

AnneCarpenter commented 7 months ago

Ah, ok! So I understand: what threshold do you consider explained vs unexplained? It looks like the vast majority of pairs have KG above 0.5 which I thought would be 'high' but I think that must not be so, based on what you're saying about actin binding proteins' connections being unexplained. Once I understand what threshold seems fairly novel, I will look at the gene connections below that KG threshold but still high morphology similarity - those are the new connections I would be excited to pursue.

No problem about the NFkB connection - our goal here wasn't to prove that this connection to YAP1 exists, as it's already proven in the literature. For what it's worth, in our past work we saw ORFs for TRAF2 and CDC42 (and I think STK3) were 'opposite' of YAP1 and WWTR1 and STK11. But it's possible some of these weren't in this experiment or didn't reach our threshold for 'having a phenotype'. (Rohban, et al. eLife 2017).

AnneCarpenter commented 7 months ago

Oh! I see now even in these top 164 genes you gave me, WWTR1 is the second-strongest similar, and there are some STK's and CDC and TRAF genes too (different isoforms apparently). So that's all consistent w our past results.

I would love to see a heatmap of morph similarity of the style we've seen for other gene clusters, for these 164 genes, with the KG values on top. Can that be done? If so I can share it with our YAP1 collaborators and see if they see any genes of particular interest.

FWIW, it looks to me like the 3 most high-morph and low-KG are CORO2A, CORO2B and GMIP, circled in red here. Still I await your perception of what is 'novel' in KG score terms.

(just made a simple scatterplot in Excel!)

Also I believe YAP1 isn't even in the CRISPR experiment at all (rather than not having a phenotype and being filtered)

auranic commented 7 months ago

Hi Anne,

For the new scale of the KG score, here is the distribution of scores for gene_bp, for example (it is similar for gene_pathway):
1) For a randomly selected gene pair, the expected KG score is 0.25. For me, this value therefore corresponds to an 'unexplained' gene pair.
2) Only 5% of pairs have a KG score >0.7, so I would call them 'explained' and suggest this threshold for the definition.
3) A KG score between 0.25 and 0.7 is something of a 'grey zone': the genes are closer in the KG than you would expect by random chance, but "the explanation is weak".
4) Negative or close-to-zero KG scores are "surprisingly distant" gene pairs, with functional similarity lower than one would expect by random chance; this corresponds to genes sitting at opposite peripheral regions of the KG.

So yes, pairs with YAP1 having KG score ~0.5 might not be called 'unexplained' but they are not that different compared to a randomly selected pair of genes.
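
The four bands above can be written down as a small lookup, which may help when scanning a list of pairs. This is a sketch only: the 0.7 and 0.25 cut-offs come from the description, while the `near_zero` tolerance (what counts as "close to zero") is an assumption made here for illustration.

```python
def classify_kg_score(score, explained=0.7, random_expectation=0.25, near_zero=0.05):
    """Map a KG score to the qualitative bands described in the thread.

    explained:          scores above this are called 'explained' (about 5% of pairs)
    random_expectation: expected score for a random gene pair
    near_zero:          ASSUMED tolerance for 'negative or close to zero'
    """
    if score > explained:
        return "explained"
    if score > random_expectation:
        return "grey zone"
    if score > near_zero:
        return "unexplained"
    return "surprisingly distant"


for s in (0.957, 0.5, 0.2, -0.1):
    print(s, classify_kg_score(s))
```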

Hope this makes sense to you

auranic commented 7 months ago

For YAP1 connections, here is the clustermap for abs(ORF similarities) > 0.4 (38 genes), with labels and colors corresponding to ORF similarity

here is the clustermap for the same genes, with labels and colors corresponding to KG score (max between all models)

(interestingly, there are two clusters, 'functionally similar' to YAP1 including kinases and 'functionally neutral' to YAP1 including actin binders)

and here is the clustermap for the same genes where the colors and clusters are defined based on ORF similarity but the label in the annotation is the KG score:

Is it what you asked for?

AnneCarpenter commented 7 months ago

Beautiful, exactly! I’m very excited to dig into this next week!! I sent a note to our U Penn collaborators Karin Eisinger and Ashley Fuller in case they can think of an experiment to do to confirm any of these novel connections.

"there are 3 connections to YAP1 that are the strongest morphological similarity and lowest amount of existing evidence (from literature, data sources, as captured by a knowledge graph): CORO2A, CORO2B and GMIP

The attached plot really just shows the bunch that are positively correlated to YAP1 and the ~8 that are negatively correlated. The color is morph similarity but note an unusual thing: we put numbers on top of the chart that indicate "Knowledge graph score": anything above ~0.7 is already pretty well known. That helps us quickly find things that are under ~0.5 and therefore novel/not-previously-known as good candidates to investigate. You can look at the YAP1 horizontal line to see all of these are morphologically strong and there are other genes beyond those three I mentioned that are relatively novel, too."

AnneCarpenter commented 7 months ago

We will also want the same visualizations for CRISPR data, once the profiles are corrected.

niranjchandrasekaran commented 3 months ago

Notebook

The heatmap shows the percentile of the cosine similarities (1 → similar, 0 → anti-similar). The text is the maximum of the absolute KG score (gene_mf__go, gene_bp_go, gene_pathway). I set a KG threshold (like we previously had) of 0.4. If a connection has a score lower than this threshold, it is considered to be unknown. The KG scores were downloaded from Google Drive: ORF and CRISPR. The diagonal of the heatmap indicates whether a gene has a phenotype (False could also mean the gene is not present in the dataset).
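
A minimal sketch of the annotation rule described above (maximum absolute KG score across models, then a 0.4 cut-off). The dictionary keys and example values are illustrative placeholders, not the notebook's actual column names.

```python
def max_abs_kg(scores):
    """Largest absolute KG score across the available models for one gene pair."""
    return max(abs(v) for v in scores.values())


def is_unknown(scores, threshold=0.4):
    """A connection is treated as 'unknown' when its max absolute KG score is below the threshold."""
    return max_abs_kg(scores) < threshold


# Illustrative values for one gene pair; keys are placeholders for the three models.
pair = {"gene_mf_go": 0.259, "gene_bp_go": 0.286, "gene_pathway": 0.31}
print(max_abs_kg(pair), is_unknown(pair))
```

Taking the absolute value means a strongly negative score in any one model is also enough to mark the pair as not "unknown".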

I first looked at the genes mentioned in this issue and other genes that were previously explored. The first one includes a bunch of genes that are known connections and the second one includes unknown connections.

ORF-connections-STK11-STK3-TRAF2-WWTR1-YAP1

ORF-connections-CORO2A-CORO2B-EBF1-FOXG1-FOXP3-GMIP-LCOR-TEX45-TFAP2A-VCAM1-YAP1

We wanted to check if some other experiment corroborated the unknown connections. We took gene expression values provided by our UPenn collaborators and compared them to the similarity values.

MorphMap_gene_gene_scoring_data.xlsx

But we did find the gene expression values to support the similarity values. The following are the details about the gene expression values provided by Ashley Fuller.

Attached is a spreadsheet with curated data from 2 datasets: 1) a gene expression microarray dataset from murine UPS tumors (Yap KO vs. Yap WT), and 2) the KP230 cell NB4A RNA-seq dataset from our Cell Systems paper (where we know Yap1 protein but not Yap1 gene expression was reduced following NB4A treatment). If a gene is not listed in a given dataset, it simply was not present.

For this first pass, I included the 8 genes negatively correlated with Yap1, and the 3 positively correlated genes lacking connections in the literature. To start, I would focus primarily on fold changes and slightly less on unadjusted p-values (if present). IMO the q-values/adjusted p-values are not super relevant to our purposes here, since we only care about a handful of genes yet adjustment was done for all genes/comparisons across the genome.

There are some genes that vary considerably in association with Yap1 status. Overall, changes in the cell line data are typically stronger than those in the bulk tumor data because the latter samples contained multiple cell types. The reported (log2) fold changes are only computed using the average of each set of replicates, but if you would like something in graph-form that depicts data from individual replicates, I can put that together for you.

Since these gene lists were from the previous version of the profiles, I decided to create a new list of known and unknown connections for YAP1. Since I don't have the expression values of all these genes, I thought I would first check the YAP1 coexpression values from https://coxpresdb.jp and see if they support the cosine similarity values. I plotted heatmaps of absolute cosine similarity and absolute coexpression values.

Known connections

YAP1-connections-known-cosine-similarity

YAP1-connections-known-coexpression

Unknown connections

YAP1-connections-unknown-cosine-similarity

YAP1-connections-unknown-coexpression

I didn't find a strong relationship between the cosine similarities and coexpression values. Perhaps the expression values from our UPenn collaborators will tell a different story.

cc @AnneCarpenter

AnneCarpenter commented 3 months ago

We will likely want to include this story in the paper so it's worth finalizing. We don't want to take the list of genes from prior analyses, but instead be sure to use the latest profiles to:

filter for 'has a phenotype'
select the correlating/anti-correlating genes to YAP1 in ORF data in morphology space
plot them in a heatmap with KG scores so we know what's new (& with similarity values in case that's what actually goes in the figure)
plot the co-expression data (from the public source) for those genes, as you've done in a heatmap
repeat for CRISPR data
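
The first three steps above could be sketched roughly as follows. This is a hypothetical outline using plain dictionaries; the function name, input layout, and `n` are assumptions, and the real analysis would read these values from the latest profiles and KG files.

```python
def select_connections(similarity, has_phenotype, kg_score, n=10):
    """Keep the n most similar and n most dissimilar genes among those with a phenotype.

    similarity:    dict gene -> cosine similarity to the query gene (e.g. YAP1)
    has_phenotype: dict gene -> bool, True if the gene passed the phenotype filter
    kg_score:      dict gene -> KG score for the (query, gene) pair; may lack some genes
    """
    # Step 1: filter for 'has a phenotype'.
    active = {g: s for g, s in similarity.items() if has_phenotype.get(g, False)}
    # Step 2: rank by similarity and take both tails.
    ranked = sorted(active, key=active.get, reverse=True)
    similar = ranked[:n]
    dissimilar = [g for g in reversed(ranked) if g not in similar][:n]
    # Step 3: attach KG scores so novel connections can be spotted in the heatmap.
    rows = []
    for group, genes in (("similar", similar), ("dissimilar", dissimilar)):
        for g in genes:
            rows.append({"gene": g, "similarity": active[g],
                         "kg": kg_score.get(g), "group": group})
    return rows
```

The same function could be called once with ORF similarities and once with CRISPR similarities to cover the last step.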

If it's still true that "there are 3 connections to YAP1 that are the strongest morphological similarity and lowest amount of existing evidence (from literature, data sources, as captured by a knowledge graph): CORO2A, CORO2B and GMIP" then we may want to dive into those 3 a bit more by checking the literature and/or plotting expression data from Penn for just those genes (we only have Yap-perturbed samples in the Penn data, so we cannot look at co-expression across a wide range of samples as you did in the heatmaps here).

Does this make sense?

niranjchandrasekaran commented 2 weeks ago

Notebook

Here are the top 10 similar and dissimilar genes to YAP1 in the ORF dataset that have a low score in the KG. The values in the heatmap are the KG values. YAP1 is not present in the CRISPR dataset.

ORF-connections-AHDC1-B4GAT1-CNN1-CNN2-CORO2A-CORO2B-DELE1-FUT8-GRK2-INSYN1-LCOR-MYCT1-P2RX2-RNF19B-RTKN-SCAMP1-STK17A-SYT1-VGLL3-VGLL4-YAP1

Many genes that have similar names, like CNN1 and CNN2, somehow score poorly in the KG. I am not sure why that happens.

Here are the coexpression values for these genes

YAP1-similar-dissimilar-coexpression

The coexpression values don't correlate strongly with the cosine similarity values.

YAP1-similarity-coexpression

AnneCarpenter commented 2 weeks ago

I re-digested the whole thread, esp your May 24 message, and realized we can use mRNA data in three ways and I’m not sure which you did: 1- does gene X change in expression when YAP1 is perturbed? 2- does the pattern of mRNA expression levels when gene X is perturbed match the pattern seen when YAP1 is perturbed? 3- does gene X expression change in a similar way as YAP1 expression across a bunch of perturbations?

Sounds like the coexpression data you used before was looking for (3)? (Or maybe 2?) And that the “absolute cosine similarity” was a success among known YAP connections because the entire top bar (YAP1) is blue, meaning there is strong similarity of all these genes’ coexpression with YAP1? Not sure why Absolute coexpression value is so much less dramatic. Then, for the unknown connections, we again see a pretty decent blue bar across top. Oh hang on, I think absolute cosine similarity refers to morph whereas absolute coexpression values refers to mRNA?? That makes more sense why you called this a dud, then!

I think I trust the Penn data as being much more relevant so we should do that analysis, but IIUC we can only look for (1) because we only have a subset of mRNA values. I think I would prioritize their first dataset (gene expression microarray dataset from murine UPS tumors (Yap KO vs. Yap WT)) rather than their second (KP230 cell NB4A) but running both is great if convenient.

If it's still true that the 3 most high-morph and low-KG are CORO2A, CORO2B and GMIP we could look at their images & features.

We should also do a plate layout analysis here. I see INSYN1 in this set which is making me a bit nervous since it showed up in several other clusters (seems to be cancer related so could be general tox/cell growth phenotype).

A comment made above that is likely to go into the paper since it seems still true is this: In the top ORF links we see connection of YAP1 to 9 actin binding proteins (enrichment pval=10^-5, CORO2A, CORO2B, CNN1, CNN2, WASL...) and 10 protein serine/threonine kinases (pval=10^-7, PRKCE, PRKD1, MAP2K6, PAK5, STK17A, GRK2, etc.). From KG scoring, connection of YAP1 to protein kinases looks "explained" while connection to actin binding is "not explained”.

For the paper we will want the heatmap to show the known + unknown so we can comment on both (and feel reassured that some strong known ones are there - which sounds like “WWTR1 is the second-strongest similar, and there are some STK's and CDC and TRAF genes too (different isoforms apparently).” (From my notes above).

niranjchandrasekaran commented 2 weeks ago

Sounds like the coexpression data you used before was looking for (3)? (Or maybe 2?) And that the “absolute cosine similarity” was a success among known YAP connections because the entire top bar (YAP1) is blue, meaning there is strong similarity of all these genes’ coexpression with YAP1? Not sure why Absolute coexpression value is so much less dramatic. Then, for the unknown connections, we again see a pretty decent blue bar across top.

Yes, the coexpression data I am using is looking for 3 (https://academic.oup.com/nar/article/51/D1/D80/6814455).

Oh hang on, I think absolute cosine similarity refers to morph whereas absolute coexpression values refers to mRNA?? That makes more sense why you called this a dud, then!

That's right.

I think I trust the Penn data as being much more relevant so we should do that analysis, but IIUC we can only look for (1) because we only have a subset of mRNA values. I think I would prioritize their first dataset (gene expression microarray dataset from murine UPS tumors (Yap KO vs. Yap WT)) rather than their second (KP230 cell NB4A) but running both is great if convenient.

I believe the file that was previously shared only contains the expression levels for a handful of genes that we were previously interested in. Can I get the whole dataset?

If it's still true that the 3 most high-morph and low-KG are CORO2A, CORO2B and GMIP we could look at their images & features.

Will do.

We should also do a plate layout analysis here. I see INSYN1 in this set which is making me a bit nervous since it showed up in several other clusters (seems to be cancer related so could be general tox/cell growth phenotype).

Will do.

For the paper we will want the heatmap to show the known + unknown so we can comment on both (and feel reassured that some strong known ones are there - which sounds like “WWTR1 is the second-strongest similar, and there are some STK's and CDC and TRAF genes too (different isoforms apparently).” (From my notes above).

Sounds good.

AnneCarpenter commented 2 weeks ago

I've asked Penn!

AnneCarpenter commented 2 weeks ago

"Here's the mouse tumor dataset. The dataset is publicly available (NCBI GEO: GSE109920). Columns K-T represent tumors from individual mice. Fold changes in column H show KP (control) vs. KPY (Yap1 knockout). Recall that the knockout is only in UPS cells, so the other cell types in the tumor do express Yap1." KP vs. KPY tumor differential expression_5309results.xlsx

I'm not sure if this suffices to address the analysis; if not LMK! @niranjchandrasekaran

niranjchandrasekaran commented 1 week ago

Notebook

Here are the results from the feature group similarity analysis and the cell images of the four genes. Similarity of the four genes is primarily due to nuclear features.

YAP1_subset_area_size_compartment

YAP1_subset_feature_group_channel

facet_grid_montage_YAP1

Notebook

Plate layout of the YAP1 cluster looks fine.

ORF-plate-layout-YAP1

niranjchandrasekaran commented 1 week ago

I'm not sure if this suffices to address the analysis; if not LMK!

Will take a look.

AnneCarpenter commented 1 week ago

Awesome, rarely do we see such a clear feature analysis outcome! Could we add neg controls from the same plate as YAP1 to the picture grid, to compare visually? And I assume the analysis was "how are this group of 4 different from neg control"?

niranjchandrasekaran commented 1 week ago

Notebook

The fold change values look better compared to the coexpression values against phenotypic cosine similarity (R^2 is still only 0.06). The right most point is CORO2A.
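
For readers who want to reproduce an R^2 like the one quoted, here is a minimal sketch that computes the squared Pearson correlation between two equal-length lists. It assumes the R^2 in the plot is from a simple linear fit of one variable on the other, which the thread does not state explicitly.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when one variable is constant")
    return (sxy * sxy) / (sxx * syy)


# Usage (hypothetical inputs): absolute cosine similarity vs. absolute fold change,
# one value per gene, in the same gene order.
# r2 = r_squared(abs_similarity, abs_fold_change)
```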

YAP1-similarity-fold-change

AnneCarpenter commented 1 week ago

What data points are shown in this plot? It's clearly a subset but seems to include some not-strong correlators so I wondered.

AnneCarpenter commented 1 week ago

If it's the top 10 similar and dissimilar genes to YAP1 in the ORF dataset that have a low score in the KG, then it'd be great to show which points are similar and which are dissimilar, and also to show which are pos and neg fold change (maybe color for one and shape for the other?) so we can see the directionality. Or you could just show non-absolute for both axes and maybe we can get a view of what's going on.

Overall, looks like maximum is 50% fold change? I'm not sure if that is dramatic but can run past the collaborators once the other Qs here are addressed.

niranjchandrasekaran commented 1 week ago

What data points are shown in this plot? It's clearly a subset but seems to include some not-strong correlators so I wondered.

Similarity of YAP1 to other genes (absolute cosine similarity) and fold change of that gene's expression when YAP1 is knocked out (absolute fold change). These are all top 10 similar and dissimilar genes in https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/10#issuecomment-2303137892.

If it's the top 10 similar and dissimilar genes to YAP1 in the ORF dataset that have a low score in the KG, then it'd be great to show which points are similar and which are dissimilar, and also to show which are pos and neg fold change (maybe color for one and shape for the other?) so we can see the directionality. Or you could just show non-absolute for both axes and maybe we can get a view of what's going on.

Will do.

niranjchandrasekaran commented 1 week ago

Notebook 2

Going back to the small YAP1 cluster (YAP1, CORO2A, CORO2B and GMIP)

Could we add neg controls from the same plate as YAP1 to the picture grid, to compare visually?

Here is the updated grid.

facet_grid_montage_YAP1

LUCIFERASE is from the same plate as YAP1.

And I assume the analysis was "how are this group of 4 different from neg control"?

In https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/10#issuecomment-2316050008, I was showing the median cosine similarity of the “mini” profiles of the four genes. I wasn't comparing them to negative control, though they all have a phenotype and therefore are different from negative control.
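
As a rough illustration of "median cosine similarity of the mini profiles", the sketch below computes the median over all pairwise cosine similarities within a group of profiles. How the notebook defines a "mini" profile (which feature subset) is not shown in this thread, so the inputs here are simply assumed to be equal-length feature vectors.

```python
from itertools import combinations
from math import sqrt
from statistics import median


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def median_pairwise_similarity(profiles):
    """Median cosine similarity over all gene pairs.

    profiles: dict gene -> feature vector restricted to one feature group
    """
    sims = [cosine_similarity(profiles[a], profiles[b])
            for a, b in combinations(sorted(profiles), 2)]
    return median(sims)
```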

To report which features distinguish the cluster from negative control, I computed the average number of features per channel that were significantly different from negative control. Even here, most features are DNA features.

cluster	ER_percent	Mito_percent	DNA_percent	AGP_percent	RNA_percent
YAP1	0.225597	0.063449	0.322668	0.197397	0.190889
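
The five values in this row sum to 1, so they read as the share of significant features falling in each channel. A sketch of one way such shares could be computed is below; it assumes CellProfiler-style feature names in which the channel appears as an underscore-separated token, and the example feature names in the comment are illustrative, not taken from the notebook.

```python
from collections import Counter

CHANNELS = ("ER", "Mito", "DNA", "AGP", "RNA")


def channel_fractions(significant_features):
    """Share of significant features per channel.

    A feature that names two channels (e.g. a correlation feature) is counted
    once for each, and features naming no channel are ignored, so the shares
    sum to 1 whenever at least one channel feature is present.
    """
    counts = Counter()
    for name in significant_features:
        tokens = name.split("_")
        for channel in CHANNELS:
            if channel in tokens:
                counts[channel] += 1
    total = sum(counts.values())
    if total == 0:
        return {channel: 0.0 for channel in CHANNELS}
    return {channel: counts[channel] / total for channel in CHANNELS}


# Illustrative names:
# channel_fractions(["Nuclei_Intensity_MeanIntensity_DNA", "Cells_AreaShape_Area"])
```

As noted later in the thread, shares like these can be confounded by how many features of each channel survive feature selection.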

niranjchandrasekaran commented 1 week ago

Notebook

Or you could just show non-absolute for both axes and maybe we can get a view of what's going on.

YAP1-similarity-fold-change_directionality

AnneCarpenter commented 1 week ago

Great, thank you - does one-fold change here mean it's actually not a change at all? (ie values between -1 and 1 are impossible)

AnneCarpenter commented 1 week ago

I found the answer in the raw data file: there are no values between -1 and 1, which means most of the values for our set of top/bottom 10 genes are not remarkable. I will ask the data providers how we might tell if, say, 1.5 is remarkable. I did compute that 1.5 is in the top ~2% of values, and -1.5 in bottom ~2% so maybe this is indeed notable. Of the 20 genes, only three are around 1.5 fold change: CORO2A (+ morph, +mRNA) and VGLL3 (-morph, -mRNA) and CNN2 (+morph, -mRNA). Disappointing that we don't see a consistency (ie + morph always with +mRNA or vice versa), but cell type might influence that.

We discussed today in checkin: we don't trust the feature analyses in this thread very much: One is based on "what do these 4 genes have in common" - if all profiles are normalized to neg controls, then I suppose this means whatever signal is leftover is probably what makes the genes distinct from neg controls, but I'm not certain - and the fact that it points to DNA features but all genes (except GMIP) look very similar to the neg control in the DNA channel makes me suspicious. We've never done this analysis so it's hard to be confident that this is a unique result.

The other is based on "what % of features in each category are significantly different for each of the 4 genes (or collectively)" but with feature selection the categories may have different proportions of strong vs weak features, making this potentially confounded.

Plus, it's all confounded by the fact that we are not using the final profiles but instead the one just before the steps where feature names become meaningless. Altogether, this makes us pretty cautious. For this cluster, since Mohammad spent so much time thinking about it and looking at images using the enriched single-cell analysis, it's probably better to go back to Rohban, Fuller et al Cell Systems 2022 and just read in plain text how YAP1 was distinctive and then see if we can see it by eye here in the images.

AnneCarpenter commented 1 week ago

I will follow up on the above. In the meantime, are we confident about the KG values? I see they are very low between pairs with similar names: VGLL3/4, CORO2A/B, CNN1/2. I'm suspicious that the KG doesn't see any of those pairs as likely strongly related!

niranjchandrasekaran commented 1 week ago

In the meantime, are we confident about the KG values? I see they are very low between pairs w similar names: VGLL3/4 CORO2A/B CNN1/2 I'm suspicious that the KG doesn't see any of those pairs as likely strongly related!

I was confused about this as well. I checked the files that Evotec shared with me and these seem to be the KG values.

niranjchandrasekaran commented 1 week ago

Notebook

Just a sanity check to ensure that the KG values of those pairs of genes are indeed low.

gene_1	gene_2	gene_mf	gene_bp	gene_pathway
CNN1	CNN2	0.259	0.286	0.31
CORO2B	CORO2A	0.343	0.236	0.187
VGLL3	VGLL4	0.064	0.081	0.088

AnneCarpenter commented 1 week ago

My colleague got back to me: fold change can be subject to noise so she recommends using Q values instead (these will not be directional but I suppose you can add neg and pos sign to it depending on the direction of fold change). I looked up a few of our strongest-morph-similarity genes and they were not significant by q value but if you could re-do the scatterplot using that column instead of fold change it would be more systematic to confirm whether there's truly nothing interesting to point out in this raw data.

"q-value is the metric for statistical significance. Q-values are just p-values corrected for the false discovery rate (5%).

You're correct that the fold-change computation is a little more complex than KP/KPY. They were computed using "Significance Analysis of Microarrays" (SAM), which was the gold-standard method for differential gene expression analysis in two-color probe-based microarrays. A bioinformatics expert did this particular analysis since it preceded my arrival at Penn by several years, but I had lots of experience with SAM in grad school and was able to back-calculate the fold changes in column H using Yap1 as an example. Essentially, the values in columns K-T are log2 values, so you have to unlog them. Then, taking the ratio of the averages (KP/KPY) gets you to within 1 decimal point of the fold-change in column H. Then a negative sign is added (-12.0 instead of 12.0) - this could be due either to a mathematical property that's escaping me or a setting in SAM. But what matters most is the interpretation: Yap1 is expressed 12-fold higher in KP tumors compared to KPY (AKA 12-fold lower in KPY compared to KP)."
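
The back-calculation described in the quote can be sketched as below. The signed convention used here (values of at least 1 for higher expression in KPY, and the negative reciprocal otherwise, so nothing falls between -1 and 1) is an assumption chosen to match the Yap1 example and the absence of values between -1 and 1 in the file; it is not a statement of how SAM computes its output.

```python
from statistics import mean


def signed_fold_change(kp_log2, kpy_log2):
    """Signed fold change of KPY relative to KP from log2 expression values.

    kp_log2:  log2 expression in KP (control) tumors
    kpy_log2: log2 expression in KPY (Yap1 knockout) tumors
    """
    # Un-log each replicate before averaging, as described in the quote.
    kp_mean = mean([2 ** v for v in kp_log2])
    kpy_mean = mean([2 ** v for v in kpy_log2])
    ratio = kpy_mean / kp_mean
    # ASSUMED convention: report decreases as the negative reciprocal.
    return ratio if ratio >= 1 else -1 / ratio


# A gene 8-fold higher in KP than in KPY is reported as -8.0.
print(signed_fold_change([3.0, 3.0], [0.0, 0.0]))
```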

niranjchandrasekaran commented 1 week ago

Notebook

but if you could re-do the scatterplot using that column instead of fold change it would be more systematic to confirm whether there's truly nothing interesting to point out in this raw data.

I didn't plot the values because none of our genes have a significant fold change.

Metadata_Gene_Symbol	Similarity	Metadata_qvalue
AHDC1	0.396364	60.3713
CNN1	0.476689	84.6134
CNN2	0.372812	84.6803
CORO2A	0.558896	65.8483
CORO2B	0.466192	84.6134
FUT8	-0.253937	87.586
LCOR	-0.286104	75.7986
LCOR	-0.286104	87.586
MYCT1	-0.274379	84.6134
P2RX2	-0.251989	65.8483
RNF19B	0.36091	84.6134
RTKN	0.422219	86.5241
SCAMP1	-0.258267	87.586
SYT1	0.406036	87.586
VGLL3	-0.253174	87.586
VGLL4	-0.254708	65.8483

So, I thought I would check the cosine similarity of all the genes that show a significant fold change. Here I report the percentile of the cosine similarity values instead of the values themselves. Percentiles near 100 and 0 are the interesting ones.

Metadata_Gene_Symbol	Metadata_Fold_Change	Metadata_qvalue	cosine_similarity_percentile
CEP72	1.38714	3.90329	0.991001
MTMR9	1.95756	4.4704	0.981028
SPATA7	1.76291	2.38277	0.816803
HNRNPDL	1.49019	0.815263	0.689962
ALKBH3	1.28788	1.73065	0.581412
FAM107A	1.36986	2.38277	0.502607
TAF1B	1.64345	3.90329	0.437225
TRAF6	1.59328	0.815263	0.402705
RBM25	2.87034	0	0.397906
TLK1	1.27575	4.4704	0.363413
PER1	2.53779	3.20803	0.35923
SRSF5	1.49755	3.90329	0.355995
FAM117B	1.61891	1.73065	0.245488
MECP2	1.26004	3.20803	0.206302
RBM26	1.68995	2.38277	0.165922
SSBP1	1.61139	2.38277	0.0916675
IL20RB	1.5808	2.38277	0.0634643
YAP1	-12.0403	0	0

The top and bottom 2 or 3 genes are the ones that show a significant fold change and also are either similar or dissimilar to YAP1. When I checked the KG scores, they were all below 0.4. So these might be novel connections.

gene_1	gene_2	KG_score
YAP1	RBM26	0.368
YAP1	IL20RB	0.325
YAP1	CEP72	0.353
YAP1	SSBP1	0.388
YAP1	MTMR9	0.2
YAP1	SPATA7	0.336
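
The selection described above (start from genes with a significant q-value, then look at where their similarity to YAP1 falls among all genes) might be sketched as follows. The percentile definition, the `tail` width, and the tuple layout are assumptions for illustration; q-values are taken to be in percent, as in the tables.

```python
def percentile_rank(value, population):
    """Fraction of population values strictly below `value` (0 = lowest, near 1 = highest)."""
    return sum(v < value for v in population) / len(population)


def significant_extremes(rows, all_similarities, q_max=5.0, tail=0.1):
    """Genes with q-value below q_max whose similarity percentile is in either tail.

    rows:             iterable of (gene, q_value_percent, cosine_similarity_to_YAP1)
    all_similarities: similarities of every gene in the dataset to YAP1
    tail:             ASSUMED width of each tail counted as 'extreme'
    """
    hits = []
    for gene, q_value, similarity in rows:
        if q_value >= q_max:
            continue  # not significant at the chosen FDR
        pct = percentile_rank(similarity, all_similarities)
        if pct <= tail or pct >= 1 - tail:
            hits.append((gene, pct))
    return hits
```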

Notebook

Here are cell images from our genes of interest (from https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/10#issuecomment-2303137892): facet_grid_montage_YAP1.pdf (Attaching a pdf because pngs aren't opening correctly).

Notebook

I also created the grid for the new cluster from above: facet_grid_montage_YAP1_2.pdf

AnneCarpenter commented 1 week ago

IIUC:

You started with all the genes with mRNA fold changes below 5% (q value, out of the ~65k rows of data, I assume you ignored the lines without a gene symbol)
filtered those for (a) in our ORF dataset and (b) had phenotypic activity above threshold; this yielded 18 genes
of those, 3 had significant morphology similarity to YAP1: that is more than expected by chance (5% by definition, unless the ignoring lines without a gene symbol messes that up).
and all 3 of those have low KG values (IIUC you didn’t filter for low KG value going into this analysis):

YAP1	CEP72	0.353
YAP1	IL20RB	0.325
YAP1	MTMR9	0.2

That seems a fair analysis to do, and yielding more than expected by chance!
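One way to put a number on "more than expected by chance" is a binomial tail, sketched here under the assumptions that each of the 18 genes independently has a 5% chance of passing the similarity threshold and that 3 did. The independence assumption and the helper name `binomial_tail` are mine, not part of the analysis above.

```python
from math import comb


def binomial_tail(n, k, p):
    """Probability of at least k successes in n independent trials with rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_genes = 18        # genes that passed the fold-change and activity filters
n_hits = 3          # genes called significantly similar to YAP1
chance_rate = 0.05  # per-gene rate of passing the threshold by chance

expected_hits = n_genes * chance_rate
p_value = binomial_tail(n_genes, n_hits, chance_rate)

print(f"expected by chance: {expected_hits:.2f} genes")
print(f"P(at least {n_hits} of {n_genes}) = {p_value:.3f}")
```

Under these assumptions the expected count is 0.9 genes and the tail probability is about 0.06. If YAP1 itself counts among the 18, rerunning with `n_genes = 17` changes the result only slightly.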

And you did this because:

starting with the genes that had high morph similarity to YAP1 (also filtered for phenotypic activity)
yielded zero genes with significant mRNA fold changes (by q value)

By eye I would say that YAP1/CEP72/MTMR9 have more sparkly AGP whereas luciferase/IL20RB are more blobby. I don’t notice differences in other channels. This could be due to the images being blurry vs sharp, so we should check the plate layout for this new set of 4 genes: CEP72, YAP1, IL20RB, MTMR9.

niranjchandrasekaran commented 1 week ago

Notebook

CEP72 and IL20RB are in the same well position on two different plates in the same batch. But none of the genes is in the same well position or on the same plate/batch as YAP1.

ORF-plate-layout-YAP1_2
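A layout check of this kind can be sketched as a simple grouping of genes by well position and by plate. The batch, plate, and well labels below are hypothetical placeholders chosen only to mirror the pattern described above; they are not the real ORF layout.

```python
from collections import defaultdict

# Hypothetical placeholder layout, NOT the real plate map.
layout = [
    {"gene": "YAP1", "batch": "batch_1", "plate": "plate_A", "well": "A01"},
    {"gene": "CEP72", "batch": "batch_2", "plate": "plate_B", "well": "B02"},
    {"gene": "IL20RB", "batch": "batch_2", "plate": "plate_C", "well": "B02"},
    {"gene": "MTMR9", "batch": "batch_3", "plate": "plate_D", "well": "C03"},
]


def shared(layout, key):
    """Group genes by key(row) and keep only groups with more than one gene."""
    groups = defaultdict(set)
    for row in layout:
        groups[key(row)].add(row["gene"])
    return {k: sorted(genes) for k, genes in groups.items() if len(genes) > 1}


same_well = shared(layout, lambda row: row["well"])
same_plate = shared(layout, lambda row: (row["batch"], row["plate"]))

print("genes sharing a well position:", same_well)
print("genes sharing a plate:", same_plate)
```

Genes that share a well position or plate are the ones where a position or plate artifact (such as focus differences) could mimic a morphological similarity.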

niranjchandrasekaran commented 1 week ago

Regarding https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/10#issuecomment-2318837076, @auranic said the following.

Firstly, our current approach consists of predicting protein function, not protein family. The knowledge graph that we use does not have explicit information about the content of protein families and is blind to their naming.

As you can see in the slides, the pairs that you mentioned are not that closely related according to STRING: only a weak co-citation relation for VGLL3/4, no physical interactions, no strong co-expression, etc.

We also verified the co-citation using the most recent version of Pubtator (and Pubtator data prior to 2020 was included in DRKG knowledge graph). I would characterize the literature connection between these pairs of proteins as weak.

For example, CORO2A and CORO2B have been co-cited in 13 publications, most of them after 2020, and most of the co-citations simply list coronins as a family. The physical interaction between CORO2A and CORO2B has been reported only in the Bioplex 2.0 resource and was included in IntAct with an MI-Score of 0.35 (which is low) and by association (by coip and via “complex expansion”), which is probably why it is not included in STRING DB. CORO2A and CORO2B are connected in the DRKG knowledge graph only through very generic Gene Ontology terms, the most specific of which is ‘actin binding’. They share no common Diseases, Anatomy, Compounds, Pathways, etc. The tissue expression patterns of the two genes seem to be quite different. Therefore, it is not surprising that the resulting KG score for them is low (0.343). You can see that for some other coronin pairs the KG score is higher (e.g., CORO1A/B = 0.779, CORO1B/2A = 0.564, CORO2A/1A = 0.522).

Of note, for the CNN1/2 pair, Pubtator confuses CNN with the abbreviation for Convolutional Neural Network, which results in 132 co-citations, most of them false positives.

Out of curiosity, I also ran a global analysis of all pairs of proteins with similar names (which most probably belong to the same family). It shows that for only roughly half of such pairs can one expect a connection “explained” by the KG score (>0.5), while for the other half the KG score can be low, and this depends largely on the family. E.g., for zinc fingers (ZNF), most of the pairs are not functionally related, while for the MAPK family all the pairs are very strongly connected. As another example, solute carriers (SLC) seem to be relatively well connected, but with a fraction of pairs characterized by a low KG score. Coronins (CORO) take a somewhat intermediate position in this sense (a bit less coherent than keratins, KRT, for example).

This assures us that the KG values that we have are what we would expect them to be.
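A rough sketch of the per-family summary described above, assuming a family can be approximated by the leading letters of the gene symbol and that a pair counts as “explained” when its KG score exceeds 0.5. The prefix rule and the function names are assumptions for illustration; the input uses only the coronin scores quoted in this comment.

```python
import re
from collections import defaultdict

# Coronin pair scores quoted above.
kg_pairs = {
    ("CORO2A", "CORO2B"): 0.343,
    ("CORO1A", "CORO1B"): 0.779,
    ("CORO1B", "CORO2A"): 0.564,
    ("CORO2A", "CORO1A"): 0.522,
}


def family_prefix(symbol):
    """Leading run of capital letters, e.g. 'CORO2A' -> 'CORO' (assumed family rule)."""
    match = re.match(r"[A-Z]+", symbol)
    return match.group(0) if match else None


def explained_fraction(kg_pairs, threshold=0.5):
    """Per family, the fraction of same-family pairs with KG score above threshold."""
    counts = defaultdict(lambda: [0, 0])  # family -> [pairs above threshold, all pairs]
    for (gene_a, gene_b), score in kg_pairs.items():
        family = family_prefix(gene_a)
        if family is None or family != family_prefix(gene_b):
            continue  # skip pairs that do not share a name prefix
        counts[family][1] += 1
        if score > threshold:
            counts[family][0] += 1
    return {family: hits / total for family, (hits, total) in counts.items()}


print(explained_fraction(kg_pairs))
```

On these four pairs, three of the four coronin scores exceed 0.5. A real run would need the full KG score table and a more careful definition of family membership than a name prefix.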

2024_0902_Evotec_KGinFamilies.pptx
