broadinstitute / jump_hub

Collection of JUMP documentation and projects for internal and public consumption

RAB30-NAT14: exploration of Evotec for MorphMap paper #9

Closed AnneCarpenter closed 3 months ago

AnneCarpenter commented 11 months ago

(Anne) RAB30-NAT14: Decided not to pursue because it was in Evotec's hit list before doing much normalization. Emailed Evotec on Nov 20, 2023 for permission to share with others.

No papers with both genes mentioned.

Looked up NAT14 papers and chose the 2 from the past decade (see email).

RAB30 has more papers (20 in the past decade). Only one senior author has 2 papers in the past decade: Nakagawa I. A second lab has the only paper that jumps out as cell bio/molec (vs genetics or a review paper).

RAB30 regulates PI4KB (phosphatidylinositol 4-kinase beta)-dependent autophagy against group A Streptococcus. Nakajima K, Nozawa T, Minowa-Nozawa A, Toh H, Yamada S, Aikawa C, Nakagawa I. Autophagy. 2019 Mar;15(3):466-477. doi: 10.1080/15548627.2018.1532260. Epub 2018 Oct 18. PMID: 30290718. Free PMC article.
Here, we elucidate a novel property of RAB30: the ability to recruit PI4KB (phosphatidylinositol 4-kinase beta) to the Golgi apparatus and GcAVs. ...Furthermore, we identify an interaction between RAB30 and PI4KB, in which the knockdown of RAB30 decreased the …

A role for Rab30 in retrograde trafficking and maintenance of endosome-TGN organization. Zulkefli KL, Mahmoud IS, Williamson NA, Gosavi PK, Houghton FJ, Gleeson PA. Exp Cell Res. 2021 Feb 15;399(2):112442. doi: 10.1016/j.yexcr.2020.112442. Epub 2021 Jan 5. PMID: 33359467.
Rab30 is a poorly characterized small GTPase. Here we show that Rab30 is localised primarily to the TGN and recycling endosomes in a range of cell types, including primary neurons; minor levels of Rab30 were also detected throughout the Golgi stack and early …

niranjchandrasekaran commented 6 months ago

connections

I am not sure if this is still seen in the Evotec results, but there is a strong correlation between NAT14 and RAB30 in ORF (0.7)

niranjchandrasekaran commented 6 months ago

Notebook

This connection is novel.

The heatmap shows the percentile of the cosine similarities (1 → similar, 0 → anti-similar). The text is the maximum of the absolute KG scores (gene_mf__go, gene_bp_go, gene_pathway). I set a KG threshold of 0.4 (as we had previously). If a connection has a score lower than this threshold, it is considered unknown. The KG scores were downloaded from Google Drive: ORF and CRISPR. The diagonal of the heatmap indicates whether a gene has a phenotype (False could also mean the gene is not present in the dataset).
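A minimal sketch of the "known vs. unknown" rule described above, assuming the KG scores are loaded into a pandas DataFrame with one row per gene pair. The column names `gene_1` and `gene_2`, and the function name, are hypothetical; the score column names and the 0.4 threshold are taken from the comment. This is not the notebook's actual code.

```python
import pandas as pd

KG_THRESHOLD = 0.4  # threshold quoted in the comment above

# Score columns as named in the comment; the real file layout is an assumption.
SCORE_COLS = ["gene_mf__go", "gene_bp_go", "gene_pathway"]


def label_connections(kg_scores: pd.DataFrame, threshold: float = KG_THRESHOLD) -> pd.DataFrame:
    """Take the maximum absolute KG score per gene pair and flag pairs below the threshold as unknown."""
    out = kg_scores[["gene_1", "gene_2"]].copy()  # hypothetical pair columns
    out["max_abs_kg"] = kg_scores[SCORE_COLS].abs().max(axis=1)
    # A score lower than the threshold means the connection is treated as unknown.
    out["known"] = out["max_abs_kg"] >= threshold
    return out
```

Taking the maximum of the absolute scores means a single strong score in any of the three sources is enough to mark a connection as already known.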

ORF

ORF-connections-NAT14-RAB30

CRISPR

CRISPR-connections-NAT14-RAB30

AnneCarpenter commented 5 months ago

It looks like the next step is to email researchers working on these; I asked Holger at Evotec to do so but he never replied so the thread was lost. Can you recap the story, suitable for pasting into an email?

I think it's something like this: "Overexpression of these two genes yields morph profiles that strongly correlate, and a relationship between the genes is unknown. We do not see a morph impact of either gene when knocked down by CRISPR."

unnamed

Is that all there is to say? @niranjchandrasekaran

jessica-ewald commented 4 months ago

Hi @AnneCarpenter - I just searched both of these genes in many databases and came to the same conclusions as you. RAB30 seems fairly well characterized, while NAT14 has only predicted function. The top papers related to NAT14 are either generic functional genomics studies where it was one hit out of many, or papers that rule out NAT14 for particular functional roles within pathways.

I think that your succinct summary above captures the situation. Are we still planning on reaching out to researchers?

AnneCarpenter commented 4 months ago

I don't think there is time to do so now, unfortunately, but this could still be a nice story for the paper.

I think @niranjchandrasekaran would need to make sure the relationship still holds in the latest data first, then run the analysis to show what features are key.

I think looking at the images themselves (guided by what features are key) may be enough of a story if it's a visible phenotype. The fact that Golgi seems involved indicates it may be visible in Cell Painting!

jessica-ewald commented 4 months ago

Ok, sounds good. I will find images of these two perturbations, and wait for @niranjchandrasekaran to extract key features.

AnneCarpenter commented 4 months ago

Great - Alán's tools will help you get images, though the colors will be merged which may not suit the goal.

jessica-ewald commented 3 months ago

So it's not clear to me that there is a visible phenotype from the images, but maybe someone else with more experience looking at images can see something. Here are RAB30 and NAT14, along with controls from the same plate:

image

image

image

image

AnneCarpenter commented 3 months ago

awesome, let's see what Niranj's list of features tells us so we know what to look at.

AnneCarpenter commented 3 months ago

ps @jessica-ewald would it be useful to note how you retrieved these images in case others do the same?

jessica-ewald commented 3 months ago

I can add the notebook to the repo with a pull request. I used Alán's library to retrieve the images, then wrote my own function to rescale/display them.

jessica-ewald commented 3 months ago

Notebook with function here: https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/blob/main/notebooks/display_orf_images.ipynb

niranjchandrasekaran commented 3 months ago

Notebook

Here are the lists of top features that are significantly different between NAT14, RAB30 and the negative controls: sorted by p-values of NAT14 features and sorted by p-values of RAB30 features.

Here is the list of all features that are significantly different for both genes.

Note: I removed ObjectNumber, Location (features with X and Y in their names) and correlation features. I have also removed features which measure similar things.

I also looked at the similarity of features, grouped by feature groups, compartment and channels, between the two clusters in ORFs. The similarity is low for AreaShape features. All other feature groups are similar.

NAT14-RAB30_area_size_compartment

NAT14-RAB30_feature_group_channel

AnneCarpenter commented 3 months ago

I couldn't open the list files; it wouldn't let me decompress them, which is surprising.

But overall based on the chart, this set of features looks fishy - very little change in Area/shape but then all the channels and all the texture/intensity features affected. I wonder what is going on and maybe looking at the features will make it more clear, esp the list of all features that are significantly different for both genes.

@niranjchandrasekaran can you please remind me what the plot shows? You said "similarity of features" (would that be correlations?) But I wonder if the plot might instead be the "list of all features that are significantly different for both genes" categorized (but then I'm not sure what the numerical value is).

niranjchandrasekaran commented 3 months ago

> can you please remind me what the plot shows? You said "similarity of features" (would that be correlations?) But I wonder if the plot might instead be the "list of all features that are significantly different for both genes" categorized (but then I'm not sure what the numerical value is).

The way I do it is to create mini profiles using only those features in each group/compartment/channel and then find the cosine similarity between these profiles for the two genes.
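A rough sketch of the mini-profile comparison described above, assuming each gene's profile is a pandas Series indexed by CellProfiler-style feature names of the form `Compartment_FeatureGroup_...` (for example `Cells_Intensity_MeanIntensity_AGP`). The function names and the `part` parameter are hypothetical, and grouping by channel would need a different rule for parsing feature names.

```python
import numpy as np
import pandas as pd


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two vectors; NaN if either has zero norm."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else float("nan")


def grouped_similarity(profile_a: pd.Series, profile_b: pd.Series, part: int) -> pd.Series:
    """Cosine similarity between two profiles, computed separately per group of features.

    part is the position in the underscore-split feature name used for grouping:
    0 = compartment, 1 = feature group (under the naming assumption above).
    """
    names = profile_a.index.to_series()
    groups = names.str.split("_").str[part]
    return pd.Series({
        group: cosine_similarity(profile_a[features], profile_b[features])
        for group, features in groups.groupby(groups).groups.items()
    })
```

Each value is a cosine similarity between the two genes restricted to one subset of features, so a low value for one group means the two profiles disagree on those features specifically.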

AnneCarpenter commented 3 months ago

Thanks! In check-in, Niranj noted that these two genes are nearby (same column, same plate), so it does point to a technical artifact. The next step is for Jess to check whether other similar genes are also nearby (or, more generally, to look at similarity to these two genes in a plate layout view). It may rule out this cluster but also point to some technical issue that needs filtering.
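One way the proximity check described above could be sketched, assuming a table of the most similar genes with `Metadata_Plate`, `Metadata_Well` and `Metadata_Batch` columns and well names made of one row letter followed by the column number (e.g. `G01`). The function name is hypothetical, and this is not the code used in the notebooks.

```python
import pandas as pd


def layout_overlap(neighbors: pd.DataFrame, query: pd.Series) -> pd.Series:
    """Fraction of neighbor wells that share the batch, plate, or plate column of the query well."""
    # Column number is everything after the single row letter, e.g. "G01" -> 1.
    neighbor_col = neighbors["Metadata_Well"].str[1:].astype(int)
    query_col = int(query["Metadata_Well"][1:])
    return pd.Series({
        "same_batch": (neighbors["Metadata_Batch"] == query["Metadata_Batch"]).mean(),
        "same_plate": (neighbors["Metadata_Plate"] == query["Metadata_Plate"]).mean(),
        "same_column": (neighbor_col == query_col).mean(),
    })
```

Fractions close to 1 for the top neighbors of a gene would suggest that layout, rather than biology, may explain the similarity.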

niranjchandrasekaran commented 3 months ago

Notebook

This isn't looking good. These are the top genes that are most similar to both RAB30 and NAT14. They are all either on the same plate or in the same batch. Also, many of them are in the same column. I think this story might be a dead end. I need to check the other “novel” connections to ensure the connections are not explained by layout.

| Metadata_Symbol | Metadata_Plate | Metadata_Well | Metadata_Batch |
|---|---|---|---|
| CLDN3 | BR00123947 | I01 | 2021_06_07_Batch5 |
| SRI | BR00123947 | O01 | 2021_06_07_Batch5 |
| SOCS2 | BR00123947 | M01 | 2021_06_07_Batch5 |
| ALMS1P1 | BR00123947 | C01 | 2021_06_07_Batch5 |
| PPCDC | BR00123947 | A01 | 2021_06_07_Batch5 |
| RAB30 | BR00123947 | G01 | 2021_06_07_Batch5 |
| NAT14 | BR00123947 | E01 | 2021_06_07_Batch5 |
| TREML2 | BR00123952 | I01 | 2021_06_07_Batch5 |
| IL26 | BR00123952 | C13 | 2021_06_07_Batch5 |
| IL26 | BR00123952 | C01 | 2021_06_07_Batch5 |
| ASPDH | BR00123952 | G01 | 2021_06_07_Batch5 |
| CRYGS | BR00123957 | C02 | 2021_06_07_Batch5 |
| TM2D2 | BR00123957 | C01 | 2021_06_07_Batch5 |
| CEP104 | BR00123957 | G01 | 2021_06_07_Batch5 |

AnneCarpenter commented 3 months ago

Really glad we caught it! If we are lucky this just means a single plate needs to be thrown out or something. I wonder what the best way is to get an overview of how things look after all the steps/batch correction we did. Didn't @alxndrkalinin look at some plate layouts early on (not sure whether it was ORFs, CRISPRs or compounds)? If so, Alex, can you point to the code you used to get an overview of features in plate layout format?

I think I would've recommended that we look at something like cell count, cell size, and then some random thing like cytoplasm mito intensity in plate layout view for every plate in a given dataset, just laying them all out to get a view.
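A small sketch of the plate layout view suggested above, assuming per-well profiles in a pandas DataFrame with `Metadata_Plate` and `Metadata_Well` columns and well names such as `A01`. The function names are hypothetical, and this is not the plotting code linked elsewhere in this thread.

```python
import pandas as pd


def plate_grid(wells: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Arrange one feature's per-well values as a plate-shaped grid (row letters by column numbers)."""
    tmp = pd.DataFrame({
        "row": wells["Metadata_Well"].str[0],
        "col": wells["Metadata_Well"].str[1:].astype(int),
        "value": wells[feature],
    })
    # Average in case a well appears more than once in the input.
    return tmp.pivot_table(index="row", columns="col", values="value", aggfunc="mean")


def plate_grids(profiles: pd.DataFrame, feature: str) -> dict:
    """One grid per plate, keyed by plate name, so all plates can be laid out side by side."""
    return {
        plate: plate_grid(group, feature)
        for plate, group in profiles.groupby("Metadata_Plate")
    }
```

Each grid can then be drawn as a heatmap, for example with matplotlib's `imshow`. Using a shared color scale across plates would make the plates directly comparable.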

jessica-ewald commented 3 months ago

So idk what's going on here, but when I pull location info for all genes using jump_portrait, it says that RAB30 and NAT14 are in the same physical wells:

image

This can't possibly be right - what does the metadata that @niranjchandrasekaran has say? I see the two wells in his screenshot above, but this further confuses me because in the jump_portrait metadata there are many wells per gene.

alxndrkalinin commented 3 months ago

Yep, here's the code for plotting feature values across plate layout, plus example notebook showing such plots, and plots that we generated for ORF data.

niranjchandrasekaran commented 3 months ago

Notebook

@jessica-ewald I suspect there is some kind of mapping error in jump-portrait. I checked the contents of the wells you shared in your screenshot. All wells that don't contain NAT14 or RAB30 seem to be negative controls.

| Metadata_Plate | Metadata_Well | Metadata_Symbol |
|---|---|---|
| BR00123947 | A21 | LUCIFERASE |
| BR00123947 | C05 | BFP |
| BR00123947 | C08 | BFP |
| BR00123947 | E01 | NAT14 |
| BR00123947 | E02 | LacZ |
| BR00123947 | E07 | HcRed |
| BR00123947 | E14 | LUCIFERASE |
| BR00123947 | F22 | LacZ |
| BR00123947 | G01 | RAB30 |
| BR00123947 | J05 | LUCIFERASE |
| BR00123947 | J11 | BFP |

niranjchandrasekaran commented 3 months ago

cc @afermg https://github.com/broadinstitute/2023_12_JUMP_data_only_vignettes/issues/9#issuecomment-2276305502

jessica-ewald commented 3 months ago

@niranjchandrasekaran we figured it out. There was an extra flag in the function call that I didn't know to use to retrieve the perturbation instead of the negative controls matched to the perturbation of interest.

So - this connection still seems like it could be explained by well position, but at least there isn't something totally wrong going on here 😁

niranjchandrasekaran commented 3 months ago

> I think I would've recommended that we look at something like cell count, cell size, and then some random thing like cytoplasm mito intensity in plate layout view for every plate in a given dataset, just laying them all out to get a view.

@AnneCarpenter Erin previously ran this: https://github.com/jump-cellpainting/morphmap/issues/6#issuecomment-2136035514

jessica-ewald commented 3 months ago

Ok - I'm going to drop pursuing this story from my list. Let me know if there are any other action items for me!

AnneCarpenter commented 3 months ago

Thanks, Niranj, that link to Erin's issue helps. Here are the plate views for an intensity metric (MeanIntensity AGP in cells); the plate of interest here is at the top left (BR00123947). The first column has all of the X01 wells. Honestly I was hoping this plate would be some obvious terrible outlier but that doesn't seem to be the case. Note it's a bit hard to tell in these plots because each plate has its own scalebar - most seem to have most samples between -2 and +2, and most have the same pattern of lower values in the middle/lower part of the plate than in the upper part/sides.

Now, I don't know exactly what stage of profiles these are (the issue says before sphering and harmony - that's the only way we can get feature names), but I guess if the profiles we're looking at in these plate layouts are AFTER plate layout correction, then it's not great that a relatively subtle plate layout effect like this yields something so misleading in the connections between genes.

2021_06_07_Batch5_Cells_Intensity_MeanIntensity_AGP

I'm trying to recall and need @alxndrkalinin help - didn't we attempt a thing where we tried to mean-average each well position across all plates in the experiment? That would have gotten rid of this pattern but I don't know if that ended up in final profiles and/or if that step happens after sphering/harmony (I don't think that is the case).

AnneCarpenter commented 3 months ago

In the meantime, it's pretty clear this pairing of genes is artifactual, based on Niranj finding that the nearby wells all rank similarly highly to each other.

I think we should include a warning about this in the paper... I guess as a supplemental figure and a warning to check for any pairing if it can be explained by proximity in wells/plates/batches? I mean, ideally we would fix the data so this never happens, but since that's not practical I think all we can do is offer a way for people to check if this is what is happening. @niranjchandrasekaran could you write a sentence or two in the paper and a pointer to a supp figure suggesting the steps to avoid getting fooled?

niranjchandrasekaran commented 3 months ago

> I'm trying to recall and need @alxndrkalinin help - didn't we attempt a thing where we tried to mean-average each well position across all plates in the experiment? That would have gotten rid of this pattern but I don't know if that ended up in final profiles and/or if that step happens after sphering/harmony (I don't think that is the case).

Alex, correct me if I am wrong. We do subtract the mean feature value of each well position from each feature, and that is the first step along the current profile processing pipeline (for ORF and CRISPR).

> I think we should include a warning about this in the paper... I guess as a supplemental figure and a warning to check for any pairing if it can be explained by proximity in wells/plates/batches? I mean, ideally we would fix the data so this never happens, but since that's not practical I think all we can do is offer a way for people to check if this is what is happening. @niranjchandrasekaran could you write a sentence or two in the paper and a pointer to a supp figure suggesting the steps to avoid getting fooled?

Will do.

Closing this issue as we won't include this in the manuscript.

alxndrkalinin commented 3 months ago

> I'm trying to recall and need @alxndrkalinin help - didn't we attempt a thing where we tried to mean-average each well position across all plates in the experiment? That would have gotten rid of this pattern but I don't know if that ended up in final profiles and/or if that step happens after sphering/harmony (I don't think that is the case).

> Alex, correct me if I am wrong. We do subtract the mean feature value of each well position from each feature, and that is the first step along the current profile processing pipeline (for ORF and CRISPR).

We did implement this step, and it was first, but that was before @johnarevalo's optimization of all preprocessing steps. I'm not sure if it made it into the final version.

johnarevalo commented 3 months ago

Yes, it was included in the ORF and CRISPR pipeline. It is not part of the COMPOUND pipeline.

If the parquet file was produced with the Snakemake implementation, then the filename should have the wellpos string.
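For readers unfamiliar with the well-position step discussed above, here is a minimal sketch of the idea, assuming profiles in a pandas DataFrame with a `Metadata_Well` column and one row per well per plate. The function name and signature are hypothetical; this is an illustration of the concept, not the Snakemake pipeline's implementation.

```python
import pandas as pd


def subtract_well_position_mean(
    profiles: pd.DataFrame,
    feature_cols: list[str],
    well_col: str = "Metadata_Well",
) -> pd.DataFrame:
    """Subtract, for each feature, the mean over all wells at the same plate position."""
    corrected = profiles.copy()
    # transform("mean") returns, row by row, the mean of that row's well position across plates.
    well_means = profiles.groupby(well_col)[feature_cols].transform("mean")
    corrected[feature_cols] = profiles[feature_cols] - well_means
    return corrected
```

This removes any effect that is shared by the same well position across plates. It cannot remove position effects that differ from plate to plate.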

niranjchandrasekaran commented 3 months ago

Notebook

Strong plate layout effects.

ORF-plate-layout-CLDN3-SRI-SOCS2-ALMS1P1-PPCDC-RAB30-NAT14-TREML2-IL26-ASPDH-CRYGS-TM2D2-CEP104

spurious-connections-CLDN3-SRI-SOCS2-ALMS1P1-PPCDC-RAB30-NAT14

AnneCarpenter commented 3 months ago

@jessica-ewald IIUC we can close this?

jessica-ewald commented 3 months ago

I think this is already closed - we just had extra comments afterwards.