afarrel commented 3 years ago

What are the scientific goals of the analysis?

Run fusion_filtering module for cohort == GMKF | CBTN | PNOC and run project-specific filtering by cancer_group instead of broad_histology

What methods do you plan to use to accomplish the scientific goals?

Complete these modules to process data and prepare for ingestion into Pediatric Open Targets platform.

What input data are required for this analysis?

fusion-arriba.tsv.gz
fusion-starfusion.tsv.gz
histologies.tsv

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

2-3 days

Who will complete the analysis (please add a GitHub handle here if relevant)?

@runjin326

runjin326 commented 3 years ago

As I was trying to do a test run of the module, there is some discrepancies between the data files we have and the files that are called in the .sh: Files not present: polya_expression_file="${data_path}/pbta-gene-expression-rsem-fpkm.polya.rds" stranded_expression_file="${data_path}/pbta-gene-expression-rsem-fpkm.stranded.rds" normal_expression_file="${references_path}/Brain_FPKM_hg38_matrix.txt.zip" independent_samples_file="${data_path}/independent-specimens.wgswxs.primary-plus.tsv" histologies_file="${data_path}/pbta-histologies-base.tsv"

Files with name discrepancies: arriba_file="${data_path}/pbta-fusion-arriba.tsv.gz" starfusion_file="${data_path}/pbta-fusion-starfusion.tsv.gz" histologies_file="${data_path}/pbta-histologies.tsv"

Should we just modify the .sh to match the files in /data folder and ignore the files not present?

Additionally, I have the following questions: For changing project-specific filtering, we need to change values for "group" in 04 as well as related codes in 06 as well, correct?

For running fusion_filtering module by cohort type, could we use a filtered histologies.csv e.g., histologies_pnoc.tsv to run the module?

jharenza commented 3 years ago

As I was trying to do a test run of the module, there is some discrepancies between the data files we have and the files that are called in the .sh: Files not present: polya_expression_file="${data_path}/pbta-gene-expression-rsem-fpkm.polya.rds" stranded_expression_file="${data_path}/pbta-gene-expression-rsem-fpkm.stranded.rds"

these are now combined into one file as gene-expression-rsem-fpkm.rds

normal_expression_file="${references_path}/Brain_FPKM_hg38_matrix.txt.zip"

for neuroblastoma, we will have to generate this file for the GTEX adrenal gland - we could either make a new ticket to create this or you can make staggered PRs, with this being one of the first PRs you submit.

independent_samples_file="${data_path}/independent-specimens.wgswxs.primary-plus.tsv"

this, we also have to generate per #37 - @kgaonkar6 was going to work on this, but you can also feel free to take a crack at it if you would like

histologies_file="${data_path}/pbta-histologies-base.tsv"

this is now just histologies.tsv

Files with name discrepancies: arriba_file="${data_path}/pbta-fusion-arriba.tsv.gz" starfusion_file="${data_path}/pbta-fusion-starfusion.tsv.gz" histologies_file="${data_path}/pbta-histologies.tsv"

these all now do not have the pbta- prefix

Should we just modify the .sh to match the files in /data folder and ignore the files not present?

I think we should make a new shell script for openTargets so that we maintain the OpenPBTA files - thoughts @kgaonkar6 ?

Additionally, I have the following questions: For changing project-specific filtering, we need to change values for "group" in 04 as well as related codes in 06 as well, correct?

yes

For running fusion_filtering module by cohort type, could we use a filtered histologies.csv e.g., histologies_pnoc.tsv to run the module?

It looks like the code can just take the file without filtering (here, histologies.tsv).

runjin326 commented 3 years ago

Yes I meant if we want to run the module for cohort == GMKF | CBTN | PNOC, do we run separate analyses? Or how should we do it?

runjin326 commented 3 years ago

And for the expression matrix, since these are now combined into one file as gene-expression-rsem-fpkm.rds, we will need to modify the codes for that as well, correct?

jharenza commented 3 years ago

Yes I meant if we want to run the module for cohort == GMKF | CBTN | PNOC, do we run separate analyses? Or how should we do it?

No need to separate - I just meant that we don't want to include GTEX or TARGET or TCGA at this time, however, they will not be in the fusion files, so it may be OK to just use the histologies file as-is.

And for the expression matrix, since these are now combined into one file as gene-expression-rsem-fpkm.rds, we will need to modify the codes for that as well, correct?

yes

runjin326 commented 3 years ago

Oh gotcha thanks that makes sense! For generating FPKM matrix for GTEX adrenal gland, could you please give a brief description as to how to do it? Sorry about all the questions!

kgaonkar6 commented 3 years ago

@runjin326 I think you can use gene-expression-rsem-tpm-collapsed.rds from the v5 release download-data.sh currently in the PR instead.

However this file is slightly different than the one we've used before. The v5 file only has rownames X sample as a matrix. The existing code in the repo had a matrix where 1st row was EnsembleID_GeneSymbol.

So you will need a few code updates like: https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/5809781824c4804a2da2a01862af5b87af277cf0/analyses/fusion_filtering/02-fusion-filtering.R#L180-L182

# load expressionMatrix RDS for expression based filtering for less than given threshold
expressionMatrix<-readRDS(expressionMatrix)
expressionMatrix <- expressionMatrix %>% rownames_to_column(var="GeneSymbol")

https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/5809781824c4804a2da2a01862af5b87af277cf0/analyses/fusion_filtering/03-Calc-zscore-annotate.R#L85-L96

  expressionMatrixMatched <- expressionMatrix %>%
    unique() %>%
    # means for each row per each gene_id
    dplyr::mutate(means = rowMeans(select(.,-GeneSymbol))) %>%
    # arrange descending mean
    arrange(desc(means)) %>%
    # to keep only first occurence ie. max rowMean per GeneSymbol
    distinct(GeneSymbol, .keep_all = TRUE) %>%
    ungroup() %>%
    dplyr::select(-means,-gene_id,-EnsembleID) %>%
    dplyr::filter(GeneSymbol %in% normData$GeneSymbol) %>%
    tibble::column_to_rownames("GeneSymbol")

runjin326 commented 3 years ago

Gotcha! Thanks so much!

runjin326 commented 3 years ago

After changing to cancer_group, do we want to change any of the following filtering parameter? Oncogene annotated fusions do not need to be in both callers to be retained however if these fusions are found in more than 4 histologies we treat them as false calls and remove them. To scavenge back non-oncogenic fusions that are recurrently found uniquely in a broadhistology we kept fusions that were called by both callers and if >2 samples per histology_ called the fusion. We removed the non-oncogenic fusions with genes fused more than 5 times in a samples or found in more than 1 histology as potential artifact.

runjin326 commented 3 years ago

@kgaonkar6 @jharenza Additionally, here is another parameter: Identifies recurrent fusions and genes that are recurrently observed in fusions. We identified RNA-seq samples that can be used independently for each patient. After the selection of samples we identify which fusions and genes are recurrent (found in >3 participants per histology) in our fusion-putative-oncogenic.tsv dataset.

runjin326 commented 3 years ago

@kgaonkar6 For the two references files, genelistreference.txt and fusionreference.txt. Do you have them or should we regenerate them? If the latter, do you have the original starting files?

jharenza commented 3 years ago

@kgaonkar6 For the two references files, genelistreference.txt and fusionreference.txt. Do you have them or should we regenerate them? If the latter, do you have the original starting files?

They are in this folder: https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/analyses/fusion_filtering/references

jharenza commented 3 years ago

No, we should not change any parameters for this right now. Let's see how the results look.

runjin326 commented 3 years ago

Gotcha! Thanks!

runjin326 commented 3 years ago

@kgaonkar6, do you know which file is the original GTeX FPKM file that we can use to generate AdrenalGland_FPKM_hg38_matrix.txt.zip (counterpart of Brain_FPKM_hg38_matrix.txt.zip)?

jharenza commented 3 years ago

Those samples are in the expression TPM file in this repo- may also be worth regenerating the brain tpm zscores since we have more samples in gtex now.

You'll find the tissues under "gtex_group"

kgaonkar6 commented 3 years ago

right! the filename in v5 is gene-expression-rsem-tpm-collapsed.rds and gtex_group is in histologies.tsv

runjin326 commented 3 years ago

@kgaonkar6, just to clarify, so we use gene-expression-rsem-tpm-collapsed.rds as an input for expressionMatrix and by filtering this list by gtex_group, it will be used as normalExpressionMatrix? Also, for the normalExpressionMatrix, we do not need to further filter to just include normal samples right?

runjin326 commented 3 years ago

Those samples are in the expression TPM file in this repo- may also be worth regenerating the brain tpm zscores since we have more samples in gtex now.

Sure I will generate brain tpm zscores as well You'll find the tissues under "gtex_group"

kgaonkar6 commented 3 years ago

@kgaonkar6, just to clarify, so we use gene-expression-rsem-tpm-collapsed.rds as an input for expressionMatrix and by filtering this list by gtex_group, it will be used as normalExpressionMatrix?

That's right! expressionMatrix will have all tumor Kids_First_Biospecimen_IDs from cohort == GMKF | CBTN | PNOC & sample_type=="Tumor" and normalExpressionMatrix will have all cohort=="GTEX" & gtex_group=="Adrenal Gland"

Also, for the normalExpressionMatrix, we do not need to further filter to just include normal samples right?

You will already be filtering it with gtex ids as mentioned above so they will all be Normal

runjin326 commented 3 years ago

@kgaonkar6 Thanks!~ I think I need to understand the structures/composition of all the files better ;)

runjin326 commented 3 years ago

@kgaonkar6, @jharenza Never mind, I figure this out!

runjin326 commented 3 years ago

@jharenza @kgaonkar6, so I finished modifying all the codes and I was just trying to run the outputs. However, I am stuck on the last step of script 03 - ` expression_annotated_fusions <- fusion_sample_gene_df %>%

join the filtered expression values to the data frame keeping track of symbols

# for each sample-fusion name pair
dplyr::left_join(expression_long_df, by = c("Sample", "GeneSymbol")) %>% ...`
I will try splitting the df and iterate through the df list to see whether that works

kgaonkar6 commented 3 years ago

Ok yeah, you can try with a subset of expressionMatrix to see if the code runs.

Since your local Rstudio crashes I would assume it's a memory issue and we will need to set up an aws ec2 instance to run the code. Please go through the steps here https://www.notion.so/d3b/Accessing-kf-strides-569e6a853d5c47c69a3ac00ccd7a89e0#603ff13a021341a6be03620059be327c and please let me know if you have issues/questions and we can figure it out.

runjin326 commented 3 years ago

@kgaonkar6 Thanks! Unfortunately I don't have a CHOP issued device yet so I can't VPN into CHOP or sudo anything. I will run the code in pieces locally for now and leave the original code as is then. :)

jharenza commented 3 years ago

closing with PRs 46, 37, 38, 39, 40, 41, 42

d3b-center / ticket-tracker-OPC

Proposed Analysis: Run fusion modules for GMKF NBL #7

What are the scientific goals of the analysis?

What methods do you plan to use to accomplish the scientific goals?

What input data are required for this analysis?

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

Who will complete the analysis (please add a GitHub handle here if relevant)?

join the filtered expression values to the data frame keeping track of symbols