Closed afarrel closed 3 years ago
As I was trying to do a test run of the module, there is some discrepancies between the data files we have and the files that are called in the .sh: Files not present: polya_expression_file="${data_path}/pbta-gene-expression-rsem-fpkm.polya.rds" stranded_expression_file="${data_path}/pbta-gene-expression-rsem-fpkm.stranded.rds" normal_expression_file="${references_path}/Brain_FPKM_hg38_matrix.txt.zip" independent_samples_file="${data_path}/independent-specimens.wgswxs.primary-plus.tsv" histologies_file="${data_path}/pbta-histologies-base.tsv"
Files with name discrepancies: arriba_file="${data_path}/pbta-fusion-arriba.tsv.gz" starfusion_file="${data_path}/pbta-fusion-starfusion.tsv.gz" histologies_file="${data_path}/pbta-histologies.tsv"
Should we just modify the .sh to match the files in /data folder and ignore the files not present?
Additionally, I have the following questions: For changing project-specific filtering, we need to change values for "group" in 04 as well as related codes in 06 as well, correct?
For running fusion_filtering module by cohort type, could we use a filtered histologies.csv e.g., histologies_pnoc.tsv to run the module?
As I was trying to do a test run of the module, there is some discrepancies between the data files we have and the files that are called in the .sh: Files not present: polya_expression_file="${data_path}/pbta-gene-expression-rsem-fpkm.polya.rds" stranded_expression_file="${data_path}/pbta-gene-expression-rsem-fpkm.stranded.rds"
these are now combined into one file as gene-expression-rsem-fpkm.rds
normal_expression_file="${references_path}/Brain_FPKM_hg38_matrix.txt.zip"
for neuroblastoma, we will have to generate this file for the GTEX adrenal gland - we could either make a new ticket to create this or you can make staggered PRs, with this being one of the first PRs you submit.
independent_samples_file="${data_path}/independent-specimens.wgswxs.primary-plus.tsv"
this, we also have to generate per #37 - @kgaonkar6 was going to work on this, but you can also feel free to take a crack at it if you would like
histologies_file="${data_path}/pbta-histologies-base.tsv"
this is now just histologies.tsv
Files with name discrepancies: arriba_file="${data_path}/pbta-fusion-arriba.tsv.gz" starfusion_file="${data_path}/pbta-fusion-starfusion.tsv.gz" histologies_file="${data_path}/pbta-histologies.tsv"
these all now do not have the pbta-
prefix
Should we just modify the .sh to match the files in /data folder and ignore the files not present?
I think we should make a new shell script for openTargets so that we maintain the OpenPBTA files - thoughts @kgaonkar6 ?
Additionally, I have the following questions: For changing project-specific filtering, we need to change values for "group" in 04 as well as related codes in 06 as well, correct?
yes
For running fusion_filtering module by cohort type, could we use a filtered histologies.csv e.g., histologies_pnoc.tsv to run the module?
It looks like the code can just take the file without filtering (here, histologies.tsv
).
Yes I meant if we want to run the module for cohort == GMKF | CBTN | PNOC, do we run separate analyses? Or how should we do it?
And for the expression matrix, since these are now combined into one file as gene-expression-rsem-fpkm.rds, we will need to modify the codes for that as well, correct?
Yes I meant if we want to run the module for cohort == GMKF | CBTN | PNOC, do we run separate analyses? Or how should we do it?
No need to separate - I just meant that we don't want to include GTEX or TARGET or TCGA at this time, however, they will not be in the fusion files, so it may be OK to just use the histologies file as-is.
And for the expression matrix, since these are now combined into one file as gene-expression-rsem-fpkm.rds, we will need to modify the codes for that as well, correct?
yes
Oh gotcha thanks that makes sense! For generating FPKM matrix for GTEX adrenal gland, could you please give a brief description as to how to do it? Sorry about all the questions!
@runjin326 I think you can use gene-expression-rsem-tpm-collapsed.rds
from the v5 release download-data.sh currently in the PR instead.
However this file is slightly different than the one we've used before. The v5 file only has rownames X sample as a matrix. The existing code in the repo had a matrix where 1st row was EnsembleID_GeneSymbol
.
So you will need a few code updates like: https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/5809781824c4804a2da2a01862af5b87af277cf0/analyses/fusion_filtering/02-fusion-filtering.R#L180-L182
# load expressionMatrix RDS for expression based filtering for less than given threshold
expressionMatrix<-readRDS(expressionMatrix)
expressionMatrix <- expressionMatrix %>% rownames_to_column(var="GeneSymbol")
expressionMatrixMatched <- expressionMatrix %>%
unique() %>%
# means for each row per each gene_id
dplyr::mutate(means = rowMeans(select(.,-GeneSymbol))) %>%
# arrange descending mean
arrange(desc(means)) %>%
# to keep only first occurence ie. max rowMean per GeneSymbol
distinct(GeneSymbol, .keep_all = TRUE) %>%
ungroup() %>%
dplyr::select(-means,-gene_id,-EnsembleID) %>%
dplyr::filter(GeneSymbol %in% normData$GeneSymbol) %>%
tibble::column_to_rownames("GeneSymbol")
Gotcha! Thanks so much!
After changing to cancer_group
, do we want to change any of the following filtering parameter?
Oncogene annotated fusions do not need to be in both callers to be retained however if these fusions are found in more than 4 histologies we treat them as false calls and remove them.
To scavenge back non-oncogenic fusions that are recurrently found uniquely in a broadhistology we kept fusions that were called by both callers and if >2 samples per histology_ called the fusion.
We removed the non-oncogenic fusions with genes fused more than 5 times in a samples or found in more than 1 histology as potential artifact.
@kgaonkar6 @jharenza Additionally, here is another parameter:
Identifies recurrent fusions and genes that are recurrently observed in fusions. We identified RNA-seq samples that can be used independently for each patient. After the selection of samples we identify which fusions and genes are recurrent (found in >3 participants per histology) in our fusion-putative-oncogenic.tsv
dataset.
@kgaonkar6 For the two references files, genelistreference.txt
and fusionreference.txt
. Do you have them or should we regenerate them? If the latter, do you have the original starting files?
@kgaonkar6 For the two references files,
genelistreference.txt
andfusionreference.txt
. Do you have them or should we regenerate them? If the latter, do you have the original starting files?
They are in this folder: https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/analyses/fusion_filtering/references
No, we should not change any parameters for this right now. Let's see how the results look.
Gotcha! Thanks!
@kgaonkar6, do you know which file is the original GTeX FPKM file that we can use to generate AdrenalGland_FPKM_hg38_matrix.txt.zip
(counterpart of Brain_FPKM_hg38_matrix.txt.zip
)?
Those samples are in the expression TPM file in this repo- may also be worth regenerating the brain tpm zscores since we have more samples in gtex now.
You'll find the tissues under "gtex_group"
right! the filename in v5 is gene-expression-rsem-tpm-collapsed.rds
and gtex_group is in histologies.tsv
@kgaonkar6, just to clarify, so we use gene-expression-rsem-tpm-collapsed.rds
as an input for expressionMatrix
and by filtering this list by gtex_group, it will be used as normalExpressionMatrix
? Also, for the normalExpressionMatrix
, we do not need to further filter to just include normal samples right?
Those samples are in the expression TPM file in this repo- may also be worth regenerating the brain tpm zscores since we have more samples in gtex now.
Sure I will generate brain tpm zscores as well You'll find the tissues under "gtex_group"
@kgaonkar6, just to clarify, so we use
gene-expression-rsem-tpm-collapsed.rds
as an input forexpressionMatrix
and by filtering this list by gtex_group, it will be used asnormalExpressionMatrix
?
That's right!
expressionMatrix
will have all tumor Kids_First_Biospecimen_IDs from cohort == GMKF | CBTN | PNOC & sample_type=="Tumor"
and normalExpressionMatrix
will have all cohort=="GTEX" & gtex_group=="Adrenal Gland"
Also, for the normalExpressionMatrix, we do not need to further filter to just include normal samples right?
You will already be filtering it with gtex ids as mentioned above so they will all be Normal
@kgaonkar6 Thanks!~ I think I need to understand the structures/composition of all the files better ;)
@kgaonkar6, @jharenza Never mind, I figure this out!
@jharenza @kgaonkar6, so I finished modifying all the codes and I was just trying to run the outputs. However, I am stuck on the last step of script 03 - ` expression_annotated_fusions <- fusion_sample_gene_df %>%
# for each sample-fusion name pair
dplyr::left_join(expression_long_df, by = c("Sample", "GeneSymbol")) %>% ...`
I will try splitting the df and iterate through the df list to see whether that works
Ok yeah, you can try with a subset of expressionMatrix to see if the code runs.
Since your local Rstudio crashes I would assume it's a memory issue and we will need to set up an aws ec2 instance to run the code. Please go through the steps here https://www.notion.so/d3b/Accessing-kf-strides-569e6a853d5c47c69a3ac00ccd7a89e0#603ff13a021341a6be03620059be327c and please let me know if you have issues/questions and we can figure it out.
@kgaonkar6 Thanks! Unfortunately I don't have a CHOP issued device yet so I can't VPN into CHOP or sudo anything. I will run the code in pieces locally for now and leave the original code as is then. :)
closing with PRs 46, 37, 38, 39, 40, 41, 42
What are the scientific goals of the analysis?
Run
fusion_filtering
module forcohort == GMKF | CBTN | PNOC
and run project-specific filtering bycancer_group
instead ofbroad_histology
What methods do you plan to use to accomplish the scientific goals?
Complete these modules to process data and prepare for ingestion into Pediatric Open Targets platform.
What input data are required for this analysis?
How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?
2-3 days
Who will complete the analysis (please add a GitHub handle here if relevant)?
@runjin326