Updated analysis: gene-set-enrichment-analysis for OT subtyping

kgaonkar6 commented 3 years ago

What analysis module should be updated and why?

Update gene-set-enrichment-analysis for OT filename updates

What changes need to be made? Please provide enough detail for another participant to make the update.

Currently the scripts use pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds. We now have the combined file gene-expression-rsem-tpm-collapsed.rds, so please be sure to subset it to the Kids_First_Biospecimen_IDs that belong to the PBTA and GMKF cohorts within the module.
Change each occurrence of pbta-histologies.tsv to histologies.tsv.
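
The subsetting step requested above could look roughly like the sketch below. It is only an illustration under assumptions: the expression object is taken to be a genes x samples matrix whose column names are Kids_First_Biospecimen_IDs, and the histologies table is taken to have a `cohort` column. The toy data stand in for gene-expression-rsem-tpm-collapsed.rds and histologies.tsv.

```r
# Toy stand-ins (hypothetical values); in the module these would be read
# from gene-expression-rsem-tpm-collapsed.rds and histologies.tsv.
expression <- matrix(
  c(1.2, 0.0, 3.4, 5.6, 2.1, 0.7, 8.9, 4.3),
  nrow = 2,
  dimnames = list(c("GENE_A", "GENE_B"), c("BS_1", "BS_2", "BS_3", "BS_4"))
)
histologies <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4", "BS_5"),
  cohort = c("PBTA", "GMKF", "TARGET", "PBTA", "PBTA"),  # assumed column name
  stringsAsFactors = FALSE
)

# Biospecimen IDs that belong to the cohorts of interest
keep_ids <- histologies$Kids_First_Biospecimen_ID[
  histologies$cohort %in% c("PBTA", "GMKF")
]

# Keep only the sample columns whose IDs are in keep_ids;
# drop = FALSE keeps the result a matrix even if one column remains
expression_subset <- expression[, colnames(expression) %in% keep_ids, drop = FALSE]
```

Filtering on `colnames(expression) %in% keep_ids` rather than indexing by `keep_ids` directly avoids an error when the histologies table lists samples (such as DNA-only samples) that have no column in the expression matrix.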

What input data should be used? Which data were used in the version being updated?

v6 gene-expression-rsem-tpm-collapsed.rds

When do you expect the revised analysis will be completed?

1-2 days

Who will complete the updated analysis?

runjin326 commented 3 years ago

@kgaonkar6, I am working on this ticket and I have a few clarification questions:

1) For subsetting for PBTA and GMKF, where should we start doing that? Are we starting from calculating the scores or when we are actually doing the modeling? Also, do we do the same analyses for all TARGET samples?

2) Could you please point me to the code where this file file.path(root_dir, "figures", "palettes", "histology_label_color_table.tsv") is generated? We need to update that since we have more samples now.

3) For doing the ANOVA and Tukey test, previously we separated poly-A and stranded. Although we now have combined expression, I think it still makes sense to first separate them based on experimental_strategy and then, within each group, do statistics on display_group and harmonized_diagnosis. What do you think? Should we group them by cohort+experimental_strategy and run stats for each combination?

cc: @jharenza for input as well.

kgaonkar6 commented 3 years ago

@kgaonkar6, I am working on this ticket and I have a few clarification questions:

For subsetting for PBTA and GMKF, where should we start doing that? Are we starting from calculating the scores or when we are actually doing the modeling? Also, do we do the same analyses for all TARGET samples?

We want to start from calculating the scores because we now have new RNA-seq samples from TARGET (which we didn't have when I created the ticket, sorry about that).

Could you please point me to the code where this file file.path(root_dir, "figures", "palettes", "histology_label_color_table.tsv") is generated? We need to update that since we have more samples now.

The file is here: https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/84c6e0a64ac34f8b76a7bc4559f2f21be95e4f50/figures/palettes/histology_label_color_table.tsv In most R scripts we use the rprojroot package to find root_dir, which points to the main folder of the git repo:
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))

For doing the ANOVA and Tukey test, previously we separated poly-A and stranded. Although we now have combined expression, I think it still makes sense to first separate them based on experimental_strategy and then, within each group, do statistics on display_group and harmonized_diagnosis. What do you think? Should we group them by cohort+experimental_strategy and run stats for each combination?

Currently the subtyping modules only use the GSVA scores directly, here: https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/84c6e0a64ac34f8b76a7bc4559f2f21be95e4f50/analyses/molecular-subtyping-EPN/run-molecular-subtyping-EPN.sh#L25 The 02 script is used to evaluate the scores, so maybe we can generate tables for combined, stranded and polya separately without much code update? I believe the ANOVA + Tukey test with combined RNA-seq input should be OK to use since the gene set variation scores are per sample, but we will have to see if we want to evaluate more.
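
For reference, an ANOVA followed by a Tukey test on per-sample scores can be sketched with base R as below. This is a minimal illustration, not the module's 02 script: the data are simulated, the column names `gsva_score` and `display_group` are assumed, and it covers a single gene set, whereas the module would repeat this for each gene set.

```r
set.seed(1)

# Simulated per-sample scores for one gene set across three groups
# (hypothetical data; real scores would come from the GSVA step)
scores <- data.frame(
  gsva_score = c(rnorm(6, 0.3, 0.1), rnorm(6, -0.2, 0.1), rnorm(6, 0.0, 0.1)),
  display_group = factor(rep(c("A", "B", "C"), each = 6))
)

# One-way ANOVA: does the mean score differ between groups?
fit <- aov(gsva_score ~ display_group, data = scores)
anova_table <- summary(fit)[[1]]

# Tukey HSD: which pairs of groups differ (one row per pair)
tukey_table <- as.data.frame(TukeyHSD(fit)$display_group)
```

Running the same code on a stranded-only, polya-only, or combined subset only changes which rows of `scores` are passed in, which is why separate tables per library type should need little extra code.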

cc: @jharenza for input as well.

runjin326 commented 3 years ago

@kgaonkar6, thanks for answering the questions. For number 2, I found the file. The issue is that the file only has ~2000 lines and I believe it was for the previous cohort? We now have 35827 samples and we need to re-generate the file, per my understanding. I am just wondering whether you know where the code that generates the output is?

I guess for my question #3, I was just confused since from these lines, it seemed like the number of samples would impact the levels of ANOVA, but I will go ahead and generate the following and go from there. We will separate into three cohorts (PBTA, GMKF, TARGET) and generate 3 GSVA score tables, and from those we generate:

gsva_anova_PBTA_stranded_display_group.tsv
gsva_anova_PBTA_polya_display_group.tsv
gsva_anova_PBTA_combined_display_group.tsv (not rbind of the previous 2 files but from taking the combined as input)
gsva_anova_GMKF_stranded_display_group.tsv
gsva_anova_GMKF_polya_display_group.tsv
gsva_anova_GMKF_combined_display_group.tsv (not rbind of the previous 2 files but from taking the combined as input)
gsva_anova_TARGET_stranded_display_group.tsv
gsva_anova_TARGET_polya_display_group.tsv
gsva_anova_TARGET_combined_display_group.tsv (not rbind of the previous 2 files but from taking the combined as input)
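
One way to enumerate these output names without typing each by hand is sketched below. It assumes every cohort gets its own stranded, polya, and combined table, with the cohort name in each file name; the variable names are hypothetical.

```r
cohorts <- c("PBTA", "GMKF", "TARGET")
library_types <- c("stranded", "polya", "combined")

# expand.grid varies its first argument fastest, so the rows are ordered
# as all library types for PBTA, then GMKF, then TARGET
grid <- expand.grid(
  library_type = library_types,
  cohort = cohorts,
  stringsAsFactors = FALSE
)

output_files <- paste0(
  "gsva_anova_", grid$cohort, "_", grid$library_type, "_display_group.tsv"
)
```

Each row of `grid` could then drive one filter-and-test pass, with the "combined" rows taking the whole cohort as input instead of binding the stranded and polya results together.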

kgaonkar6 commented 3 years ago

@kgaonkar6, thanks for answering the questions. For number 2, I found the file. The issue is that the file only has ~2000 lines and I believe it was for the previous cohort? We now have 35827 samples and we need to re-generate the file, per my understanding. I am just wondering whether you know where the code that generates the output is?

Oh sorry, I misunderstood the question. The code that originally created the file is in https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/figures/mapping-histology-labels.Rmd but now we will have to use cancer_group instead of display_group.

runjin326 commented 3 years ago

@kgaonkar6, thanks! I think I will open a separate issue to generate this file, complete that and move on to this step (just to make each PR smaller).

kgaonkar6 commented 3 years ago

Thanks!

This has me thinking that we will also need to update the display_group to cancer_group in OpenPBTA. I think we should discuss the order of PRs related to the cancer_group update in each repo with @jharenza.

jharenza commented 3 years ago

This has me thinking that we will also need to update the display_group to cancer_group in OpenPBTA. I think we should discuss the order of PRs related to the cancer_group update in each repo with @jharenza.

Yes, that is true, and we need more color codes. I think we will still do N >= 5 for plots, too.
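
A rough sketch of the N >= 5 rule combined with one color per remaining group is shown below. It is an assumption-laden illustration, not the palette code from mapping-histology-labels.Rmd: the toy `cancer_group` values are made up, and `hcl.colors()` is used only as a convenient way to get distinct colors.

```r
# Hypothetical histologies table with one row per sample
histologies <- data.frame(
  cancer_group = c(rep("Group A", 6), rep("Group B", 5), rep("Group C", 2)),
  stringsAsFactors = FALSE
)

# Count samples per group and keep groups with at least 5 samples
group_counts <- table(histologies$cancer_group)
plot_groups <- names(group_counts)[group_counts >= 5]
plot_data <- histologies[histologies$cancer_group %in% plot_groups, , drop = FALSE]

# Assign one color per retained group (named vector: group -> hex color)
group_colors <- setNames(grDevices::hcl.colors(length(plot_groups)), plot_groups)
```

Applying the threshold before assigning colors keeps the number of color codes needed to the groups that will actually be plotted.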

runjin326 commented 2 years ago

Closing with PR118 merged

d3b-center / ticket-tracker-OPC