Closed kgaonkar6 closed 2 years ago
@kgaonkar6, I am working on this ticket and I have a few clarification questions:
1) For subsetting for PBTA and GMKF, where should we start doing that? Are we starting from calculating the scores or when we are actually doing the modeling? Also, do we do the same analyses for all TARGET samples?
2) Could you please point me to the code where this file, file.path(root_dir, "figures", "palettes", "histology_label_color_table.tsv"), is generated? We need to update that since we have more samples now.
3) For doing the ANOVA and Tukey test, previously we separated poly-A and stranded. Although we now have combined expression, I think it still makes sense to first separate them based on experimental_strategy and then, within each group, do statistics on display_group and harmonized_diagnosis. What do you think? Should we group them by cohort+experimental_strategy and run stats for each combination?
cc: @jharenza for input as well.
- For subsetting for PBTA and GMKF, where should we start doing that? Are we starting from calculating the scores or when we are actually doing the modeling? Also, do we do the same analyses for all TARGET samples?
We want to start from calculating the scores, because we now have new RNA-seq samples from TARGET (which we didn't have when I created the ticket, sorry about that).
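As a rough illustration of subsetting before score calculation, here is a minimal base R sketch. It assumes the combined expression object has genes in rows and biospecimen IDs in columns, and that the histologies table carries Kids_First_Biospecimen_ID and cohort columns. The toy data and the helper name subset_to_cohorts are made up for this example and are not taken from the module.

```r
# Toy stand-in for histologies.tsv; values are invented for illustration.
histologies <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4"),
  cohort = c("PBTA", "GMKF", "TARGET", "PBTA"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the combined expression file
# (assumed layout: genes in rows, biospecimen IDs in columns).
expression <- matrix(
  1:8,
  nrow = 2,
  dimnames = list(c("GENE_A", "GENE_B"), c("BS_1", "BS_2", "BS_3", "BS_4"))
)

# Hypothetical helper: keep only the columns whose biospecimen ID
# belongs to one of the requested cohorts.
subset_to_cohorts <- function(expression, histologies, cohorts) {
  keep_ids <- histologies$Kids_First_Biospecimen_ID[histologies$cohort %in% cohorts]
  expression[, colnames(expression) %in% keep_ids, drop = FALSE]
}

pbta_gmkf_expression <- subset_to_cohorts(expression, histologies, c("PBTA", "GMKF"))
colnames(pbta_gmkf_expression)
```

The same helper could be called once per cohort if separate score tables are wanted for each.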
- Could you please point me to the code where this file, file.path(root_dir, "figures", "palettes", "histology_label_color_table.tsv"), is generated? We need to update that since we have more samples now.
The file is here: https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/84c6e0a64ac34f8b76a7bc4559f2f21be95e4f50/figures/palettes/histology_label_color_table.tsv. We use the rprojroot package to find root_dir in most R scripts; it points to the main folder of the git repo: root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))
- For doing the ANOVA and Tukey test, previously we separated poly-A and stranded. Although we now have combined expression, I think it still makes sense to first separate them based on experimental_strategy and then, within each group, do statistics on display_group and harmonized_diagnosis. What do you think? Should we group them by cohort+experimental_strategy and run stats for each combination?
Currently the subtyping modules only use the GSVA scores directly, here: https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/84c6e0a64ac34f8b76a7bc4559f2f21be95e4f50/analyses/molecular-subtyping-EPN/run-molecular-subtyping-EPN.sh#L25. The 02 script is used to evaluate the scores, so maybe we can generate tables for combined, stranded, and poly-A separately without much code update? I believe the ANOVA + Tukey test with combined RNA-seq input should be OK to use, since the gene set variation scores are per sample, but we will have to see if we want to evaluate more.
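To make the "combined, stranded, and poly-A separately" idea concrete, here is a minimal base R sketch using aov() and TukeyHSD() from the stats package. The scores are simulated, the column name gsea_score and the helper run_anova_tukey are hypothetical, and experimental_strategy is used only because that is the column named in this thread. This is not the code of the 02 script.

```r
set.seed(1)

# Simulated per-sample scores for a single gene set (illustration only).
scores <- data.frame(
  gsea_score = rnorm(60),
  display_group = rep(c("A", "B", "C"), times = 20),
  experimental_strategy = rep(c("stranded", "polya"), each = 30),
  stringsAsFactors = TRUE  # TukeyHSD() needs the grouping variable as a factor
)

# Hypothetical helper: one-way ANOVA across display_group,
# followed by Tukey's HSD for the pairwise comparisons.
run_anova_tukey <- function(df) {
  fit <- aov(gsea_score ~ display_group, data = df)
  list(
    anova = summary(fit)[[1]],
    tukey = TukeyHSD(fit)$display_group
  )
}

# Same test on the combined input and on each library type separately.
results <- list(
  combined = run_anova_tukey(scores),
  stranded = run_anova_tukey(scores[scores$experimental_strategy == "stranded", ]),
  polya = run_anova_tukey(scores[scores$experimental_strategy == "polya", ])
)

rownames(results$combined$tukey)
```

Running the combined input through the same function as the subsets keeps the three tables directly comparable.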
cc: @jharenza for input as well.
@kgaonkar6, thanks for answering the questions. For number 2, I found the file. The issue is that the file only has ~2000 lines, and I believe it was for a previous cohort? We now have 35827 samples and, per my understanding, we need to regenerate the file. I am just wondering whether you know where the code that generates the output is?
I guess for my question #3, I was just confused since, from these lines, it seemed like the number of samples would impact the levels of ANOVA, but I will go ahead and generate the following and go from there:
So now we will separate into three cohorts (PBTA, GMKF, TARGET) and generate 3 GSVA score tables, and from there we generate:
gsva_anova_PBTA_stranded_display_group.tsv
gsva_anova_PBTA_polya_display_group.tsv
gsva_anova_PBTA_combined_display_group.tsv
(not rbind of the previous 2 files but from taking the combined as input)
gsva_anova_GMKF_stranded_display_group.tsv
gsva_anova_GMKF_polya_display_group.tsv
gsva_anova_GMKF_combined_display_group.tsv
(not rbind of the previous 2 files but from taking the combined as input)
gsva_anova_TARGET_stranded_display_group.tsv
gsva_anova_TARGET_polya_display_group.tsv
gsva_anova_TARGET_combined_display_group.tsv
(not rbind of the previous 2 files but from taking the combined as input)
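Assuming the intent is one output file per cohort and library combination, the nine file names can be built programmatically rather than typed by hand, which avoids copy-paste slips. This is only a sketch of the naming scheme in base R; the variable names are illustrative.

```r
cohorts <- c("PBTA", "GMKF", "TARGET")
libraries <- c("stranded", "polya", "combined")

# expand.grid() varies its first argument fastest, so the order is
# stranded, polya, combined within each cohort.
grid <- expand.grid(library = libraries, cohort = cohorts, stringsAsFactors = FALSE)

output_files <- paste0(
  "gsva_anova_", grid$cohort, "_", grid$library, "_display_group.tsv"
)
output_files
```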
Oh sorry, I misunderstood the question. The code that originally created the file is in https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/figures/mapping-histology-labels.Rmd, but now we will have to use cancer_group instead of display_group.
@kgaonkar6, thanks! I think I will open a separate issue to generate this file, complete that and move on to this step (just to make each PR smaller).
Thanks!
This has me thinking that we will also need to update the display_group to cancer_group in OpenPBTA. I think we should discuss the order of PRs related to the cancer_group update in each repo with @jharenza.
Yes, that is true, and we need more color codes. I think we will still do N >= 5 for plots, too.
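If N >= 5 is read as "at least five samples per cancer_group" (an assumption, since the thread does not spell out what N counts), the filter could look like this base R sketch with invented data:

```r
# Invented sample table: group A has 6 samples, B has 3, C has 5.
samples <- data.frame(
  cancer_group = c(rep("A", 6), rep("B", 3), rep("C", 5)),
  stringsAsFactors = FALSE
)

# Count samples per group and keep only groups with at least 5.
group_counts <- table(samples$cancer_group)
groups_to_plot <- names(group_counts)[group_counts >= 5]

plot_samples <- samples[samples$cancer_group %in% groups_to_plot, , drop = FALSE]
unique(plot_samples$cancer_group)
```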
What analysis module should be updated and why?
Update gene-set-enrichment-analysis for OT filename updates.
What changes need to be made? Please provide enough detail for another participant to make the update.
- pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds to pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds. But now we have the combined file in gene-expression-rsem-tpm-collapsed.rds; please be sure to subset the file to Kids_First_Biospecimen_IDs that belong to the PBTA and GMKF cohorts within the module.
- pbta-histologies.tsv to histologies.tsv
What input data should be used? Which data were used in the version being updated?
v6 gene-expression-rsem-tpm-collapsed.rds
When do you expect the revised analysis will be completed?
1-2 days
Who will complete the updated analysis?