Independent samples reselect

runjin326 commented 2 years ago

Purpose/implementation Section

What scientific question is your analysis addressing?

@jharenza noticed that the number of samples in the forest plots is higher than expected. I realized that I did not filter to stranded libraries and that I called distinct on sample_id, when it should have been Kids_First_Participant_ID. Hence, this module needs an update.

What was your approach?

I first filter to RNA primary, stranded samples:

histologies_rna <- readr::read_tsv(metadata_file, guess_max = 10000) %>%
  dplyr::filter(composition == "Solid Tissue" &
                tumor_descriptor == "Initial CNS Tumor" &
                experimental_strategy == "RNA-Seq" &
                RNA_library == "stranded") %>%
  dplyr::rename(Kids_First_Biospecimen_ID_RNA = Kids_First_Biospecimen_ID) %>%
  dplyr::select(Kids_First_Participant_ID, Kids_First_Biospecimen_ID_RNA,
                cancer_group, OS_status, OS_days, PFS_days) %>%
  dplyr::distinct() %>%
  dplyr::arrange(Kids_First_Biospecimen_ID_RNA)

Then I use this as the base and join tp53_scores and tel_scores to it.

I then call distinct on Kids_First_Participant_ID to select independent samples:

meta_indep <- histologies_rna %>%
  dplyr::left_join(tp53_scores) %>%
  dplyr::left_join(tel_scores) %>%
  # sort first so the row kept for each participant is deterministic
  dplyr::arrange(Kids_First_Biospecimen_ID_RNA) %>%
  dplyr::distinct(Kids_First_Participant_ID, .keep_all = TRUE)
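
To illustrate why the deduplication key matters, here is a minimal sketch on toy data. The IDs and rows are invented for illustration; only the two column names come from the code above. It relies on `distinct(..., .keep_all = TRUE)` keeping the first row it encounters for each key, which is why the sort comes first.

```r
library(dplyr)

# Toy metadata: participant PT_1 has two RNA biospecimens (hypothetical IDs)
toy <- data.frame(
  Kids_First_Participant_ID = c("PT_1", "PT_1", "PT_2"),
  Kids_First_Biospecimen_ID_RNA = c("BS_B", "BS_A", "BS_C"),
  stringsAsFactors = FALSE
)

# Deduplicating on the biospecimen ID keeps every row, so PT_1 is counted twice
n_by_biospecimen <- toy %>%
  distinct(Kids_First_Biospecimen_ID_RNA, .keep_all = TRUE) %>%
  nrow()

# Deduplicating on the participant ID keeps one row per participant;
# after the sort, the lowest-sorting biospecimen ID is the one retained
toy_indep <- toy %>%
  arrange(Kids_First_Biospecimen_ID_RNA) %>%
  distinct(Kids_First_Participant_ID, .keep_all = TRUE)
```

Here `n_by_biospecimen` counts biospecimens rather than participants, which is the kind of overcount described in the purpose section, while `toy_indep` has one row per participant.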

What GitHub issue does your pull request address?

N/A

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Please check to see whether the changes/methods make sense.

Is there anything that you want to discuss further?

Currently, I call independent samples within the module - but the question is:

Do we want to update the independent-samples module to add a primary-only RNA-seq sample list to the data release, and then use that file to select independent samples?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes.

Results

What types of results are included (e.g., table, figure)?

Figures.

What is your summary of the results?

HGG status and telomerase scores are significantly associated with hazard.

A total of 696 independent participants with primary, stranded RNA-Seq samples are present:

v21 %>%
filter(experimental_strategy == "RNA-Seq" & 
       tumor_descriptor == "Initial CNS Tumor" &
       composition == "Solid Tissue" & 
       RNA_library == "stranded") %>%
pull(Kids_First_Participant_ID) %>%
unique() %>%
length()
[1] 696

This is consistent with the number I got.

Reproducibility Checklist

[x] The dependencies required to run the code in this pull request have been added to the project Dockerfile.
[x] This analysis has been added to continuous integration.

Documentation Checklist

[ ] This analysis module has a README and it is up to date.
[ ] This analysis is recorded in the table in analyses/README.md and the entry is up to date.
[ ] The analytical code is documented and contains comments.

jharenza commented 2 years ago

Hi @runjin326 - thank you for the quick update! The numbers look good to me.

I have a question for @jaclyn-taroni and @envest. I am thinking the next step could be to split the histologies by nonHGG and HGG and create forest plots for tp53 and telomerase scores. I think that the significant effect of telomerase scores we are seeing in the current plot is largely driven by HGG samples having universally high scores and poor survival. The question we are interested in, outside of HGG samples, which we already know to have these high scores and low survival, is: do high tp53 scores improve OS when telomerase scores are high, but I am not sure this can precisely be answered with this analysis. It looks as though we can simply say tp53 scores or telomerase scores have an effect, is that right?

envest commented 2 years ago

@jharenza wrote:

I am thinking the next step could be to split the histologies by nonHGG and HGG and create forest plots for tp53 and telomerase scores. I think that the significant effect of telomerase scores we are seeing in the current plot is largely driven by HGG samples having universally high scores and poor survival.

@jaclyn-taroni and I talked over some options, and splitting the one model into two models by nonHGG and HGG is one good way to go. Another option is to add one or more interaction terms to the model (these could be coded as TP53*hgg_status and Telomerase*hgg_status). Each approach has pros and cons:

strategy	pros	cons
split models	simpler to interpret	lose HGG:nonHGG HR, fewer overall samples in each model
interaction terms	fewer models to present	model term significance less meaningful
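
As a rough sketch of how the two strategies might look in code, the example below fits both on simulated data with `survival::coxph`. This is not the module's model code: the data are simulated, and every column name (`tp53_score`, `telomerase_score`, `is_hgg`, `OS_event`) is a hypothetical stand-in.

```r
library(survival)

set.seed(1)
n <- 300

# Simulated data; all column names here are hypothetical
toy <- data.frame(
  tp53_score = runif(n),
  telomerase_score = runif(n),
  is_hgg = rbinom(n, 1, 0.3)
)
# Survival times whose hazard rises with telomerase score and HGG status
toy$OS_days <- rexp(
  n,
  rate = 0.001 * exp(0.8 * toy$telomerase_score + 1.0 * toy$is_hgg)
)
# Event indicator: 1 = deceased, 0 = censored
toy$OS_event <- rbinom(n, 1, 0.7)

# Strategy 1: split models, one per group
fit_hgg <- coxph(Surv(OS_days, OS_event) ~ tp53_score + telomerase_score,
                 data = subset(toy, is_hgg == 1))
fit_non_hgg <- coxph(Surv(OS_days, OS_event) ~ tp53_score + telomerase_score,
                     data = subset(toy, is_hgg == 0))

# Strategy 2: one model with score-by-group interaction terms
fit_interaction <- coxph(
  Surv(OS_days, OS_event) ~ tp53_score * is_hgg + telomerase_score * is_hgg,
  data = toy
)

# Hazard ratios are the exponentiated coefficients
hr_interaction <- exp(coef(fit_interaction))
```

The split models each estimate two score coefficients but drop the HGG term entirely, while the interaction model estimates five terms from all samples at once, which mirrors the trade-offs in the table above.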

The question we are interested in, outside of HGG samples, which we already know to have these high scores and low survival, is: do high tp53 scores improve OS when telomerase scores are high, but I am not sure this can precisely be answered with this analysis. It looks as though we can simply say tp53 scores or telomerase scores have an effect, is that right?

Currently, yes: the model interpretation is limited to quantifying the HR associated with a 1-unit change in TP53 score when the Telomerase score is held constant, and vice versa. This answers the question: does an increase in TP53 score improve OS at any level of Telomerase score (low, average, or high)?
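
In Cox proportional hazards terms, and assuming the current model is additive in the two scores (an assumption based on the description here, with any other covariates omitted for brevity), this can be written as:

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_{1}\,\mathrm{TP53} + \beta_{2}\,\mathrm{Telomerase}\right),
\qquad
\mathrm{HR}_{\mathrm{TP53}}
  = \frac{h(t \mid \mathrm{TP53}+1,\ \mathrm{Telomerase})}{h(t \mid \mathrm{TP53},\ \mathrm{Telomerase})}
  = e^{\beta_{1}}
```

Because $e^{\beta_{1}}$ does not depend on the Telomerase score, an additive model of this form estimates a single TP53 effect shared across all Telomerase levels. It cannot show whether that effect differs when Telomerase scores are high.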

jharenza commented 2 years ago

splitting the one model into two models by nonHGG and HGG is one good way to go. Another option is to add one or more interaction terms to the model

@envest shall we go with splitting the models for simplicity? Cc @runjin326

runjin326 commented 2 years ago

@jharenza, I have now added two forest plots, one for the HGAT group and one for the non-HGAT group. Interestingly, only telomerase scores in the non-HGAT group appear to be significantly associated with hazard.

jaclyn-taroni commented 2 years ago

Requested a review from @envest to take a look at the most recent changes. If we get his 👍🏻, happy to get this merged.

AlexsLemonade / OpenPBTA-analysis