AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Rerun hgg subtyping (7 of N) #1382

Closed jharenza closed 2 years ago

jharenza commented 2 years ago

Purpose/implementation Section

What scientific question is your analysis addressing?

Rerun HGG subtyping.

Is there anything that you want to discuss further?

Three samples were removed from HGG_defining_lesions.tsv because path_dx == SEGA (this is reflected in the v21 hist file, with cancer group == SEGA): 7316-2578, 7316-2171, and 7316-3019. Interestingly, these samples were not in HGG_molecular_subtype.tsv, so they are not showing as removed there.

In HGG_molecular_subtype.tsv, PT_8P368R5B (7316-4998) is being removed, but it is DIPG and has a subtype in v21.
"Brainstem glioma- Diffuse intrinsic pontine glioma" is in the exact dx strings, but this is an RNA-only sample.
PT_8P368R5B is in the list of patients in molecular-subtyping-pathology but had no report, so no subtype was created there.

BS_HE0WJRW6 (7316-1455) was removed and BS_HWGWYCY7 was retained. Both are RNA-Seq, and both are in the v21 subtypes as "To be classified".

> setdiff(master_hgg$Kids_First_Biospecimen_ID_DNA, pr_hgg$Kids_First_Biospecimen_ID_DNA)
character(0)
> setdiff(master_hgg$Kids_First_Biospecimen_ID_RNA, pr_hgg$Kids_First_Biospecimen_ID_RNA)
[1] "BS_SB12W1XT" "BS_FXJY0MNH" "BS_HE0WJRW6" "BS_D7XRFE0R" "BS_KABQQA0T" "BS_FN07P04C" "BS_SHJA4MR0"
> setdiff(master_hgg$Kids_First_Participant_ID, pr_hgg$Kids_First_Participant_ID)
[1] "PT_8P368R5B"
master_hgg$Kids_First_Biospecimen_ID_RNA  sample_id  master_subtype
BS_SB12W1XT                               7316-85    HGG, H3 wildtype
BS_FXJY0MNH                               7316-4998  HGG, To be classified
BS_HE0WJRW6                               7316-1455  HGG, To be classified, TP53 loss
BS_D7XRFE0R                               A18777     DMG, H3 K28, TP53 loss
BS_KABQQA0T                               A16915     DMG, H3 K28, TP53 loss
BS_FN07P04C                               7316-255   HGG, To be classified
BS_SHJA4MR0                               7316-161   HGG, H3 wildtype, TP53 loss
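A table like the one above can be rebuilt by feeding the `setdiff()` result back into the master table. This is a minimal base R sketch on toy data: the `BS_TOY...` IDs, the `toy-...` sample IDs, and the `molecular_subtype` column name are assumptions for illustration, not values from the real files.

```r
# Toy stand-in for the master version of the subtype table (hypothetical IDs).
master_hgg <- data.frame(
  Kids_First_Biospecimen_ID_RNA = c("BS_TOY00001", "BS_TOY00002", "BS_TOY00003", NA),
  sample_id = c("toy-1", "toy-2", "toy-3", "toy-4"),
  molecular_subtype = c("HGG, H3 wildtype", "HGG, To be classified",
                        "DMG, H3 K28", "HGG, H3 wildtype"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the PR version: two RNA samples have dropped out.
pr_hgg <- master_hgg[c(1, 4), ]

# IDs present in master but absent from the PR output.
removed_ids <- setdiff(
  master_hgg$Kids_First_Biospecimen_ID_RNA,
  pr_hgg$Kids_First_Biospecimen_ID_RNA
)

# Pull the full rows so each removed ID is shown with its sample and subtype.
removed_rows <- master_hgg[master_hgg$Kids_First_Biospecimen_ID_RNA %in% removed_ids, ]
print(removed_rows, row.names = FALSE)
```

Rows with an `NA` RNA ID (DNA-only rows) are not reported, because `NA` is present in both versions and `setdiff()` treats it as a match.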

Those samples have been recently removed from the TP53 results file:

> p53 <- read_tsv("~/Documents/GitHub/OpenPBTA-analysis/analyses/tp53_nf1_score/results/tp53_altered_status.tsv")
> setdiff(master_hgg$Kids_First_Biospecimen_ID_RNA, p53$Kids_First_Biospecimen_ID_RNA)
[1] "BS_SB12W1XT" "BS_FXJY0MNH" "BS_HE0WJRW6" "BS_D7XRFE0R" "BS_KABQQA0T" "BS_FN07P04C" "BS_SHJA4MR0"

I found it pretty hard to spot the diffs here, so I propose adding an arrange(Kids_First_Biospecimen_ID_DNA) at the end of the code, just before the file is written. I will make that change after you take a look.

It appears these samples were removed with this PR, which again was really hard to spot because the rows were not arranged before file output.
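The proposed sort could look like the following minimal sketch. The `hgg_subtypes` table, its toy IDs, and the temporary output path are hypothetical stand-ins, not the module's actual objects; only `arrange()` on `Kids_First_Biospecimen_ID_DNA` comes from the proposal above. Using the RNA ID as a second sort key is an added assumption, so that RNA-only rows (where the DNA ID is `NA`) also land in a stable order.

```r
library(dplyr)
library(readr)

# Hypothetical stand-in for the final subtype table, in arbitrary row order.
hgg_subtypes <- tibble(
  Kids_First_Biospecimen_ID_DNA = c("BS_TOY_C", NA, "BS_TOY_A", NA),
  Kids_First_Biospecimen_ID_RNA = c(NA, "BS_TOY_Z", "BS_TOY_B", "BS_TOY_Y"),
  molecular_subtype = c("HGG, H3 wildtype", "HGG, To be classified",
                        "DMG, H3 K28", "HGG, To be classified")
)

# Sort rows deterministically right before writing, so that a rerun only
# changes the lines whose content actually changed. arrange() puts NA last.
sorted_subtypes <- hgg_subtypes %>%
  arrange(Kids_First_Biospecimen_ID_DNA, Kids_First_Biospecimen_ID_RNA)

# Hypothetical output location; the real module writes to its results folder.
output_file <- tempfile(fileext = ".tsv")
write_tsv(sorted_subtypes, output_file)
```

With a stable row order, a removed sample shows up in the file diff as a single deleted line rather than being hidden among reshuffled rows.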

But the plot thickens... these samples are all in pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds, which is the input for the classifier. However, they are not in pbta-gene-expression-rsem-fpkm-collapsed.stranded_classifier_scores.tsv, the results of the classifier.
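A check of this kind can be sketched in base R as below. The objects are toy stand-ins rather than the real files, and two things are assumptions: that the collapsed expression object has one column per sample, and that the scores table keeps the biospecimen ID in a column called `sample_id` (the `score` column is likewise hypothetical).

```r
# Toy stand-in for the collapsed expression matrix: genes x samples.
expression <- matrix(
  0,
  nrow = 2, ncol = 3,
  dimnames = list(c("GENE_A", "GENE_B"), c("BS_TOY_1", "BS_TOY_2", "BS_TOY_3"))
)

# Toy stand-in for the classifier scores table (column names are assumed).
scores <- data.frame(
  sample_id = c("BS_TOY_1", "BS_TOY_3"),
  score = c(0.2, 0.8),
  stringsAsFactors = FALSE
)

# Samples that went into the classifier input but have no score in its output.
missing_from_scores <- setdiff(colnames(expression), scores$sample_id)
print(missing_from_scores)
```

An empty result would mean every input sample was scored; any IDs returned are samples that were dropped between the classifier's input and its output.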

However, the file OpenPBTA-analysis/analyses/collapse-rnaseq/results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds is short by exactly these 7 samples. There is an ifelse that is meant to read from the /data folder when the module is not run for subtyping, and the logic reads as OK to me, but the module appears to have been using the results file from collapse-rnaseq anyway.
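One way to make this kind of mismatch visible is to resolve the input path in one place and log it on every run. The sketch below is written in R for illustration and is not the module's code: `choose_expression_file()`, the `run_for_subtyping` flag, and the `root_dir` argument are hypothetical names, and the two locations are the ones described above.

```r
# Hypothetical helper: pick the expression file based on how the module is run.
choose_expression_file <- function(run_for_subtyping, root_dir = ".") {
  file_name <- "pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds"
  if (run_for_subtyping) {
    # Subtyping runs read the freshly collapsed file from collapse-rnaseq.
    file.path(root_dir, "analyses", "collapse-rnaseq", "results", file_name)
  } else {
    # All other runs read the released file from the data folder.
    file.path(root_dir, "data", file_name)
  }
}

# Log the resolved path so the run output records which file was actually read.
input_file <- choose_expression_file(run_for_subtyping = FALSE)
message("Reading expression data from: ", input_file)
```

A plain `if`/`else` is used here instead of `ifelse()` because the condition is a single flag rather than a vector.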

jaclyn-taroni commented 2 years ago

My understanding of #1389 is that we should close this PR, get all the code changes we know are required in, and then take another pass at rerunning it. So, I am going to close this.