Rerun HGG subtyping (7 of N)

Purpose/implementation Section

What scientific question is your analysis addressing?

Rerun HGG subtyping.

Is there anything that you want to discuss further?

Three samples were removed from HGG_defining_lesions.tsv because path_dx == SEGA (this is reflected in the v21 hist file, with cancer group == SEGA): 7316-2578, 7316-2171, and 7316-3019. Interestingly, these samples were not in HGG_molecular_subtype.tsv, so they do not show up as removed there.

In HGG_molecular_subtype.tsv, PT_8P368R5B (7316-4998) is being removed, but it is DIPG and has a subtype in v21. "Brainstem glioma- Diffuse intrinsic pontine glioma" is in the exact dx strings, but this is an RNA-only sample. PT_8P368R5B is in the list of patients in molecular-subtyping-pathology, but it had no report, so no subtype was created there.

BS_HE0WJRW6 of 7316-1455 was removed; BS_HWGWYCY7 was retained. Both are RNA-Seq, and both are in the v21 subtypes as "To be classified".

> setdiff(master_hgg$Kids_First_Biospecimen_ID_DNA, pr_hgg$Kids_First_Biospecimen_ID_DNA)
character(0)
> setdiff(master_hgg$Kids_First_Biospecimen_ID_RNA, pr_hgg$Kids_First_Biospecimen_ID_RNA)
[1] "BS_SB12W1XT" "BS_FXJY0MNH" "BS_HE0WJRW6" "BS_D7XRFE0R" "BS_KABQQA0T" "BS_FN07P04C" "BS_SHJA4MR0"
> setdiff(master_hgg$Kids_First_Participant_ID, pr_hgg$Kids_First_Participant_ID)
[1] "PT_8P368R5B"

master_hgg$Kids_First_Biospecimen_ID_RNA	sample_id	master_subtype
BS_SB12W1XT	7316-85	HGG, H3 wildtype
BS_FXJY0MNH	7316-4998	HGG, To be classified
BS_HE0WJRW6	7316-1455	HGG, To be classified, TP53 loss
BS_D7XRFE0R	A18777	DMG, H3 K28, TP53 loss
BS_KABQQA0T	A16915	DMG, H3 K28, TP53 loss
BS_FN07P04C	7316-255	HGG, To be classified
BS_SHJA4MR0	7316-161	HGG, H3 wildtype, TP53 loss

Those samples have been recently removed from the TP53 results file:

> p53 <- read_tsv("~/Documents/GitHub/OpenPBTA-analysis/analyses/tp53_nf1_score/results/tp53_altered_status.tsv")
> setdiff(master_hgg$Kids_First_Biospecimen_ID_RNA, p53$Kids_First_Biospecimen_ID_RNA)
[1] "BS_SB12W1XT" "BS_FXJY0MNH" "BS_HE0WJRW6" "BS_D7XRFE0R" "BS_KABQQA0T" "BS_FN07P04C" "BS_SHJA4MR0"

I found it pretty hard to spot the diffs here, so I propose adding an arrange(Kids_First_Biospecimen_ID_DNA) at the end of the code, just before the file is written. I will do that after you take a look.
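As a rough illustration of the sorting proposed above, here is a minimal sketch. It assumes the results table is a tibble with a Kids_First_Biospecimen_ID_DNA column; the object names, the toy IDs and subtypes, and the output path are all hypothetical and are not taken from the module.

```r
library(dplyr)
library(readr)

# Toy stand-in for the subtyping results table (IDs and values are made up).
hgg_subtypes <- tibble(
  Kids_First_Biospecimen_ID_DNA = c("BS_EXAMPLE3", "BS_EXAMPLE1", "BS_EXAMPLE2"),
  molecular_subtype = c("HGG, H3 wildtype", "DMG, H3 K28", "HGG, To be classified")
)

# Sort on the DNA biospecimen ID so the row order is stable between runs
# and a file diff only shows rows that were really added or removed.
sorted_subtypes <- hgg_subtypes %>%
  arrange(Kids_First_Biospecimen_ID_DNA)

# Hypothetical output location; the real module writes to its own results folder.
out_file <- file.path(tempdir(), "HGG_molecular_subtype_sorted.tsv")
write_tsv(sorted_subtypes, out_file)
```

With a stable row order, a removed sample shows up as a single deleted line in the diff instead of being hidden among reordered rows.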

It appears these samples were removed with this PR. Again, this was really hard to spot because the rows are not sorted before the file is written.

But the plot thickens... these samples are all in pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds, which is the input for the classifier. However, they are not in pbta-gene-expression-rsem-fpkm-collapsed.stranded_classifier_scores.tsv, the results of the classifier.
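The comparison described here can be sketched with toy data. This assumes the collapsed expression object is a matrix with biospecimen IDs as column names and that the classifier scores table has a sample_id column; both the layout and the column name are assumptions, and the IDs and scores are made up.

```r
# Toy expression matrix: genes in rows, biospecimen IDs in columns (assumed layout).
expr <- matrix(
  c(1.2, 0.0, 3.4, 5.6, 7.8, 9.1),
  nrow = 2,
  dimnames = list(c("GENE1", "GENE2"), c("BS_X1", "BS_X2", "BS_X3"))
)

# Toy classifier output; `sample_id` and `tp53_score` are assumed column names.
scores <- data.frame(
  sample_id = c("BS_X1", "BS_X3"),
  tp53_score = c(0.91, 0.12),
  stringsAsFactors = FALSE
)

# Samples that went into the classifier input but have no score.
missing_from_scores <- setdiff(colnames(expr), scores$sample_id)

# Samples that have a score but were not in the input (should be empty).
unexpected_in_scores <- setdiff(scores$sample_id, colnames(expr))
```

Running the same two setdiff() calls against the real input and output files would list exactly the samples that went into the classifier but never received a score.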

However, the file at OpenPBTA-analysis/analyses/collapse-rnaseq/results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds is short by exactly these 7 samples. There is an ifelse that should read from the /data folder when the module is not run for subtyping, and the logic reads as OK to me, but the module appears to have been using the results file from collapse-rnaseq instead.
