jharenza commented 2 years ago

Purpose/implementation Section

What scientific question is your analysis addressing?

This updates the histologies file with MB WGS samples as "To be classified", which were previously missed

What was your approach?

Updated https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/molecular-subtyping-MB/04-no-RNA-samples.R so that subtypes read "MB, To be classified" instead of "To be classified"
Updated https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/46a9d5c0656742b79aa472eadbd78f8bdd720fe4/analyses/molecular-subtyping-pathology/pathology_free_text-subtyping-lgat.Rmd to recode "LGG, subtype" to "SEGA, subtype"
Created base-histologies.tsv from v21 and reran molecular-subtype-integrate to get pbta-histologies.tsv.
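For reference, the base-file step can be sketched roughly as below. This is a minimal illustration, not the code added to the script: `make_base_histologies()` is a hypothetical helper name, the toy table stands in for the v21 pbta-histologies.tsv, and the `Kids_First_Biospecimen_ID` column is assumed. The only part taken from this PR is that the base file is v21 minus the `harmonized_diagnosis` and `cancer_group` columns.

```r
library(dplyr)

# Hypothetical helper: drop the two derived columns so that the
# integration module can regenerate them from the subtyping results.
make_base_histologies <- function(histologies) {
  histologies %>%
    select(-any_of(c("harmonized_diagnosis", "cancer_group")))
}

# Toy stand-in for the v21 histologies table (all values are made up).
v21_toy <- tibble(
  Kids_First_Biospecimen_ID = c("BS_A", "BS_B"),
  short_histology = c("LGAT", "Medulloblastoma"),
  harmonized_diagnosis = c("placeholder 1", "placeholder 2"),
  cancer_group = c("placeholder 1", "placeholder 2")
)

base_toy <- make_base_histologies(v21_toy)

# With real files this would be wrapped in readr::read_tsv() and
# readr::write_tsv(); the input and output paths are left to the caller.
print(names(base_toy))
```

Using `any_of()` rather than bare column names keeps the helper from erroring if one of the two columns has already been removed.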

What GitHub issue does your pull request address?

1207

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

Notes:

pbta-histology-base.tsv changes daily, and when rerunning using what I thought might be the base @kgaonkar6 used, I realized this module was recently rerun in this PR by @runjin326, perhaps using a more recent version of the file. Therefore, it looks like there are many more diffs than there really should be here.
To make sure none of the data aside from the few samples needing subtype updates changed, I simply created a new base file using v21 pbta-histologies.tsv minus the columns for harmonized_diagnosis and cancer_group. I put this bit of code at the very top of the script, but I think we may want to comment it out before merge? Similarly, I also added some QC into this, but maybe it is fine because this is the last(?) version of the histologies file for OpenPBTA?
After checking the diffs in the histologies file within this PR on GitHub, I realized some code which was implemented a while back for cancer groups never made it into the histologies file. This only pertains to the LGAT samples, but that means that it affects multiple figures.
Below are the diffs in cancer groups for LGAT broad_histology with the code we had in place (also in 01-integrate-subtyping.nb.html):

# v21
v21 %>%
  filter(short_histology == "LGAT") %>%
  select(cancer_group, experimental_strategy) %>%
  table()
                                     experimental_strategy
cancer_group                          RNA-Seq WGS
  Diffuse fibrillary astrocytoma            0   1
  Low-grade glioma astrocytoma            244 234
  Pilocytic astrocytoma                     1   2
  Pleomorphic xanthoastrocytoma             2   1
  Subependymal Giant Cell Astrocytoma       4   3

# v22
histology %>%
  filter(short_histology == "LGAT") %>%
  select(cancer_group, experimental_strategy) %>%
  table()
                                     experimental_strategy
cancer_group                          RNA-Seq WGS
  Diffuse fibrillary astrocytoma            6   6
  Gliomatosis cerebri                       1   1
  Low-grade glioma astrocytoma             94  89
  Oligodendroglioma                         1   1
  Pilocytic astrocytoma                   126 121
  Pleomorphic xanthoastrocytoma            11  11
  Subependymal Giant Cell Astrocytoma      12  12
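The two tables above only show totals. To see which samples actually moved between cancer groups, a per-sample comparison along these lines might help. This is a sketch under assumptions: `compare_cancer_groups()` is a hypothetical helper, the join key `Kids_First_Biospecimen_ID` is assumed to be the sample identifier in both versions, and the toy data are made up.

```r
library(dplyr)

# Hypothetical helper: join two histology versions on the sample ID and
# count samples whose cancer_group differs between them.
# Note: rows where either cancer_group is NA are dropped by filter().
compare_cancer_groups <- function(old, new,
                                  id_col = "Kids_First_Biospecimen_ID") {
  old %>%
    select(all_of(id_col), cancer_group_old = cancer_group) %>%
    inner_join(
      new %>% select(all_of(id_col), cancer_group_new = cancer_group),
      by = id_col
    ) %>%
    filter(cancer_group_old != cancer_group_new) %>%
    count(cancer_group_old, cancer_group_new, name = "n_samples")
}

# Toy stand-ins for v21 and v22 (IDs and assignments are made up).
old_toy <- tibble(
  Kids_First_Biospecimen_ID = c("BS_A", "BS_B", "BS_C"),
  cancer_group = c("Low-grade glioma astrocytoma",
                   "Low-grade glioma astrocytoma",
                   "Subependymal Giant Cell Astrocytoma")
)
new_toy <- tibble(
  Kids_First_Biospecimen_ID = c("BS_A", "BS_B", "BS_C"),
  cancer_group = c("Pilocytic astrocytoma",
                   "Low-grade glioma astrocytoma",
                   "Subependymal Giant Cell Astrocytoma")
)

changes <- compare_cancer_groups(old_toy, new_toy)
print(changes)
```

Run on the real files after filtering to `short_histology == "LGAT"`, this would list each old-to-new cancer group transition with its sample count.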

The idea behind this separation into cancer groups was to visualize the smaller groups within the oncoprint. The main takeaway, though, is that because there were a handful of pilocytic and pleomorphic (PXA) samples not in the Low-grade glioma astrocytoma cancer_group, and there were some SEGA samples in the Low-grade glioma astrocytoma cancer_group, the analyses were not performed on the exact cohort of interest. This is therefore not an easy fix by simply recoding the v22 cancer_group back to v21. I also realized that Ganglioglioma is already its own cancer group and has a high enough N, so it is in many plots already, but it was missed in the survival LGG_group.

I suppose my thoughts from all of this are that, if we have to remake figures anyway, it probably makes sense to keep the cancer group code as it was added by @kgaonkar6. We may have to add a few more colors to the palette and update survival to use the relevant cancer groups within LGG. 😭

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

no, but we need to discuss next steps

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Reproducibility Checklist

[ ] The dependencies required to run the code in this pull request have been added to the project Dockerfile.
[ ] This analysis has been added to continuous integration.

Documentation Checklist

[ ] This analysis module has a README and it is up to date.
[ ] This analysis is recorded in the table in analyses/README.md and the entry is up to date.
[ ] The analytical code is documented and contains comments.

jharenza commented 2 years ago

Ok, I perhaps need to rerun the LGAT subtyping module as well.

jharenza commented 2 years ago

@jaclyn-taroni I need some help. I tried rerunning molecular subtyping for LGAT, but I am running into errors. First, in ce8dbd2, I am updating the 01 script. It was getting killed at the rbind step for the consensus and hotspot MAFs, so I reordered the code to pull the LGAT samples out of these files upon reading so that they aren't so big. That worked, but 03 is now giving an error at chunk 7, when making the TxDb from the GTF for FGFR1. I saw some perhaps related tickets suggesting this may be due to unstable RefSeq files? I am not sure what to do here.
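For context, the reordering in the 01 script amounts to something like the sketch below: subset each MAF to the LGAT samples first, then bind. This is not the code in ce8dbd2. `keep_lgat_rows()` is a hypothetical helper, the toy tables stand in for the consensus and hotspot MAFs, and matching on the standard MAF column `Tumor_Sample_Barcode` is an assumption.

```r
library(dplyr)

# Hypothetical helper: keep only rows belonging to the LGAT samples,
# so the later bind works on much smaller tables.
keep_lgat_rows <- function(maf, lgat_ids) {
  maf %>% filter(Tumor_Sample_Barcode %in% lgat_ids)
}

# Toy stand-ins (sample IDs and genes are made up).
lgat_ids <- c("BS_A")
consensus_toy <- tibble(
  Tumor_Sample_Barcode = c("BS_A", "BS_Z"),
  Hugo_Symbol = c("BRAF", "TP53")
)
hotspot_toy <- tibble(
  Tumor_Sample_Barcode = c("BS_A", "BS_Y"),
  Hugo_Symbol = c("FGFR1", "H3F3A")
)

# Filter first, then combine; bind_rows() matches columns by name.
lgat_maf <- bind_rows(
  keep_lgat_rows(consensus_toy, lgat_ids),
  keep_lgat_rows(hotspot_toy, lgat_ids)
)
print(lgat_maf)
```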

jharenza commented 2 years ago

closing this and will start fresh once some of the code updates are merged.

AlexsLemonade / OpenPBTA-analysis

release V22 #1365
