AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Assemble files for new Zenodo submission #1692

Closed jaclyn-taroni closed 1 year ago

jaclyn-taroni commented 1 year ago

This issue will almost certainly need to be broken up into more narrowly scoped tasks, but I am filing this to communicate an overview of what we need to accomplish. ⚠️ Please set any more narrowly scoped issues as blocking this one!

Contents

We are going to create an upload to Zenodo that contains:

Proposed Structure and Conventions

I welcome feedback on this proposed structure, in particular. One alternative idea I had was to put zenodo-upload in the root of the repository.

sjspielman commented 1 year ago

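Discussion item: What exactly should we export as CSV files for oncoprints (this question is not about individual molecular alterations)? There are 3 tables that we can pull out of a given maf_object:

  • getSampleSummary(maf_object) - df of alterations across samples
  • getGeneSummary(maf_object) - df of alterations across genes
  • getClinicalData(maf_object) - this is more or less the metadata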

I see a couple options...

  1. Pick one of those and export it! Leaves something (like, the rest of the data...) to be desired
  2. Export each as three separate files, with some intelligent naming.
  3. Join them (either all 3 or just the first 2) and export a single CSV which roughly approximates a maf object.

I'm in the option 3 camp, but very open to persuasion or a 4th option that anyone can think of! CC @jashapiro @jaclyn-taroni

A note for future Stephanie

For actually compiling the summarized-molecular-alterations.csv file, just have to combine each histology's joined getSampleSummary() and getGeneSummary().
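
One way to picture the "combine each histology" step is to stack the per-histology tables and tag every row with its histology. This is only a rough sketch in plain Python rather than the R/maftools code used in the project; the histology labels, column names, and the function names `combine_by_histology` and `rows_to_csv` are hypothetical.

```python
import csv
import io


def combine_by_histology(tables):
    """Stack per-histology lists of row dicts, tagging each row with its histology."""
    combined = []
    for histology, rows in tables.items():
        for row in rows:
            combined.append({"histology": histology, **row})
    return combined


def rows_to_csv(rows):
    """Serialize row dicts to CSV text, using the union of keys as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical per-histology summaries (column names are illustrative only).
tables = {
    "histology_a": [{"Tumor_Sample_Barcode": "Sample1", "total": 3}],
    "histology_b": [{"Tumor_Sample_Barcode": "Sample2", "total": 1}],
}
combined = combine_by_histology(tables)
csv_text = rows_to_csv(combined)
```

Columns missing from a given histology's table are left empty rather than dropped, so the stacked file keeps a single header.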

jaclyn-taroni commented 1 year ago

Trying option 3 to start seems reasonable. I am not sure how much clinical data we will want to include. We might want to check in – look at the output together – before you file a PR.

jashapiro commented 1 year ago

Is there a reason not to just use the MAF file as it is?

jaclyn-taroni commented 1 year ago

Is there a reason not to just use the MAF file as it is?

It would be missing fusions and CNVs.

jharenza commented 1 year ago

Discussion item: What exactly should we export as CSV files for oncoprints (this question is not about individual molecular alterations)? There are 3 tables that we can pull out of a given maf_object:

  • getSampleSummary(maf_object) - df of alterations across samples
  • getGeneSummary(maf_object) - df of alterations across genes
  • getClinicalData(maf_object) - this is more or less the metadata

I see a couple options...

  1. Pick one of those and export it! Leaves something (like, the rest of the data...) to be desired
  2. Export each as three separate files, with some intelligent naming.
  3. Join them (either all 3 or just the first 2) and export a single CSV which roughly approximates a maf object.

I'm in the option 3 camp, but very open to persuasion or a 4th option that anyone can think of! CC @jashapiro @jaclyn-taroni

A note for future Stephanie

For actually compiling the summarized-molecular-alterations.csv file, just have to combine each histology's joined getSampleSummary() and getGeneSummary().

I'm not entirely sure these tables get what we are after - looking briefly:

  1. getSampleSummary(maf_object) - df of alterations across samples --> this doesn't include genes, just the alteration type and number
  2. getGeneSummary(maf_object) - df of alterations across genes --> this doesn't get sample information, so not sure how 1 and 2 can be joined
  3. getClinicalData(maf_object) - don't think this is needed

I was thinking that we would have a table of samples and alterations by gene, such as an oncomatrix, but detailed by exact alteration. This can be achieved by just exporting the information from the combined MAF (SNV+CNV+fusion), for all genes in the oncoprint gene list and all variant classifications which were used in the oncoprint. Does that make sense?

The oncomatrix which can be printed directly from the oncoprint step is something like:

GeneA GeneB GeneC
Sample1 Missense_mutation Fusion
Sample2 Fusion

What I was thinking was something like:

FGFR1 H3F3A BRAF
Sample1 KIAA1549--BRAF
Sample2 p.K28M
Sample3 Amplification
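
A sample-by-gene table detailed by exact alteration could be built by pivoting long-format records from the combined MAF (SNV+CNV+fusion). The following is a minimal sketch in plain Python, not the project's R code; the record layout, the function names, and which gene each example alteration belongs to are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format records: one row per (sample, gene, alteration).
# The gene assigned to each alteration here is illustrative only.
records = [
    {"sample": "Sample1", "gene": "BRAF", "alteration": "KIAA1549--BRAF"},
    {"sample": "Sample2", "gene": "H3F3A", "alteration": "p.K28M"},
    {"sample": "Sample3", "gene": "FGFR1", "alteration": "Amplification"},
]


def build_alteration_matrix(records, genes):
    """Pivot records into {sample: {gene: "alt1;alt2"}}, restricted to `genes`."""
    cells = defaultdict(lambda: defaultdict(list))
    for rec in records:
        if rec["gene"] in genes:
            cells[rec["sample"]][rec["gene"]].append(rec["alteration"])
    return {
        sample: {gene: ";".join(by_gene.get(gene, [])) for gene in genes}
        for sample, by_gene in cells.items()
    }


def matrix_to_csv(matrix, genes):
    """Write the matrix as CSV text with one row per sample, one column per gene."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample"] + list(genes))
    for sample in sorted(matrix):
        writer.writerow([sample] + [matrix[sample][gene] for gene in genes])
    return buffer.getvalue()


genes = ["FGFR1", "H3F3A", "BRAF"]  # stand-in for the oncoprint gene list
matrix = build_alteration_matrix(records, genes)
csv_text = matrix_to_csv(matrix, genes)
```

Multiple alterations in the same gene and sample are joined with `;` in this sketch, and samples with no alteration in the gene list do not get a row; both choices would need to be settled for the real export.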

sjspielman commented 1 year ago

Some comments while working towards Figure S2 panels A-F -

There is very Big data here for PBTA! The good news is that the PBTA and TCGA data that go into panels <A,B,C> and <E,D,F>, respectively, are all about the same, so we will have 1 CSV export for each set of three panels (with appropriate documentation). The PBTA data is not going to fit on GitHub, which is not very surprising. With gz compression it's 955 MB, and while other compression algorithms might drop that, the size certainly won't drop enough for version control! One approach could be to write the data anyway but pop the file into a .gitignore (again, this would be documented!) for the GitHub repo; we would still upload to Zenodo in the end. I can keep this file in our OpenPBTA Google Drive so it's accessible internally within the project. CC @jashapiro @jaclyn-taroni ?
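
As a rough illustration of the "write it gzipped, then decide whether it can live in the repo" idea, here is a sketch in plain Python (the project's own export code is not shown in this thread). The file name and helper names are hypothetical; the 100 MiB figure is GitHub's documented per-file block limit.

```python
import csv
import gzip
import os
import tempfile

GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024  # GitHub blocks files over 100 MiB


def write_gzipped_csv(path, header, rows):
    """Write rows to a gzip-compressed CSV and return the file size in bytes."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return os.path.getsize(path)


def fits_in_github(size_bytes, limit=GITHUB_FILE_LIMIT_BYTES):
    """Return True if a file of this size is within the per-file limit."""
    return size_bytes <= limit


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name and toy rows, for illustration only.
    out_path = os.path.join(tmp_dir, "figure-s2-pbta.csv.gz")
    size = write_gzipped_csv(out_path, ["sample", "value"], [["Sample1", 1], ["Sample2", 2]])
    with gzip.open(out_path, "rt", newline="", encoding="utf-8") as handle:
        round_trip = list(csv.reader(handle))

# A 955 MB file, as reported above, is well past the limit.
too_big = not fits_in_github(955 * 1000 * 1000)
```

A file that fails this check would be the one to list in .gitignore and ship only through Zenodo.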

sjspielman commented 1 year ago

At this point, we are mostly there! There are only three remaining items:

I'm therefore going to close this big issue to track these three smaller scoped items separately.