Closed jaclyn-taroni closed 1 year ago
Trying option 3 to start seems reasonable. I am not sure how much clinical data we will want to include. We might want to check in – look at the output together – before you file a PR.
Is there a reason not to just use the MAF file as it is?
It would be missing fusions and CNVs.
Discussion item: What exactly should we export as CSV files for oncoprints (this question is not about individual molecular alterations)? There are 3 tables that we can pull out of a given `maf_object`:

- `getSampleSummary(maf_object)` - df of alterations across samples
- `getGeneSummary(maf_object)` - df of alterations across genes
- `getClinicalData(maf_object)` - this is more or less the metadata

I see a couple options...

1. Pick one of those and export it! Leaves something (like, the rest of the data...) to be desired.
2. Export each as three separate files, with some intelligent naming.
3. Join them (either all 3 or just the first 2) and export a single CSV which roughly approximates a maf object.

I'm in the option 3 camp, but very open to persuasion or a 4th option that anyone can think of! CC @jashapiro @jaclyn-taroni

A note for future Stephanie

For actually compiling the `summarized-molecular-alterations.csv` file, just have to combine each histology's joined `getSampleSummary()` and `getGeneSummary()`.
I'm not entirely sure these tables get what we are after - looking briefly:

- `getSampleSummary(maf_object)` - df of alterations across samples --> this doesn't include genes, just the alteration type and number
- `getGeneSummary(maf_object)` - df of alterations across genes --> this doesn't get sample information, so not sure how 1 and 2 can be joined
- `getClinicalData(maf_object)` - don't think this is needed

I was thinking that we would have a table of samples and alterations by gene, such as an oncomatrix, but detailed by exact alteration. This can be achieved by just exporting the information from the combined MAF (SNV+CNV+fusion), for all genes in the oncoprint gene list and all variant classifications which were used in the oncoprint. Does that make sense?
The oncomatrix which can be printed directly from the oncoprint step is something like:

| | GeneA | GeneB | GeneC |
|---|---|---|---|
| Sample1 | Missense_mutation | Fusion | |
| Sample2 | Fusion | | |
What I was thinking was something like:

| | FGFR1 | H3F3A | BRAF |
|---|---|---|---|
| Sample1 | | | KIAA1549--BRAF |
| Sample2 | | p.K28M | |
| Sample3 | Amplification | | |
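To make the proposed layout concrete, here is a minimal sketch of pivoting long-form alteration records into a sample-by-gene table whose cells hold the exact alteration. It is written in Python with only the standard library so it stands alone; the project's own code is R, and the record fields (`sample_id`, `gene`, `alteration`), the helper names, and the `;` separator for multiple alterations are all assumptions made for illustration, not the combined MAF's actual columns.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-form records: one row per (sample, gene, alteration).
records = [
    {"sample_id": "Sample1", "gene": "BRAF", "alteration": "KIAA1549--BRAF"},
    {"sample_id": "Sample2", "gene": "H3F3A", "alteration": "p.K28M"},
    {"sample_id": "Sample3", "gene": "FGFR1", "alteration": "Amplification"},
]


def build_detailed_matrix(records, genes):
    """Pivot long-form records into {sample: {gene: "alt1;alt2"}}."""
    cells = defaultdict(lambda: defaultdict(list))
    for rec in records:
        # Keep only genes that are in the (assumed) oncoprint gene list.
        if rec["gene"] in genes:
            cells[rec["sample_id"]][rec["gene"]].append(rec["alteration"])
    # Every gene gets a column; genes without an alteration get an empty cell.
    return {
        sample: {gene: ";".join(by_gene.get(gene, [])) for gene in genes}
        for sample, by_gene in sorted(cells.items())
    }


def matrix_to_csv(matrix, genes):
    """Serialize the matrix as CSV text with one row per sample."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample_id", *genes])
    for sample, row in matrix.items():
        writer.writerow([sample, *(row[gene] for gene in genes)])
    return buffer.getvalue()


genes = ["FGFR1", "H3F3A", "BRAF"]
matrix = build_detailed_matrix(records, genes)
csv_text = matrix_to_csv(matrix, genes)
```

Note that in this sketch a sample with no alterations in the gene list does not appear in the output at all; whether such samples should get an all-empty row is a decision the sketch leaves open.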
Some comments while working towards Figure S2 panels A-F:

There is very Big data here for PBTA! I'll note some good news: the PBTA and TCGA data that go into panels <A,B,C> and <E,D,F>, respectively, are all about the same within each set, so we will have 1 CSV export for each set of three panels (with appropriate documentation).

The PBTA data is not going to fit on GitHub, which is not very surprising. With `gz` compression it's 955 MB, and while other compression algorithms might drop that, the size certainly won't drop enough for version control! One approach could be to write the data anyway but pop the file into a `.gitignore` (again, this would be documented!) for the GitHub repo, but we would still upload to Zenodo in the end. I can keep this file living in our OpenPBTA Google Drive so it's accessible internally within the project.
CC @jashapiro @jaclyn-taroni ?
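As a rough illustration of the "write it anyway, but keep it out of version control" idea, the sketch below writes a gzip-compressed CSV and checks its size against a limit. It uses Python's standard library only; the 100 MB threshold (based on my understanding of GitHub's per-file limit), the helper names, and the example file name are assumptions, not anything the project has settled on.

```python
import csv
import gzip
import os
import tempfile

# Assumption: files above roughly 100 MB cannot be pushed to GitHub.
SIZE_LIMIT_BYTES = 100 * 1024 * 1024


def write_gzipped_csv(path, header, rows):
    """Write rows to a gzip-compressed CSV and return the file size in bytes."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return os.path.getsize(path)


def needs_gitignore(size_bytes, limit=SIZE_LIMIT_BYTES):
    """True when a file is too large to commit and should be ignored instead."""
    return size_bytes > limit


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name and toy rows, just to exercise the helpers.
    out_path = os.path.join(tmp_dir, "example-figure-data.csv.gz")
    size = write_gzipped_csv(
        out_path, ["sample_id", "value"], [["Sample1", 1], ["Sample2", 2]]
    )
    with gzip.open(out_path, "rt", newline="", encoding="utf-8") as handle:
        round_trip = list(csv.reader(handle))
```

With the 955 MB figure quoted above, `needs_gitignore(955 * 1024 * 1024)` is true, which matches the plan of listing the file in `.gitignore` and distributing it via Zenodo.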
At this point, we are mostly there! There are only three remaining items:
I'm therefore going to close this big issue to track these three smaller scoped items separately.
This issue will almost certainly need to be broken up into more narrowly scoped tasks, but I am filing this to communicate an overview of what we need to accomplish. ⚠️ Please set any more narrowly scoped issues as blocking this one!
Contents
We are going to create an upload to Zenodo that contains:

- If `figures/generate-figures.sh` copies a figure panel from module in `analyses`, output the underlying data in the analysis module and copy it to the right place in `figures/generate-figures.sh`; otherwise, include the step that outputs the CSV file in the figure script (`figures/scripts/`) and write it directly where we need it to go.
- `sample_id`.
- `README.md` that describes all of these files.

Proposed Structure and Conventions
I welcome feedback on this proposed structure, in particular. One alternative idea I had was to put `zenodo-upload` in the root of the repository.

- `tables/results` -> `tables/manuscript-tables`; consider splitting tables as main display items vs. supplemental data into their own directories.
- `tables/zenodo-upload/` which will contain everything to be included in the upload.
- `tables/zenodo-upload/figure-data` will contain the CSV files with underlying figure data.
- `<figure number and panel>-<hyphen separated description of data>.csv` For example: `fig2A-lgat-oncoplot-matrix.csv`
- `tables/zenodo-upload/summarized-molecular-alterations.csv` will contain the table of molecular alterations included in Figure 2 for all tumor and cell line samples in the PBTA dataset.
- `tables` that generates this, and then running it via `tables/run-manuscript-tables.sh`.
- `tables/zenodo-upload/README.md` will be the README file we include in the upload.
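The file naming convention above can be sketched as a small helper. This is only an illustration in Python: the function names are hypothetical, the `fig` prefix is inferred from the single example `fig2A-lgat-oncoplot-matrix.csv`, and the rule for turning a free-text description into a hyphen-separated slug is an assumption.

```python
import re


def figure_data_filename(figure, panel, description):
    """Build '<figure number and panel>-<hyphen separated description>.csv'."""
    # Lowercase the description and collapse runs of other characters into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    if not slug:
        raise ValueError("description must contain at least one letter or digit")
    return f"fig{figure}{panel}-{slug}.csv"


def figure_data_path(figure, panel, description,
                     root="tables/zenodo-upload/figure-data"):
    """Place the file under the proposed figure-data directory."""
    return f"{root}/{figure_data_filename(figure, panel, description)}"


example_name = figure_data_filename(2, "A", "LGAT oncoplot matrix")
example_path = figure_data_path(2, "A", "LGAT oncoplot matrix")
```

Generating names from one helper, rather than typing them by hand in each figure script, would keep the README listing and the files on disk consistent.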