Assemble files for new Zenodo submission

jaclyn-taroni commented 1 year ago

This issue will almost certainly need to be broken up into more narrowly scoped tasks, but I am filing this to communicate an overview of what we need to accomplish. ⚠️ Please set any more narrowly scoped issues as blocking this one!

A CSV file for each set of tabular data that underlies individual figure panels. In some cases, multiple panels use a lot of the same data – for example, Figure 4B and 4D both use TP53 classifier scores but have different factors on their x axes – so this is perhaps not as simple as one CSV per panel!
- Proposed rule of thumb: If figures/generate-figures.sh copies a figure panel from module in analyses, output the underlying data in the analysis module and copy it to the right place in figures/generate-figures.sh; otherwise, include the step that outputs the CSV file in the figure script (figures/scripts/) and write it directly where we need it to go.
- All of these should be ordered by sample_id.
A CSV file that contains a table of molecular alterations included in Figure 2 for all tumor and cell line samples in the PBTA dataset (implementation note: limited to samples whose RNA and DNA specimens can be mapped to one another). This is where I recommend starting: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/627ec427ad0a8d9d913e614c9db50546c56d8283/analyses/oncoprint-landscape/01-map-to-sample_id.R
A README.md that describes all of these files.
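
As a rough illustration of the "ordered by sample_id" convention, here is a minimal Python sketch. The project's own scripts are written in R, so this is only an illustration of the idea and not the project's implementation; the function name `write_figure_data` and the `tp53_score` column are hypothetical, and the values are made up.

```python
import csv
import io

def write_figure_data(rows, handle, id_column="sample_id"):
    """Write a list of dict rows as CSV, ordered by the sample identifier."""
    if not rows:
        raise ValueError("no rows to write")
    # Put the identifier first, then the remaining columns in their original order
    fieldnames = [id_column] + [key for key in rows[0] if key != id_column]
    writer = csv.DictWriter(handle, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in sorted(rows, key=lambda r: r[id_column]):
        writer.writerow(row)

# Hypothetical example data (made-up values, hypothetical column name)
rows = [
    {"sample_id": "S2", "tp53_score": 0.91},
    {"sample_id": "S1", "tp53_score": 0.12},
]
buffer = io.StringIO()
write_figure_data(rows, buffer)
csv_text = buffer.getvalue()
```

Sorting at write time, rather than relying on upstream ordering, would keep the exported files stable even if an analysis module changes how it orders its samples.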

Proposed Structure and Conventions

I welcome feedback on this proposed structure, in particular. One alternative idea I had was to put zenodo-upload in the root of the repository.

Change tables/results -> tables/manuscript-tables; consider splitting tables as main display items vs. supplemental data into their own directories.
Create tables/zenodo-upload/ which will contain everything to be included in the upload.
The subdirectory tables/zenodo-upload/figure-data will contain the CSV files with underlying figure data.
- Filenames can follow the convention: <figure number and panel>-<hyphen separated description of data>.csv For example: fig2A-lgat-oncoplot-matrix.csv
tables/zenodo-upload/summarized-molecular-alterations.csv will contain the table of molecular alterations included in Figure 2 for all tumor and cell line samples in the PBTA dataset.
- We might consider including a script in tables that generates this, and then running it via tables/run-manuscript-tables.sh.
tables/zenodo-upload/README.md will be the README file we include in the upload.
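
The proposed filename convention could be sketched as a small helper. This is a hedged Python illustration only (the project's scripts are in R); the function name `figure_data_path` and the slug rule (lowercase, non-alphanumeric runs collapsed to hyphens) are assumptions, not something the proposal specifies.

```python
import re
from pathlib import PurePosixPath

# Directory taken from the proposal above
FIGURE_DATA_DIR = PurePosixPath("tables/zenodo-upload/figure-data")

def figure_data_path(panel, description):
    """Build <figure number and panel>-<hyphen-separated description>.csv."""
    # Assumed rule: lowercase, and collapse runs of non-alphanumerics into hyphens
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    if not slug:
        raise ValueError("description must contain at least one letter or digit")
    return FIGURE_DATA_DIR / f"fig{panel}-{slug}.csv"

example = str(figure_data_path("2A", "LGAT oncoplot matrix"))
```

With the inputs shown, `example` reproduces the `fig2A-lgat-oncoplot-matrix.csv` filename from the proposal, placed under `tables/zenodo-upload/figure-data/`.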

sjspielman commented 1 year ago

Discussion item: What exactly should we export as CSV files for oncoprints (this question is not about individual molecular alterations)? There are 3 tables that we can pull out of a given maf_object:

getSampleSummary(maf_object) - df of alterations across samples
getGeneSummary(maf_object) - df of alterations across genes
getClinicalData(maf_object) - this is more or less the metadata

I see a couple options...

1. Pick one of those and export it! Leaves something (like, the rest of the data...) to be desired
2. Export each as three separate files, with some intelligent naming.
3. Join them (either all 3 or just the first 2) and export a single CSV which roughly approximates a maf object.

I'm in the option 3 camp, but very open to persuasion or a 4th option that anyone can think of! CC @jashapiro @jaclyn-taroni

A note for future Stephanie

For actually compiling the summarized-molecular-alterations.csv file, just have to combine each histology's joined getSampleSummary() and getGeneSummary().

jaclyn-taroni commented 1 year ago

Trying option 3 to start seems reasonable. I am not sure how much clinical data we will want to include. We might want to check in – look at the output together – before you file a PR.

jashapiro commented 1 year ago

Is there a reason not to just use the MAF file as it is?

jaclyn-taroni commented 1 year ago

> Is there a reason not to just use the MAF file as it is?

It would be missing fusions and CNVs.

jharenza commented 1 year ago

I'm not entirely sure these tables get what we are after - looking briefly:

1. getSampleSummary(maf_object) - df of alterations across samples --> this doesn't include genes, just the alteration type and number
2. getGeneSummary(maf_object) - df of alterations across genes --> this doesn't include sample information, so I'm not sure how 1 and 2 can be joined
3. getClinicalData(maf_object) - I don't think this is needed

I was thinking that we would have a table of samples and alterations by gene, such as an oncomatrix, but detailed by exact alteration. This can be achieved by just exporting the information from the combined MAF (SNV+CNV+fusion), for all genes in the oncoprint gene list and all variant classifications which were used in the oncoprint. Does that make sense?

The oncomatrix, which can be printed directly from the oncoprint step, is something like:

| | GeneA | GeneB | GeneC |
|---|---|---|---|
| Sample1 | Missense_mutation | Fusion | |
| Sample2 | | Fusion | |

What I was thinking was something like:

| | FGFR1 | H3F3A | |
|---|---|---|---|
| Sample1 | | | KIAA1549--BRAF |
| Sample2 | | p.K28M | |
| Sample3 | Amplification | | |
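
To make the proposed "samples by gene, detailed by exact alteration" table concrete, here is a minimal Python sketch of the pivot (illustration only; the project's scripts are in R). The function name `detailed_oncomatrix` and the long-format `(sample_id, gene, alteration)` input are assumptions. Putting the fusion in a `BRAF` column is also my assumption, since the example above does not name that column.

```python
from collections import defaultdict

def detailed_oncomatrix(records, genes):
    """Pivot (sample_id, gene, alteration) records into a sample-by-gene table."""
    cells = defaultdict(lambda: defaultdict(list))
    for sample_id, gene, alteration in records:
        # Keep only genes on the (hypothetical) oncoprint gene list
        if gene in genes:
            cells[sample_id][gene].append(alteration)
    # One row per sample, ordered by sample_id; multiple hits are joined with ";"
    return [
        {"sample_id": sample_id,
         **{gene: ";".join(cells[sample_id][gene]) for gene in genes}}
        for sample_id in sorted(cells)
    ]

# Records mirroring the example above; the BRAF column assignment is assumed
records = [
    ("Sample2", "H3F3A", "p.K28M"),
    ("Sample1", "BRAF", "KIAA1549--BRAF"),
    ("Sample3", "FGFR1", "Amplification"),
]
matrix = detailed_oncomatrix(records, genes=["FGFR1", "H3F3A", "BRAF"])
```

Joining multiple alterations in one cell with a separator is one possible design choice; a long-format table with one row per alteration would avoid that and may be easier to filter.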

sjspielman commented 1 year ago

Some comments while working towards Figure S2 panels A-F -

There is very Big data here for PBTA! The good news is that the PBTA and TCGA data that go into panels <A,B,C> and <D,E,F>, respectively, are essentially the same within each set, so we will need only 1 CSV export for each set of three panels (with appropriate documentation). The PBTA data is not going to fit on GitHub, which is not very surprising. With gz compression it's 955 MB, and while other compression algorithms might reduce that, the size certainly won't drop enough for version control! One approach could be to write the data anyway but add the file to a .gitignore (again, this would be documented!) for the GitHub repo; we would still upload it to Zenodo in the end. I can keep this file in our OpenPBTA Google Drive so it's accessible internally within the project. CC @jashapiro @jaclyn-taroni ?
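
For reference, a check like the one described could be sketched as follows. This is a hedged Python illustration, not the project's tooling; the function names are hypothetical, and the 100 MiB threshold reflects GitHub's documented per-file limit for regular (non-LFS) pushes.

```python
import gzip
import os
import tempfile

# GitHub blocks regular pushes of files larger than 100 MiB
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024

def write_gzipped(text, path):
    """Write text to a gzip-compressed file and return its size on disk in bytes."""
    with gzip.open(path, "wt", encoding="utf-8", newline="") as handle:
        handle.write(text)
    return os.path.getsize(path)

def fits_in_git(size_bytes, limit=GITHUB_FILE_LIMIT_BYTES):
    """Return True if a file of this size is under the per-file limit."""
    return size_bytes <= limit

# Tiny made-up CSV, just to exercise the round trip
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv.gz")
    size = write_gzipped("sample_id,value\nS1,1\n", path)
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        round_trip = handle.read()

# The 955 MB figure from the comment above is well over the limit
pbta_fits = fits_in_git(955 * 10**6)
```

A file that fails this check would be a candidate for the .gitignore-plus-Zenodo approach described in the comment.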

sjspielman commented 1 year ago

At this point, we are mostly there! There are only three remaining items:

Finish molecular alteration CSV
Finish documentation (already a separate issue in #1716)
🚀 to Zenodo

I'm therefore going to close this big issue to track these three smaller scoped items separately.
