JulianneDavid / shared-cancer-splicing

Code for reproducing analyses and figures for shared alternative cancer splicing paper
MIT License
4 stars 2 forks source link

shared-cancer-splicing

Reproduces analyses and figures from "Putatively cancer-specific exon–exon junctions are shared across patients and present in developmental and other non-cancer cells" by David et al.

Requirements

Software

We ran Python scripts with Python 3.6.6, which we installed as part of Anaconda 5.2.0. We made use of the third-party modules intervaltree 2.1.0 and seaborn 0.9.0.

We ran R scripts with R 3.6.1 and made use of Data.table 1.12.2, ggplot2 3.2.1, gridExtra 2.3, RColorBrewer 1.1-2, dplyr 0.8.3, survminer 0.4.6, survival 2.44-1.1, ggpubr 0.2.4, and magrittr 1.5.

We further used Snaptron via its qs utility. To obtain it, run

git clone https://github.com/ChristopherWilks/snaptron-experiments

Data

We downloaded exon-exon junction BEDs for GTEx and TCGA and accompanying metadata from recount2:

We also used:

Note we also made use of metaSRA's ontology terms obtained via its API and available in https://github.com/JulianneDavid/shared-cancer-splicing/tree/master/SRA_junction_download/MetaSRA_run_files .

Execution

Preliminaries

  1. Create a new directory to store your database (mkdir DB_DIR) and then create TCGA and GTEx junction index by running jx_indexer.py in index mode:

    python3 jx_indexer.py -d DB_DIR index -c GTEx_JUNCTION_COVERAGE -C TCGA_JUNCTION_COVERAGE -b GTEx_JUNCTION_BED -B TCGA_JUNCTION_BED -p GTEx_PHEN -P TCGA_PHEN -s RECOUNT_SAMPLE_IDS -g GENCODE_ANNOTATION_GTF

    Note: steps 1 and 2 require significant temporary file space to execute. Please ensure that your system has ~20GB available space in the directory where temporary files are stored, or assign temp file storage to a new directory. If your temporary directory does not have enough available space, you will be directed to re-run step 1 in "index-only" mode, the same command as above but with the addition of the "-i" flag. This is NOT neccessary if your first index mode (step 1) run was successful.

  2. Run jx_indexer.py in experiment mode on junction index to collect junction files for analysis:

    python3 jx_indexer.py -d DB_DIR experiment -o OUTPUT_DIR

OUTPUT_DIR will now contain:

  1. Call this directory METASRA_QUERY_FILES. Run:

    python3 create_snaptron_queries.py -s METASRA_QUERY_FILES -i RECOUNT_SAMPLE_IDS -o LOGGING_OUTPUT_DIR -p SRA_JUNCTION_OUTPUT_DIR -u UTILITIES_DIR

where UTILITIES_DIRECTORY is where snaptron-experiments was previously cloned (see Requirements). SRA_JUNCTION_OUTPUT_DIR is where the results will appear.

  1. Run each SAMPLE_TYPE_indiv_chrom_batch_query_script.sh in SRA_JUNCTION_OUTPUT_DIR to collect SRA junctions.

SRA_JUNCTION_OUTPUT_DIR will now contain:

  1. Run collect_1-read_TCGA_jxs.py to collect single-read TCGA junction collection:

    python3 collect_1-read_TCGA_jxs.py -C TCGA_JUNCTION_COVERAGE -B TCGA_JUNCTION_BED -o JSON_DIR

JSON_DIR will now contain:

FULL_PIECHART_DIR will now contain:

To generate two-sample minimum junction sets (used in Figures S3E and S3F), run:

    python3 set_membership_annotation.py --db-path DB_DIR --snaptron-results SNAPTRON_NONCANCER_DIR -d ALL_JX_DIR -g NON_CORE_NORMAL_DIR -p NON_TISSUE_MATCHED_NORMAL_DIR --gtf-file GENCODE_ANNOTATION_GTF --single-read-jx-json 1_READ_TCGA_JX_JSON --cancer-sra-directory SNAPTRON_CANCER_DIR --cancer-gene-census CANCER_GENE_CENSUS --oncokb-cancer-genes ONCOKB_GENES --min-overall-set-count 2 -o 2-SAMPLE_PIECHART_DIR

2-SAMPLE_PIECHART_DIR will now contain:

  1. To perform additional set membership analyses using set_membership_analysis.py (for use in Figures S1C and S1D), run:

    python3 set_membership_analysis.py -o OUTPUT_DIR -s FULL_PIECHART_DIR

FULL_PIECHART_DIR will now contain:

To tally SRA experiments using count_unique_SRA_expts.py, run:

    python3 count_unique_SRA_expts.py -c SNAPTRON_CANCER_EXPTLIST_DIR -n SNAPTRON_NONCANCER_EXPTLIST_DIR -o FIGURE_OUTPUT_DIR

Figure generation

Fig 1A:

    python3 fig1A_overall_set_barplots.py -s FULL_PIECHART_DIR -o FIGURE_OUTPUT_DIR

Fig 1B:

    python3 fig1B_ncn_jx_counts_per_sample.py -j COUNTS_PER_SAMPLE_DIRECTORY -o FIGURE_OUTPUT_DIRECTORY

Fig 1C:

    python3 fig1C_overall_set_prevalence_boxplot.py --set-memberships FULL_PIECHART_DIR -o FIGURE_OUTPUT_DIR

Fig 2A:

    python3 fig2A_TCGA_heatmap.py -d PREVALENCE_FILE_DIR -o FIGURE_OUTPUT_DIR

Fig 2B:

    python3 fig2B_TCGA_subtypes_heatmap.py -d PREVALENCE_FILE_DIR -o FIGURE_OUTPUT_DIR

Fig 2C:

    python3 fig2C_cell_of_origin_heatmap.py  -d PREVALENCE_FILE_DIR -o FIGURE_OUTPUT_DIR --snaptron-results SNAPTRON_NONCANCER_DIR -e SNAPTRON_NONCANCER_EXPTLIST_DIR

Figs 3A, S3A, S3E, and S3F:

(Contact on instructions for regenerating this figure: https://github.com/metamaden)

Instantiate the functions in upset_functions.R, and make a new subdirectory (default name: cxdat), and run get.datlist(), specifying the name of the subdirectory containing the junction data tables. Once this has run, the working directory should contain a large list object containing the junction data tables, as well as a newly populated subdirectory (default: cxdat). Depending on your session memory limit, you may need to remove the object cxdl from your environment before proceeding.

Follow the remaining steps in upset_data.R to: Define the new developmental junction category; Run jx.calcsets() for the supplemental data, then for the main figure data; Replace the unexplained category in the main figure data with the supplemental figure data; Generate the supplemental data tables. To save memory, tables are read processively and processed.

Next, follow the steps in upset_plots_original.R to: Specify the data table containing the whole/total group sets and subset group sets, respectively; Exclude the pre-filtered set columns from visualization; Properly order whole set columns by mean value; Automatically generate the axis limits and labels; Specify the group labels (should match whole set column labels after ordering on means); Generate the plot data objects; Generate the final plot images.

Following these steps should generate both a main and a supplemental upset plot with annotation set and subset abundances. It may be necessary to experiment with different plot and image dimensions in upset_plots_original.R and make.ggset.final() to refine the figure properties. For additional information, refer to the function docstrings and script comments.)

Fig 3B:

    python3 fig3B_antisense_boxplot.py -o FIGURE_OUTPUT_DIR -s FULL_PIECHART_DIR

Fig S1A:

    python3 figS1A_junction_burden_vs_TMB.py -o FIGURE_OUTPUT_DIR -j COUNTS_PER_SAMPLE_DIR -s RECOUNT_SAMPLE_IDS -p TCGA_PHEN -t TCGA_TMB

Fig S1B:

    python3 fig1B_S1B_ncn_jx_counts_per_sample.py -d -j FILTERED_NCN_JX_PER_SAMPLE_DIR -p FILTERED_NTM_JX_PER_SAMPLE_DIR -d -g THYM CESC UVM DLBC --prepared-sort-order -o FIGURE_OUTPUT_DIR

Figs S1C and S1D:

    python3 figS1CD_S3CD_junction_sharedness.py -d TCGA_PREVALENCE_DIR -o FIGURE_OUTPUT_DIR

Note: requires contents of TCGA_PREVALENCE_DIR created by running set_membership_analysis.py in step 6 above.

Fig S1E:

    python3 figS1E_set_prevalences_per_cancer.py -s FULL_PIECHART_DIR -o FIGURE_OUTPUT_DIR

Figs S1F, S1G, S1H, and S1I:

(Contact for instructions on regenerating this figure: https://github.com/weederb23)

  1. Download the TCGA mutation files from http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/ for each cancer type and move into the sub-directory ./TCGA_mut_files/, where ./ is the location of "SF_mutation_script.R" and "SF_sharedness_plots.R". For full list of the direct file paths for mutation call file downloads see "mutation_call_file_list.txt"

  2. Copy files from NON_CORE_NORMAL_JX_COORD_DIR to a new directory within ./ called "non_core_jxns/". Rename files from full cancer names to match TCGA cancer codes.

  3. Copy files from [PIECHART DIRECTORY] to a new directory within ./ called "non_core_jxn_annotations/". Rename files from full cancer names to match TCGA cancer codes.

  4. Run Rscript SF_mutation_script.R.

This generates fig_s1f.jpg, fig_s1g.jpg and shared_jxn_df.txt. shared_jxn_df.txt is used downstream for SF_sharedness_plots.R.

If run interactively, also presents statistics for junction comparisons between patients with and without splicing factor mutations across cancers.

  1. Run SF_sharedness_plots.R.

This generates fig_s1h.jpg and fig_s1i.jpg. If run interactively, also presents statistics for jxn comparisons between patients with and without splicing factor mutations across cancers.

Fig S1J:

    python3 figS1J_S3B_SRA_cancer_shared_prevalence.py --snaptron-results SNAPTRON_CANCER_DIR -e SNAPTRON_CANCER_EXPTLIST_DIR -o FIGURE_OUTPUT_DIR -d PREVALENCE_FILE_DIR

Fig S2A:

    python3 figS2A_full_TCGA_SRA_heatmap.py  -d PREVALENCE_FILE_DIR -o FIGURE_OUTPUT_DIR --snaptron-results SNAPTRON_NONCANCER_DIR -e SNAPTRON_NONCANCER_EXPTLIST_DIR

Fig S2B:

    python3 figS2B_SKCM_jx_similarity.py --snaptron-results SNAPTRON_NONCANCER_DIR -d ALL_JXS_PER_SAMPLE_DIR -o FIGURE_OUTPUT_DIR

Fig S3B:

    python3 figS1J_S3B_SRA_cancer_shared_prevalence.py --snaptron-results SNAPTRON_CANCER_DIR -e SNAPTRON_CANCER_EXPTLIST_DIR -o FIGURE_OUTPUT_DIR -d UNEXPLAINED_PIECHART_DIR --unexplained-junctions

Fig S3C and S3D:

    python3 figS1CD_S3CD_junction_sharedness.py -d UNEXPLAINED_PIECHART_DIR -o FIGURE_OUTPUT_DIR

Fig S3G:

    python3 figS3G_prepare_survival_data.py -d DB_DIR -o FIGURE_OUTPUT_DIR -P TCGA_PHEN       
    Rscript figS3G_survival_curve.R FIGURE_OUTPUT_DIR