shared-cancer-splicing

Reproduces analyses and figures from "Putatively cancer-specific exon–exon junctions are shared across patients and present in developmental and other non-cancer cells" by David et al.

Requirements

Software

We ran Python scripts with Python 3.6.6, which we installed as part of Anaconda 5.2.0. We made use of the third-party modules intervaltree 2.1.0 and seaborn 0.9.0.

We ran R scripts with R 3.6.1 and made use of Data.table 1.12.2, ggplot2 3.2.1, gridExtra 2.3, RColorBrewer 1.1-2, dplyr 0.8.3, survminer 0.4.6, survival 2.44-1.1, ggpubr 0.2.4, and magrittr 1.5.

We further used Snaptron via its qs utility. To obtain it, run

git clone https://github.com/ChristopherWilks/snaptron-experiments

Data

We downloaded exon-exon junction BEDs for GTEx and TCGA and accompanying metadata from recount2:

GTEx junction IDs (GTEX_JUNCTION_BED)
GTEx samples with junctions (GTEX_JUNCTION_COVERAGE)
GTEx phenotype file (GTEX_PHEN)
TCGA junction IDs (TCGA_JUNCTION_BED)
TCGA samples with junctions (TCGA_JUNCTION_COVERAGE)
TCGA phenotype file (TCGA_PHEN)
recount sample IDs (RECOUNT_SAMPLE_IDS)

We also used:

GENCODE v28 (GENCODE_GTF)
TCGA tumor mutation burden calls (TCGA_TMB)
COSMIC's cancer gene census (after login, this should download as a CSV; CANCER_GENE_CENSUS)
OncoKB cancer gene list (ONCOKB_GENES)
Patient somatic mutation call files: “Mutation_Packager_Oncotated_Calls” tar.gz for each cancer type from http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/ (For full list of the direct file paths for mutation call file downloads see the file mutation_call_file_list.txt in the splicing_factor_mutation_analysis directory.)
UniProt splicing-associated gene file (TSV): keyword:"mRNA splicing [KW-0508]" & Reviewed:yes & organism:"Homo sapiens (Human) [9606]" from https://www.uniprot.org/uniprot/

Note we also made use of metaSRA's ontology terms obtained via its API and available in https://github.com/JulianneDavid/shared-cancer-splicing/tree/master/SRA_junction_download/MetaSRA_run_files .

Execution

Preliminaries

Create a new directory to store your database (mkdir DB_DIR) and then create TCGA and GTEx junction index by running jx_indexer.py in index mode:
```
python3 jx_indexer.py -d DB_DIR index -c GTEx_JUNCTION_COVERAGE -C TCGA_JUNCTION_COVERAGE -b GTEx_JUNCTION_BED -B TCGA_JUNCTION_BED -p GTEx_PHEN -P TCGA_PHEN -s RECOUNT_SAMPLE_IDS -g GENCODE_ANNOTATION_GTF
```
Note: steps 1 and 2 require significant temporary file space to execute. Please ensure that your system has ~20GB available space in the directory where temporary files are stored, or assign temp file storage to a new directory. If your temporary directory does not have enough available space, you will be directed to re-run step 1 in "index-only" mode, the same command as above but with the addition of the "-i" flag. This is NOT neccessary if your first index mode (step 1) run was successful.
Run jx_indexer.py in experiment mode on junction index to collect junction files for analysis:
```
python3 jx_indexer.py -d DB_DIR experiment -o OUTPUT_DIR
```

OUTPUT_DIR will now contain:

all_jxs (ALL_JX_DIR) for use in set membership annotation;
non-core-normal_jxs_per_sample (NON_CORE_NORMAL_DIR) and non-tissue-matched_jxs_per_sample (NON_TISSUE_MATCHED_NORMAL_DIR) for use in set membership annotation;
non-core-normal_counts_per_sample (COUNTS_PER_SAMPLE_DIR) (junctions not found in core normals) to generate Figure 1B;
non-core-normal_jx_prevalences (PREVALENCE_FILE_DIR) (TCGA cancer junctions not found in core normals with cancer-type prevalence, per cancer type) to generate Figures 2A, 2B, 2C, and S2A;
FigS1B_non-core-normal_counts_per_sample (FILTERED_NCN_JX_PER_SAMPLE_DIR) (coverage- and annotation-filtered junctions not found in GTEx or TCGA tissue-matched normal samples) and FigS1B_non-tissue-matched_counts_per_sample (FILTERED_NTM_JX_PER_SAMPLE_DIR) (coverage- and annotation-filtered junctions not found in core normals) to generate Figure S1B;
jx_non-core-normal_jxs_per_sample (NON_CORE_NORMAL_JX_COORD_DIR) for use in Figures S1F, S1G, S1H, and S1I;
(all_jxs_per_sample_paired_normals) ALL_JXS_PER_SAMPLE_DIR containing all junctions for TCGA SKCM normal samples, TCGA SKCM tumor samples, and GTEx bulk skin normal samples to generate Figure S2B.

Call this directory METASRA_QUERY_FILES. Run:

python3 create_snaptron_queries.py -s METASRA_QUERY_FILES -i RECOUNT_SAMPLE_IDS -o LOGGING_OUTPUT_DIR -p SRA_JUNCTION_OUTPUT_DIR -u UTILITIES_DIR

where UTILITIES_DIRECTORY is where snaptron-experiments was previously cloned (see Requirements). SRA_JUNCTION_OUTPUT_DIR is where the results will appear.

Run each SAMPLE_TYPE_indiv_chrom_batch_query_script.sh in SRA_JUNCTION_OUTPUT_DIR to collect SRA junctions.

SRA_JUNCTION_OUTPUT_DIR will now contain:

SRA_noncancer_rawresults (SNAPTRON_NONCANCER_DIR) for non-cancer junction results
SRA_noncancer_exptlists (SNAPTRON_NONCANCER_EXPTLIST_DIR) for non-cancer experiment lists
SRA_cancer_rawresults (SNAPTRON_CANCER_DIR) for cancer junction results
SRA_cancer_exptlists (SNAPTRON_CANCER_EXPTLIST_DIR) for cancer experiment lists

Run collect_1-read_TCGA_jxs.py to collect single-read TCGA junction collection:

python3 collect_1-read_TCGA_jxs.py -C TCGA_JUNCTION_COVERAGE -B TCGA_JUNCTION_BED -o JSON_DIR

JSON_DIR will now contain:

single-read-tcga-jxs_json.txt containing single-read junctions for use in set membership annotation runs below, using set_membership_annotation.py, as follows:

python3 set_membership_annotation.py --db-path DB_DIR --snaptron-results SNAPTRON_NONCANCER_DIR -d ALL_JX_DIR -g NON_CORE_NORMAL_DIR -p NON_TISSUE_MATCHED_NORMAL_DIR --gtf-file GENCODE_ANNOTATION_GTF --single-read-jx-json JSON_DIR/single-read-tcga-jxs_json.txt --cancer-sra-directory SNAPTRON_CANCER_DIR --cancer-gene-census CANCER_GENE_CENSUS --oncokb-cancer-genes ONCOKB_GENES --min-overall-set-count 1 -o FULL_PIECHART_DIR

FULL_PIECHART_DIR will now contain:

unexplained (UNEXPLAINED_DIR) containing unexplained subset of all single-read junctions
developmental (DEVELOPMENTAL_DIR) containing developmentally-occurring subset of all single-read junctions

To generate two-sample minimum junction sets (used in Figures S3E and S3F), run:

    python3 set_membership_annotation.py --db-path DB_DIR --snaptron-results SNAPTRON_NONCANCER_DIR -d ALL_JX_DIR -g NON_CORE_NORMAL_DIR -p NON_TISSUE_MATCHED_NORMAL_DIR --gtf-file GENCODE_ANNOTATION_GTF --single-read-jx-json 1_READ_TCGA_JX_JSON --cancer-sra-directory SNAPTRON_CANCER_DIR --cancer-gene-census CANCER_GENE_CENSUS --oncokb-cancer-genes ONCOKB_GENES --min-overall-set-count 2 -o 2-SAMPLE_PIECHART_DIR

2-SAMPLE_PIECHART_DIR will now contain:

unexplained (UNEXPLAINED_2-SAMPLE_DIR) containing unexplained subset of all two-sample junctions
developmental (DEVELOPMENTAL_2-SAMPLE_DIR) containing developmentally-occurring subset of all two-sample junctions

To perform additional set membership analyses using set_membership_analysis.py (for use in Figures S1C and S1D), run:
```
python3 set_membership_analysis.py -o OUTPUT_DIR -s FULL_PIECHART_DIR
```

FULL_PIECHART_DIR will now contain:

unexplained (UNEXPLAINED_DIR) containing unexplained subset of all single-read junctions
developmental (DEVELOPMENTAL_DIR) containing developmentally-occurring subset of all single-read junctions
true_TCGA_prevalence_files (TCGA_PREVALENCE_DIR) containing non-core-normal junctions

To tally SRA experiments using count_unique_SRA_expts.py, run:

    python3 count_unique_SRA_expts.py -c SNAPTRON_CANCER_EXPTLIST_DIR -n SNAPTRON_NONCANCER_EXPTLIST_DIR -o FIGURE_OUTPUT_DIR

Figure generation

Fig 1A:

    python3 fig1A_overall_set_barplots.py -s FULL_PIECHART_DIR -o FIGURE_OUTPUT_DIR

Fig 1B:

    python3 fig1B_ncn_jx_counts_per_sample.py -j COUNTS_PER_SAMPLE_DIRECTORY -o FIGURE_OUTPUT_DIRECTORY

Fig 1C:

    python3 fig1C_overall_set_prevalence_boxplot.py --set-memberships FULL_PIECHART_DIR -o FIGURE_OUTPUT_DIR

Fig 2A:

    python3 fig2A_TCGA_heatmap.py -d PREVALENCE_FILE_DIR -o FIGURE_OUTPUT_DIR

Fig 2B:

    python3 fig2B_TCGA_subtypes_heatmap.py -d PREVALENCE_FILE_DIR -o FIGURE_OUTPUT_DIR

Fig 2C:

    python3 fig2C_cell_of_origin_heatmap.py  -d PREVALENCE_FILE_DIR -o FIGURE_OUTPUT_DIR --snaptron-results SNAPTRON_NONCANCER_DIR -e SNAPTRON_NONCANCER_EXPTLIST_DIR

Figs 3A, S3A, S3E, and S3F:

(Contact on instructions for regenerating this figure: https://github.com/metamaden)

Instantiate the functions in upset_functions.R, and make a new subdirectory (default name: cxdat), and run get.datlist(), specifying the name of the subdirectory containing the junction data tables. Once this has run, the working directory should contain a large list object containing the junction data tables, as well as a newly populated subdirectory (default: cxdat). Depending on your session memory limit, you may need to remove the object cxdl from your environment before proceeding.

Follow the remaining steps in upset_data.R to: Define the new developmental junction category; Run jx.calcsets() for the supplemental data, then for the main figure data; Replace the unexplained category in the main figure data with the supplemental figure data; Generate the supplemental data tables. To save memory, tables are read processively and processed.

Next, follow the steps in upset_plots_original.R to: Specify the data table containing the whole/total group sets and subset group sets, respectively; Exclude the pre-filtered set columns from visualization; Properly order whole set columns by mean value; Automatically generate the axis limits and labels; Specify the group labels (should match whole set column labels after ordering on means); Generate the plot data objects; Generate the final plot images.

Following these steps should generate both a main and a supplemental upset plot with annotation set and subset abundances. It may be necessary to experiment with different plot and image dimensions in upset_plots_original.R and make.ggset.final() to refine the figure properties. For additional information, refer to the function docstrings and script comments.)

Fig 3B:

    python3 fig3B_antisense_boxplot.py -o FIGURE_OUTPUT_DIR -s FULL_PIECHART_DIR

Fig S1A:

    python3 figS1A_junction_burden_vs_TMB.py -o FIGURE_OUTPUT_DIR -j COUNTS_PER_SAMPLE_DIR -s RECOUNT_SAMPLE_IDS -p TCGA_PHEN -t TCGA_TMB

Fig S1B:

    python3 fig1B_S1B_ncn_jx_counts_per_sample.py -d -j FILTERED_NCN_JX_PER_SAMPLE_DIR -p FILTERED_NTM_JX_PER_SAMPLE_DIR -d -g THYM CESC UVM DLBC --prepared-sort-order -o FIGURE_OUTPUT_DIR

Figs S1C and S1D:

    python3 figS1CD_S3CD_junction_sharedness.py -d TCGA_PREVALENCE_DIR -o FIGURE_OUTPUT_DIR

Note: requires contents of TCGA_PREVALENCE_DIR created by running set_membership_analysis.py in step 6 above.

Fig S1E:

    python3 figS1E_set_prevalences_per_cancer.py -s FULL_PIECHART_DIR -o FIGURE_OUTPUT_DIR

Figs S1F, S1G, S1H, and S1I:

(Contact for instructions on regenerating this figure: https://github.com/weederb23)

Download the TCGA mutation files from http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/ for each cancer type and move into the sub-directory ./TCGA_mut_files/, where ./ is the location of "SF_mutation_script.R" and "SF_sharedness_plots.R". For full list of the direct file paths for mutation call file downloads see "mutation_call_file_list.txt"
Copy files from NON_CORE_NORMAL_JX_COORD_DIR to a new directory within ./ called "non_core_jxns/". Rename files from full cancer names to match TCGA cancer codes.
Copy files from [PIECHART DIRECTORY] to a new directory within ./ called "non_core_jxn_annotations/". Rename files from full cancer names to match TCGA cancer codes.
Run Rscript SF_mutation_script.R.

This generates fig_s1f.jpg, fig_s1g.jpg and shared_jxn_df.txt. shared_jxn_df.txt is used downstream for SF_sharedness_plots.R.

If run interactively, also presents statistics for junction comparisons between patients with and without splicing factor mutations across cancers.

Run SF_sharedness_plots.R.

This generates fig_s1h.jpg and fig_s1i.jpg. If run interactively, also presents statistics for jxn comparisons between patients with and without splicing factor mutations across cancers.

Fig S1J:

    python3 figS1J_S3B_SRA_cancer_shared_prevalence.py --snaptron-results SNAPTRON_CANCER_DIR -e SNAPTRON_CANCER_EXPTLIST_DIR -o FIGURE_OUTPUT_DIR -d PREVALENCE_FILE_DIR

Fig S2A:

    python3 figS2A_full_TCGA_SRA_heatmap.py  -d PREVALENCE_FILE_DIR -o FIGURE_OUTPUT_DIR --snaptron-results SNAPTRON_NONCANCER_DIR -e SNAPTRON_NONCANCER_EXPTLIST_DIR

Fig S2B:

    python3 figS2B_SKCM_jx_similarity.py --snaptron-results SNAPTRON_NONCANCER_DIR -d ALL_JXS_PER_SAMPLE_DIR -o FIGURE_OUTPUT_DIR

Fig S3B:

    python3 figS1J_S3B_SRA_cancer_shared_prevalence.py --snaptron-results SNAPTRON_CANCER_DIR -e SNAPTRON_CANCER_EXPTLIST_DIR -o FIGURE_OUTPUT_DIR -d UNEXPLAINED_PIECHART_DIR --unexplained-junctions

Fig S3C and S3D:

    python3 figS1CD_S3CD_junction_sharedness.py -d UNEXPLAINED_PIECHART_DIR -o FIGURE_OUTPUT_DIR

Fig S3G:

    python3 figS3G_prepare_survival_data.py -d DB_DIR -o FIGURE_OUTPUT_DIR -P TCGA_PHEN       
    Rscript figS3G_survival_curve.R FIGURE_OUTPUT_DIR

JulianneDavid / shared-cancer-splicing

readme