AlexandrovLab / SigProfilerTopography

SigProfilerTopography allows evaluating the effect of chromatin organization, histone modifications, transcription factor binding, DNA replication, and DNA transcription on the activities of different mutational processes. SigProfilerTopography elucidates the unique topographical characteristics of mutational signatures.
BSD 2-Clause "Simplified" License
18 stars 1 forks source link
bioinformatics cancer-genomics mutation-analysis mutation-topography mutational-signatures somatic-variants

License Docs Build Status Uptime Robot status

schematic

SigProfilerTopography

SigProfilerTopography allows evaluating the effect of chromatin organization, histone modifications, transcription factor binding, DNA replication, and DNA transcription on the activities of different mutational processes. SigProfilerTopography elucidates the unique topographical characteristics of mutational signatures. The tool seamlessly integrates with other SigProfiler tools including SigProfilerMatrixGenerator, SigProfilerSimulator, and SigProfilerAssignment. Detailed documentation can be found at: https://osf.io/5unby/wiki/home/

SigProfilerTopography provides topography analyses for mutations such as

and carries out following analyses:

PREREQUISITES

The framework is written in PYTHON, however, it also requires the following software with the given versions (or newer):

QUICK START GUIDE

This section will guide you through the minimum steps required to run SigProfilerTopography:

  1. For most recent stable PyPI version of this tool, install the python package using pip:
    $ pip install SigProfilerTopography

    If you have installed SigProfilerTopography before, upgrade using pip:

    $ pip install SigProfilerTopography --upgrade
  1. Imports the example data that is provided by SigProfilerTopography. This data can be used to run the example program and ensure that the environment is set up.

    >>> from SigProfilerTopography import Topography as topography
    >>> topography.install_example_data()

    Imports 21BRCA.zip under the current working directory. Once 21BRCA.zip has been downloaded, unzip the file. The unzipped 21BRCA folder contains two folders: 21BRCA_vcfs and 21BRCA_probabilities. The folder 21BRCA_vcfs contains 21 VCF files (one per each breast cancer sample) in GRCh37 and 21BRCA_probabilities` contains probability matrix files for single base substitutions and doublet base substitutions.

  2. Install your desired reference genome from the command line/terminal as follows (available reference genomes are: GRCh37, GRCh38, mm9, and mm10):

    $ python
    >>> from SigProfilerMatrixGenerator import install as genInstall
    >>> genInstall.install('GRCh37')

    This will install the human 37 assembly as a reference genome.

  3. Imports the nucleosome library file that is necessary for nucleosome occupancy analyses. Next, choose the genome that you would like to import:

    >>> from SigProfilerTopography import Topography as topography
    >>> topography.install_nucleosome("GRCh37")

    By default, install_nucleosome imports nucleosome data of K562 cell line for GRCh37 and GRCh38 genome assemblies.

  4. Imports the open chromatin library file that is necessary for epigenomics analyses. Next, choose the genome that you would like to import:

    >>> from SigProfilerTopography import Topography as topography
    >>> topography.install_atac_seq("GRCh37")

    By default, install_atac_seq imports open chromatin data of breast epithelium tissue for GRCh37 and left lung tissue for GRCh38.

  5. Imports the replication timing library file that is necessary for replication timing analyses. Next, choose the genome that you would like to import:

    >>> from SigProfilerTopography import Topography as topography
    >>> topography.install_repli_seq("GRCh37")

    By default, install_repli_seq imports replication time data of MCF7 and IMR90 for GRCh37 and GRCh38, respectively.

  6. Conducts topography analyses for your samples. Here is an example of a call to runAnalyses that generates all of the different analyses.

    >>> from SigProfilerTopography import Topography as topography
    
    >>> genome = "GRCh37"
    >>> inputDir = "path/to/21BRCA_vcfs"
    >>> outputDir = "path/to/results"
    >>> jobname = "21BRCA_SPT"
    >>> numofSimulations = 5
    
    >>> if __name__ == "__main__":
            topography.runAnalyses(genome, 
                       inputDir, 
                       outputDir, 
                       jobname, 
                       numofSimulations, 
                       epigenomics=True,
                       nucleosome=True, 
                       replication_time=True, 
                       strand_bias=True, 
                       processivity=True)

    If probability files are not provided, SigProfilerTopography utilizes SigProfilerAssignment by default to attribute the activities of known reference mutational signatures from the Catalogue Of Somatic Mutations In Cancer (COSMIC) database to each examined sample.

  7. Here is an example of a call to runAnalyses with probability files using the 21 VCF files located in the subfolder 21BRCA_vcfs as input and providing the probability files in the subfolder 21BRCA_probabilities.

    >>> from SigProfilerTopography import Topography as topography
    
    >>> genome = "GRCh37"
    >>> inputDir = "path/to/21BRCA_vcfs"
    >>> outputDir = "path/to/results"
    >>> jobname = "21BRCA_SPT_with_probability_matrices"
    >>> numofSimulations = 5
    >>> sbs_probability_file = "path/to/21BRCA_probabilities/COSMIC_SBS96_Decomposed_Mutation_Probabilities.txt"
    >>> dbs_probability_file = "path/to/21BRCA_probabilities/COSMIC_DBS78_Decomposed_Mutation_Probabilities.txt"
    
    >>> if __name__ == "__main__":
            topography.runAnalyses(genome, 
                       inputDir, 
                       outputDir, 
                       jobname, 
                       numofSimulations, 
                       sbs_probabilities = sbs_probability_file,
                       dbs_probabilities = dbs_probability_file,
                       epigenomics=True,
                       nucleosome=True, 
                       replication_time=True, 
                       strand_bias=True, 
                       processivity=True)

    SigProfilerTopography utilizes probability matrix files containing the probability of each signature to cause a specific mutation type in a cancer sample.

View the table below for the full list of runAnalyses parameters.

PARAMETERS Category Parameter Variable Type Parameter Description
Required
genome String The reference genome used for the topography analyses. Accepted values include: {"GRCh37", "GRCh38", "mm10"}.
inputDir String The path to the directory containing the input files. SigProfilerTopography accepts all input files that SigProfilerMatriXGenerator can process.
outputDir String The path of the directory where the output will be saved. If this directory doesn't exist, a new one will be created.
jobname String The name of the directory containing all of the outputs under outputDir/jobname. If this directory doesn't exist, a new one will be created.
numofSimulations Integer The number of simulations to be created.
Optional
epigenomics Boolean Generate epigenomics analysis when True. By default, this is set to False.
nucleosome Boolean Generate nucleosome occupancy analysis when True. By default, this is set to False.
replication_time Boolean Generate replication timing analysis when True. By default, this is set to False.
strand_bias Boolean Generate replication and transcription strand asymmetry analysis when True. By default, this is set to False.
replication_strand_bias Boolean Generate replication strand asymmetry analysis when True. By default, this is set to False.
transcription_strand_bias Boolean Generate transcription strand asymmetry analysis (including genic versus intergenic regions) when True. By default, this is set to False.
processivity Boolean Generate strand-coordinated mutagenesis when True. By default, this is set to False.
epigenomics_files List of Strings Python list of paths for each epigenomics library file utilized in the epigenomics analysis. By default, epigenomics files of open chromatin, CTCF and histone modifications attained from "breast_epithelium" and "lung" tissue are utilized for GRCh37 and GRCh38, respectively.
epigenomics_dna_elements List of Strings Python list of unique DNA element names for the epigenomics files utilized in the epigenomics analysis. Each DNA element name must be contained in at least one epigenomics library filename. E.g., DNA element is 'CTCF' for the epigenomics file of 'ENCFF782GCQ_breast_epithelium_Normal_CTCF-human.bed'. By default, DNA elements of ['H3K27me3', 'H3K36me3', 'H3K9me3', 'H3K27ac', 'H3K4me1', 'H3K4me3', 'CTCF', 'ATAC'] are utilized for GRCh37 and GRCh38. If user provided epigenomics_files is provided, then epigenomics_dna_elements is mandatory.
epigenomics_biosamples List of Strings Python list of unique biosample names for the epigenomics files utilized in the epigenomics analyses. Each biosample name must be contained in at least one epigenomics library filename. E.g., biosample is 'breast_epithelium' for the epigenomics file of 'ENCFF782GCQ_breast_epithelium_Normal_CTCF-human.bed'. By default, "breast_epithelium" and "lung" biosamples are utilized for GRCh37 and GRCh38, respectively. Biosamples are shown in the epigenomics heatmaps if plot_detailed_epigemomics_heatmaps is set to True.
nucleosome_biosample String Biosample that will be used for nucleosome occupancy analysis. Analysis can be done by using either K562 or GM12878 cell line from ENCODE. By default, the K562 cell line is used for GRCh37 and GRCh38.
nucleosome_file String The path to the nucleosome occupancy library file that will be used for the analysis. By default, nucleosome occupancy file (MNase-seq) of K562 cell line is used for GRCh37 and GRCh38.
replication_time_biosample String Biosample that will be used to carry out replication timing and replication strand asymmetry analyses. By default, MCF7 and IMR90 cell lines are utilized for GRCh37 and GRCh38, respectively. For the complete list of available replication time biosamples, refer to the Replication Time Biosamples table below.
replication_time_signal_file String The path to the replication time signal file. By default, replication time signal file (wig file) of MCF7 and IMR90 cell lines are utilized for GRCh37 and GRCh38, respectively.
replication_time_valley_file String The path to the replication time valley file. By default, replication time valley file (bed file) of MCF7 and IMR90 cell lines are utilized for GRCh37 and GRCh38, respectively.
replication_time_peak_file String The path to the replication time peak file. By default, replication time peak file (bed file) of MCF7 and IMR90 cell lines are utilized for GRCh37 and GRCh38, respectively.
samples_of_interest List of Strings Conduct topography analyses for these samples of interest only. By default, it is set to None and topography analyses are carried out for all samples.
discreet_mode Boolean Each mutation contributes to the topography analyses either with 1 or 0 when True; otherwise, each mutation contributes with its probability when False. By default, this is set to True.
average_probability Float The average probability of the mutations assigned to a SBS, DBS, and ID signature. By default, it is set to 0.90. The average_probability applies when discreet_mode is True. We set signature specific cutoffs, such that for the mutations satisfying mutation_signature_probability >= cutoff, average probability of these mutations must be at least 0.90.
num_of_sbs_required Integer The minimum required number of mutations for a SBS signature. The num_of_sbs_required applies when discreet_mode is True or when discreet_mode is False and show_all_signatures is False. By default, it is set to 2000.
num_of_dbs_required Integer The minimum required number of mutations for a DBS signature. The num_of_dbs_required applies when discreet_mode is True or when discreet_mode is False and show_all_signatures is False. By default, it is set to 200.
num_of_id_required Integer The minimum required number of mutations for a ID signature. The num_of_id_required applies when discreet_mode is True or when discreet_mode is False and show_all_signatures is False. By default, it is set to 1000.
exceptional_signatures Dictionary The dictionary of exceptional signatures. The exceptional_signatures applies when discreet_mode is True. E.g., exceptional_signatures = {"SBS32" : 0.63} is a Python dictionary where key is a mutational signature and value is an average probability. Exceptional signatures are included in the topography analyses if they satisfy num_of_sbs_required, num_of_dbs_required, and num_of_id_required constraints with average_probability >= given average probability.
default_cutoff Float The default_cutoff applies for all signatures when discreet_mode is False. Mutations satisfying mutation_signature_probability >= default_cutoff are considered in the topography analyses with their probability. By default, it is set to 0.5.
show_all_signatures Boolean The show_all_signatures applies when discreet_mode is False. All signatures are considered in the topography analyses when True, otherwise signatures satisfying num_of_sbs_required, num_of_dbs_required, and num_of_id_required are considered in the topography analyses when False. By default, it is set to True.
plot_figures Boolean Generate plots displaying the results of all topography analyses when True. By default, this is set to True.
plot_epigenomics Boolean Generate epigenomics heatmaps and occupancy plots when True. By default, this is set to False.
plot_nucleosome Boolean Generate nucleosome occupancy plots when True. By default, this is set to False.
plot_replication_time Boolean Generate replication timing plots when True. By default, this is set to False.
plot_strand_bias Boolean Generate replication strand asymmetry, transcription strand asymmetry, genic versus intergenic regions plots when True. By default, this is set to False.
plot_replication_strand_bias Boolean Generate replication strand asymmetry plots when True. By default, this is set to False.
plot_transcription_strand_bias Boolean Generate transcription strand asymmetry and genic versus intergenic regions plots when True. By default, this is set to False.
plot_processivity Boolean Generate strand-coordinated mutagenesis plots when True. By default, this is set to False.
step1_matgen_real_data Boolean Run SigProfilerMatrixGenerator to generate matrices for the real mutations when True. By default, this is set to True.
step2_gen_sim_data Boolean Run SigProfilerSimulator to generate simulated mutations when True. By default, this is set to True.
step3_matgen_sim_data Boolean Run SigProfilerMatrixGenerator to generate matrices for the simulated mutations when True. By default, this is set to True.
step4_merge_prob_data Boolean Merge real and simulated mutations with the probabilities files when True. By default, this is set to True.
step5_gen_tables Boolean Generate tables for providing information on mutational signatures, cutoffs, number of mutations and average probability when True. By default, this is set to True.
sbs_probabilities String The path to the probabilities matrix file. The probabilities matrix includes the probabilities of each mutation type in each sample. The first column lists all the samples, the second column lists all the mutation types, and the following columns list the calculated probability value for the respective SBS signatures where the sum of each row is 1. The probabilities file can be in SBS_6, SBS_24 SBS_96, SBS_192, SBS_288, SBS_384, SBS_1536, or SBS_6144 context produced by mutational signature extractor.
dbs_probabilities String The path to the probabilities matrix file. The probabilities matrix includes the probabilities of each mutation type in each sample. The first column lists all the samples, the second column lists all the mutation types, and the following columns list the calculated probability value for the respective DBS signatures where the sum of each row is 1. The probabilities file in DBS-78 context produced by mutational signature extractor.
id_probabilities String The path to the probabilities matrix file. The probabilities matrix includes the probabilities of each mutation type in each sample. The first column lists all the samples, the second column lists all the mutation types, and the following columns list the calculated probability value for the respective ID signatures where the sum of each row is 1. The probabilities file in ID-83 context produced by mutational signature extractor.
sbs_signatures String The path to the signatures matrix file. The signatures matrix contains the distribution of mutation types in the SBS mutational signatures. The first column lists all of the mutation types. e.g., There are 96 possible mutations that are considered for the SBS-96 context. The following columns are the SBS signatures. The sum of each column is 1, and each value in a column indicates the proportion of a mutational context in the signature.
dbs_signatatures String The path to the signatures matrix file. The signatures matrix contains the distribution of mutation types in the DBS mutational signatures. The first column lists all of the mutation types. e.g., There are 78 possible mutations that are considered for the DBS-78 context. The following columns are the DBS signatures. The sum of each column is 1, and each value in a column indicates the proportion of a mutational context in the signature.
id_signatures String The path to the signatures matrix file. The signatures matrix contains the distribution of mutation types in the ID mutational signatures. The first column lists all of the mutation types. e.g., There are 83 possible mutations that are considered for the ID-83 context. The following columns are the ID signatures. The sum of each column is 1, and each value in a column indicates the proportion of a mutational context in the signature.
sbs_activities String The path to the activities matrix file. The activity matrix for the selected SBS signatures. The first column lists all of the samples and the second and the following columns list the calculated activity value (number of mutations) for the respective SBS signatures.
dbs_activities String The path to the activities matrix file. The activity matrix for the selected DBS signatures. The first column lists all of the samples and the second and the following columns list the calculated activity value (number of mutations) for the respective DBS signatures.
id_activities String The path to the activities matrix file. The activity matrix for the selected ID signatures. The first column lists all of the samples and the second and the following columns list the calculated activity value (number of mutations) for the respective ID signatures.
verbose Boolean Set to True for detailed debugging messages. By default, this is set to False.
parallel_mode Boolean Set to True for running SigProfilerTopography using multiprocessing. By default, this is set to True.
plusorMinus_epigenomics Integer The number of bases considered before and after mutation start for epigenomics occupancy analysis.
plusorMinus_nucleosome Integer The number of bases considered before and after mutation start for nucleosome occupancy analysis.
epigenomicsheatmap
significance_level
Float Corrected p-values <= epigenomics_heatmap_significance_level are considered statistically significant. By default, this is set to 0.05.
fold_change_window_size Integer In epigenomics analysis, fold change of real versus simulated mutations is calculated for the window size centered at the mutation start. E.g., for window size of 100 bases, ± 50 bases are considered before and after mutation start. By default, this is set to 100.
num_of_avg_overlap_required Integer The minimum required average number of overlaps between the mutations and the regions outlined in the epigenomics files. By default, set to 100.
plot_detailedepigemomics
heatmaps
Boolean Plot detailed epigenomics heatmaps when True. By default, set to False.
remove_dna_elements_withall
nans_in_epigemomics_heatmaps
Boolean Remove the DNA elements from the epigenomics heatmap if no result exists. By default, set to True.
odds_ratio_cutoff Float Strand asymmetries with odd ratio >= odds_ratio_cutoff are shown in the strand asymmetry circle plots. By default, set to 1.1.
percentage_ofreal
mutations_cutoff
Float Strand asymmetries of the SBS signatures with percentage of the mutations >= percentage_of_real_mutations_cutoff are shown in the plots. By default, set to 5.
ylim_multiplier Float Multiply the y-axis view limits with ylim_multiplier in strand asymmetry bar plots. By default, set to 1.25.
processivityinter
mutational_distance
Integer Consecutive mutations with distance <= processivity_inter_mutational_distance are considered for the strand-coordinated mutagenesis. By default, set to 10000.
processivity_significance_level Float Corrected p-values <= processivity_significance_level are considered statistically significant for strand coordinated mutagenesis. By default, this is set to 0.05.
delete_chrbased_files Boolean To reduce the disk space usage of the tool, SigProfilerTopography deletes the chrbased files under outputDir/jobname/data/chrbased. By default, set to True.
exome Boolean SigProfilerSimulator simulates on the exome of the reference genome. By default, set to None.
updating Boolean SigProfilerSimulator updates the chromosome with each mutation. By default, set to False.
bed_file String SigProfilerSimulator simulates on custom regions of the genome. Requires the full path to the BED file. By default, set to None.
overlap Boolean SigProfilerSimulator allows overlapping of mutations along the chromosome. By default, set to False.
gender String SigProfilerSimulator simulates male or female genomes. By default, set to 'female'.
seed_file String SigProfilerSimulator uses this path to user defined seeds. One seed is required per processor. Uses a built in file by default. By default, this is set to None.
noisePoisson Boolean SigProfilerSimulator adds poisson noise to the simulations. By default, set to False.
noiseUniform Integer SigProfilerSimulator adds a noise dependent on a +/- allowance of noise (e.g., noiseUniform=5 allows +/-2.5% of mutations for each mutation type). By default, this is set to 0.
cushion Integer SigProfilerSimulator allows cushion when simulating on the exome or targetted panel. By default, this is set to 100 base pairs.
region String For SigProfilerSimulator. Path to targetted region panel for simulated on a user-defined region. Default is whole-genome simulations.
vcf Boolean SigProfilerSimulator outputs simulated samples as vcf files with one file per iteration per sample when True. SigProfilerSimulator outputs all samples from an iteration into a single maf file when False. By default, this is set to False.
mask String For SigProfilerSimulator. Path to probability mask file. A mask file format is tab-separated with the following required columns: Chromosome, Start, End, Probability. Note: Mask parameter does not support exome data where bed_file flag is set to true, and the following header fields are required: Chromosome, Start, End, Probability. By default, this is set to None.

SigProfilerTopography Output To learn about the output, please visit https://osf.io/5unby/wiki/home/

Replication Time Biosamples For GRCh37 and GRCh38, SigProfilerTopography provides Repli-seq files of the biosamples listed in the table below, which are the valid parameter values for replication_time_biosample.

Biosample Organism Tissue Cell Type Diseases
MCF7 human breast mammary Cancer
HEPG2 human liver liver cells (hepatocytes) Cancer
HELAS3 human cervix epithelial-like cervical cells Cancer
SKNSH human brain neuronal-like cells Cancer
K562 human bone marrow lymphoblast cells Cancer
IMR90 human lung fibroblast Normal
NHEK human skin keratinocyte Normal
BJ human skin fibroblast Normal
HUVEC human skin fibroblast Normal
BG02ES human early developmental stage of an embryo, not from a differentiated tissue embyronic stem cell None reported
GM12878 human blood B-Lymphocyte Normal
GM06990 human blood B-Lymphocyte Unknown
GM12801 human blood B-Lymphocyte Unknown
GM12812 human blood B-Lymphocyte Unknown
GM12813 human blood B-Lymphocyte Unknown
HEK293 human kidney embryonic kidney cells Normal
HCT116 human colon colorectal carcinoma cell Cancer
A549 human lung epithelial cell Cancer
CAKI2 human kidney papillary renal cell carcinoma cell Cancer
G401 human kidney epithelial kidney cells Cancer
T47D human breast; mammary gland epithelial cell Cancer
SKNMC human brain peripheral primitive neuroectodermal Cancer (Askin tumor)
NCIH460 human lung lung carcinoma cell Cancer

Citation

Otlu B, Alexandrov LB: Evaluating topography of mutational signatures with SigProfilerTopography. BioRxiv 2024, https://doi.org/10.1101/2024.01.08.574683.

Otlu B, Diaz-Gay M, Vermes I, Bergstrom EN, Zhivagui M, Barnes M, Alexandrov LB: Topography of mutational signatures in human cancer. Cell Rep 2023, https://doi.org/10.1016/j.celrep.2023.112930.

Copyright

This software and its documentation are copyright 2018 as a part of the SigProfiler project. The SigProfilerTopography framework is free software and is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

Contact Information

Please address any queries or bug reports to Burcak Otlu at burcakotlu@eng.ucsd.edu