signature.tools.lib R package

Table of content

Introduction to the package
Versions
How to cite us
Systems Requirements
Testing the package
How to use this package
Package documentation
Functions provided by the package
Command line scripts
Examples
Frequently Asked Questions

Introduction to the package

signature.tools.lib is an R package for mutational signatures analysis. Mutational signatures are patterns of somatic mutations that reveal what mutational processes have been active in a cell. Mutational processes can be caused by exposure to mutagens, such as chemicals present in cigarettes, or defects in DNA repair pathways, such as homologous recombination repair.

The package supports reference genomes hg19, hg38 and mm10. It provides our latest algorithms for signature fit and extraction, as well as various utility functions and the HRDetect pipeline. The list and description of the most important functions is given below.

Versions

2.4.5

fixes and improvements

2.4.4

replication and transcription strand bias functions

2.4.3

it is now possible to specify organ="Other" in FitMS and HRDetect pipeline

2.4.2

fixes and improvements

2.4.1

added command line script solutionSelectionForFitMS, for manual selection of alternative FitMS solutions

2.4.0

genomeChart and genomeChartSV functions added, as well as a genomeChart command line script
findKataegis function added
Introduced signature fit parameters to indicate minimum number of mutations required for exposures (threshold_nmuts and giniThresholdScaling_nmuts)
Added function assignSignatureProbabilityToMutations, which estimates the probability that a specific mutation originates from each of the fitted signatures

2.3.0

Added FitMS common signature tiers T1-T3. T1 fits organ-specific signatures, T2 fits the corresponding reference signatures. T2 is useful in organs where there are mixed organ-specific signatures (e.g. SBS1+18), so that the signatures composing the mix can be fitted separately (e.g. SBS1 and SBS18 instead of SBS1+18). T3 is a combination of T1 and T2, where organ-specific signatures are used and only the mixed signatures are replaced with the corresponding reference signatures
Extended the FitMS rare signature tiers, now T0-T4. This is mainly to allow users to fit also the QC amber and red signatures presented in Degasperi et al. 2022, Science
DBS common and rare signatures support added to FitMS

2.2.0

Added functions for rare signature extraction workflow
Added functions for simulating catalogues and for evaluating the performance of signature extractions and fits

2.1.2

Signature fit objects obtained from Fit or FitMS can be passed to the HRDetect pipeline

2.1.1

Updated the HRDetect_pipeline function to use the signatureFit_pipeline function

2.1.0

Added the signatureFit_pipeline function, which is a flexible interface for signature fit analysis
Added the signatureFit command line script, which is a wrapper for the signatureFit_pipeline function

2.0.1

Updated HRDetect pipeline to work with FitMS
Updated DNV catalogues functions

2.0

New signature fit multi-step algorithm, FitMS, from Degasperi et al. 2022, Science
New organ-specific signatures obtained using Genomics England Cancer data, from Degasperi et al. 2022, Science
Reorganisation of old signature fit functions under the new Fit function
Rewrite of plot scripts for displaying signature fit results, now available using plotFit and plotFitMS functions
Added support for trinucleotide variant catalogues
New export functions for converting Fit and FitMS results to JSON

1.0

Signature analysis functions for signature extraction and fit
Mutational signatures available are COSMICv2 and reference signatures from Degasperi et al. 2020, Nature Cancer
HRDetect pipeline available as command line script, includes HRDetect with bootstrap as published in Degasperi et al. 2020, Nature Cancer
Genome plot circos function (fork of Wellcome Sanger Institute genome plot function)
HRD indexes functions written by Nicolai Juul Birkbak

How to cite us

These are the research articles associated with signature.tools.lib:

A. Degasperi et al. Substitution mutational signatures in whole-genome-sequenced cancers in the UK population. Science, doi:10.1126/science.abl9283, 2022.
A. Degasperi et al. A practical framework and online tool for mutational signature analyses show intertissue variation and driver dependencies. Nature Cancer, doi:10.1038/s43018-020-0027-5, 2020.

More in details, these are the specific signatures and algorithms introduced in each publication:

Degasperi et al. 2022, Science: RefSig SBS v2, RefSig DBS v1, FitMS
Degasperi et al. 2020, Nature Cancer: RefSig SBS v1, RefSig Rearrangements (SV) v1, Fit with bootstrap, HRDetect with bootstrap.

Systems Requirements

No special hardware is required to run this software. This is an R package so it will work on any computer with R installed. We recommend to use the latest version of R, as some of the R dependencies constantly increase the minimum version of R required for installation.

Download the signature.tools.lib repository:

git clone https://github.com/Nik-Zainal-Group/signature.tools.lib.git
cd signature.tools.lib

You can install signature.tools.lib by entering the R environment from the main directory and typing:

install.packages("devtools")
devtools::install()

The above commands will take care of all the dependencies. In some cases, it may happen that some dependencies cannot be installed automatically. In this case, an error will indicate which package could not be found, then simply install these packages manually. The installation should not take more than a few minutes.

This is the full list of R package dependencies:

    VariantAnnotation,
    BSgenome.Hsapiens.UCSC.hg38,
    BSgenome.Hsapiens.1000genomes.hs37d5,
    BSgenome.Mmusculus.UCSC.mm10,
    BSgenome.Cfamiliaris.UCSC.canFam3,
    SummarizedExperiment,
    BiocGenerics,
    GenomeInfoDb,
    NMF,
    foreach,
    doParallel,
    lpSolve,
    methods,
    cluster,
    stats,
    NNLM,
    nnls,
    GenSA,
    gmp,
    plyr,
    RCircos,
    scales,
    GenomicRanges,
    IRanges,
    BSgenome, 
    readr,
    doRNG,
    combinat,
    limSolve,
    getopt,
    circlize

We have noticed that the NNLM package is frequently unavailable to download automatically during the R installation, so we have changed the installation process to install NNLM from the github repository linxihui/NNLM.

Testing the package

You can test the package by entering the package main directory and typing from the R environment:

devtools::test()

How to use this package

PLEASE NOTE: project-specific file conversions and filtering should be done before and outside the use of this library. This library should only report to the user the data format errors and suggest how to correct them, while it is the responsibility of the user to supply the necessary data formatted correctly with the necessary features and correct column names.

Package documentation

DOCUMENTATION: Documentation for each of the functions below is provided as R documentation, and it is installed along with the R package. The documentation should give detailed explanation of the input data required, such as a list of data frame columns and their explanation. To access the documentation you can use the ?function syntax in R, for each of the functions below. For example, in R or RStudio, type ?HRDetect_pipeline.

Functions provided by the package

Functions for single nucleotide variants:

vcfToSNVcatalogue(...): converts a VCF file into a SNV catalogue.
tabToSNVcatalogue(...): converts a data frame into a SNV catalogue. The data frame should have the following minimal columns: chr, position, REF, ALT.
plotSubsSignatures(...): function to plot one or more substitution signatures or catalogues.

Functions for rearrangements:

bedpeToRearrCatalogue(...): converts a data frame (loaded from a BEDPE file) into a rearrangement catalogue. If the columns is.clustered or svclass are not present, this function will attempt to compute them.
plotRearrSignatures(...): function to plot one or more rearrangement signatures or catalogues.

Functions for dinucleotide variants:

vcfToDNVcatalogue(...): given a vcf file containing a list of single and double nucleotide variants (SNVs and DNVs), this function initially attempts to find all possible DNVs by checking adjacent SNVs, and then annotates the DNVs and finally generates a DNV catalogue. The catalogue channels are in Zou's style.
tabToDNVcatalogue(...): same as vcfToDNVcatalogue(...) but takes a data frame as input.
convertToAlexandrovChannels(...): Function to convert DNV signatures or catalogues from Zou's to Alexandrov's style.
plotDNVSignatures(...): plot one or more DNV signatures or catalogues, compatible with both Zou's style and Alexandrov's style of channels.

Functions for signature extraction:

SignatureExtraction(...): perform signature extraction, by applying NMF to the input matrix. Multiple NMF runs and bootstrapping is used for robustness, followed by clustering of the solutions. A range of number of signatures to be used is required.
unexplainedSamples(...): identifies samples whose mutational catalogue cannot be fully explained by the extracted common signatures alone.
rareSignatureExtraction(...): extract rare signatures from samples using the samples identified with unexplainedSamples.
finaliseCommonRareSignatureExposures: this function fits all common and rare mutational signatures into the appropriate samples as identified using the unexplainedSamples and rareSignatureExtraction functions.
catalogueClustering(...): cluster mutational catalogues using hierarchical clustering and a cosine similarity distance. Can be used to cluster catalogues, signatures and residuals.

Functions for signature fit:

Fit(...): This is a standard interface for basic signature fit with/without bootstrap. The object returned by this function can be passed to the plotFit() function for automated plotting of the results.
FitMS(...): Given a set of mutational catalogues, this function will attempt fit mutational signature in a multi-step manner. In the first step, only a set of common signatures are fitted into the samples. In the following steps, one or more rare signatures are fitted into the samples in addition to the common signatures. Common and rare signatures can be determined automatically by providing the name of an organ, or can be supplied by the user. The object returned by this function can be passed to the plotFitMS() function for automated plotting of the results. A manual for FitMS can be found in the userManuals folder.
getSignaturesForFitting: given a tissue type/organ, this function returns the common and rare mutational signatures to be used with FitMS
signatureFit_pipeline: an interface for the Fit and FitMS functions, which aim to automate various signature fit analysis steps, like generating the mutational catalogues and selecting which mutational signatures to fit. This function can be accessed via command line using the signatureFit script in the scripts folder.
plotFitResults(...): this function can be used to plot result objects from both Fit and FitMS functions. The object type will be inferred automatically and either plotFit() or plotFitMS() will be used.
writeFitResultsToJSON(...): this function can be used to write to file the content of the result objects from both Fit and FitMS functions as a compressed JSON file.

Functions for organ-specific signatures and exposures conversion:

getOrganSignatures(...): returns organ-specific signatures of a requested organ and mutation type.
convertExposuresFromOrganToRefSigs(...): after signature fit using organ-specific signatures, use this function to apply the conversion matrix, to convert organ-specific signature exposures into reference signature exposures.

Functions for HRD indexes:

ascatToHRDLOH(...): compute the HRD-LOH index from a data frame formatted similarly to an ASCAT output file. This is a wrapper for the function calc.hrd, with a simplified interface, where only a data frame is requested.
plotCopyNumbers(...): Plot the copy numbers across the Chromosomes. Optionally, plot also highlight regions at the bottom. For example, could be used to highlight HRD-LOH regions..
calc.hrd(...): compute HRD-LOH (Loss of Heterozygosity), written by Nicolai Juul Birkbak
calc.ai(...): compute HRD-NtAI (Number of telomeric Allelic Imbalances), written by Nicolai Juul Birkbak
calc.lst(...): compute HRD-LST (Large-scale state transitions), written by Nicolai Juul Birkbak

Functions for indels classification:

vcfToIndelsClassification(...): converts a VCF file containing indels into a data frame where the indels are classified. Also returns a summary of count and proportion of the various classes of indels.
tabToIndelsClassification(...): converts a data frame containing indels into a data frame where the indels are classified. Also returns a summary of count and proportion of the various classes of indels.

Functions for HRDetect:

HRDetect_pipeline(...): this is a flexible interface to the HRDetect pipeline to compute the HRDetect BRCAness probability score, as published in Davies et al. 2017.
applyHRDetectDavies2017(...): given a data frame with samples as rows and features as columns, this function will compute the HRDetect BRCAness probability for each sample. The following six features should be included in the matrix: 1) proportion of deletions with micro-homology, 2) number of mutations of substitution signature 3, 3) number of mutations of rearrangement signature 3, 4) number of mutations of rearrangement signature 5, 5) HRD LOH index, 6) number of mutations of substitution signature 8.
plot_HRDLOH_HRDetect_Contributions(...): uses the HRDetect_pipeline output to generate a figure with three plots: the HDR-LOH index for each sample, the HRDetect BRCAness probability score for each sample, and the contribution of each of the six features to the HRDetect BRCAness probability score.
plot_HRDetect_overall(...): uses the HRDetect_pipeline output to generate an overall plot of the HRDetect BRCAness probability score.
plot_HRDetect_BootstrapScores(...): overall plot of scores obtained from the HRDetect_pipeline output when HRDetect with bootstrap is enabled.

Function for data visualisation:

genomePlot(...): generates a plot for the visualisation of somatic variants across the genome, organised in a circle. Variants plotted are single nucleotide variations (SNV), small insertions and deletions (indels), copy number variations (CNV) and rearrangements.
genomeChart(...): complete rewrite of the genomePlot function using the circlize R package
plotSignatures(...): this function will plot signatures or catalogues trying to identify the appropriate mutation type (SNV, DNV, SV...) from the input row names.

Functions for catalogue simulations and performance evaluation:

simpleCatalogueSimulation(...): simulates a dataset with the requested number of samples, given a set of common and rare mutational signatures. A matrix of mutational catalogues and corresponding signature exposures is generated. The functions saveSimulatetdData and loadSimulatedData can be used for saving and loading the simulated data object.
evaluatePerformanceSignatureSimilarity: compare a signatures matrix of true signatures with a matrix of estimated signatures, and compute a match between the signatures and their cosine similarity. Multiple matrices of estimated signatures can be compared to the true signatures at once using the evaluatePerformanceSignatureSimilarityList function.
evaluatePerformanceExposures(...): compare an exposure matrix of true exposures with a matrix of estimated exposures, and compute various metrics like sensitivity, specificity, F1 score and so on. Multiple matrices of estimated exposures can be compared to the true exposures at once using the evaluatePerformanceExposuresList function.

Command line scripts

We provide some command line scripts, which can be used instead of writing your own R code. Make sure you have installed signature.tools.lib before attempting to run them. You can find these scripts in the scripts directory.

Currently available scripts are:

signatureFit: mutational signatures analysis using Fit or FitMS. This is a wrapper for the signatureFit_pipeline R function.
hrDetect: HRDetect pipeline script. This is a wrapper for the HRDetect_pipeline R function.
genomeChart: visualisation of somatic variants using a circle diagram and other graphs like mutational catalogues.
solutionSelectionForFitMS: change selection criteria for FitMS alternative rare signature solutions and/or select solutions manually.

You can find user manuals for these command line scripts with detailed explanation of parameters and examples in the userManuals folder.

Examples

Test examples

A good place to look for examples is the tests/testthat/ directory, where for each function of the package we provide a test using test data. These are the tests that are run when running:

devtools::test()

Moreover, examples of typical workflows are given below.

Example 01: signature fit and HRDetect

In this example we illustrate a typical workflow for signature fit and HRDetect using two test samples. This code can be easily adapted to be used on your data, provided you have formatted the input data as described in the ?function documentation.

This example, along with expected output files, can be found in the examples/Example01/ directory.

We begin by setting variables containing file names. These should be vectors indexed by sample names.

#set directory to this file location
#
#import the package
library(signature.tools.lib)

#set sample names
sample_names <- c("sample1","sample2")

#set the file names. 
SNV_tab_files <- c("../../tests/testthat/test_hrdetect_1/test_hrdetect_1.snv.simple.txt",
                   "../../tests/testthat/test_hrdetect_2/test_hrdetect_2.snv.simple.txt")
SV_bedpe_files <- c("../../tests/testthat/test_hrdetect_1/test_hrdetect_1.sv.bedpe",
                    "../../tests/testthat/test_hrdetect_2/test_hrdetect_2.sv.bedpe")
Indels_vcf_files <- c("../../tests/testthat/test_hrdetect_1/test_hrdetect_1.indel.vcf.gz",
                      "../../tests/testthat/test_hrdetect_2/test_hrdetect_2.indel.vcf.gz")
CNV_tab_files <- c("../../tests/testthat/test_hrdetect_1/test_hrdetect_1.cna.txt",
                   "../../tests/testthat/test_hrdetect_2/test_hrdetect_2.cna.txt")

#name the vectors entries with the sample names
names(SNV_tab_files) <- sample_names
names(SV_bedpe_files) <- sample_names
names(Indels_vcf_files) <- sample_names
names(CNV_tab_files) <- sample_names

We can now load the SNV tab data and build the SNV mutational catalogues.

#load SNV data and convert to SNV mutational catalogues
SNVcat_list <- list()
for (i in 1:length(SNV_tab_files)){
  tmpSNVtab <- read.table(SNV_tab_files[i],sep = "\t",
                          header = TRUE,check.names = FALSE,
                          stringsAsFactors = FALSE)
  #convert to SNV catalogue, see ?tabToSNVcatalogue or
  #?vcfToSNVcatalogue for details
  res <- tabToSNVcatalogue(subs = tmpSNVtab,genome.v = "hg19")
  colnames(res$catalogue) <- sample_names[i]
  SNVcat_list[[i]] <- res$catalogue
}
#bind the catalogues in one table
SNV_catalogues <- do.call(cbind,SNVcat_list)

At this point you can plot the mutational catalogues and compare them with the expected output files in the Example01 directory.

#the catalogues can be plotted as follows
plotSubsSignatures(signature_data_matrix = SNV_catalogues,
                   plot_sum = TRUE,output_file = "SNV_catalogues.pdf")

We can now perform signature fit analysis, i.e. identify which mutational signatures are present in the sample. We suggest to use a small a priori set of signatures, for example here we use the signatures identified in breast cancer.

#fit the 12 breast cancer signatures using the bootstrap signature fit approach
sigsToUse <- c(1,2,3,5,6,8,13,17,18,20,26,30)
subs_fit_res <- Fit(catalogues = SNV_catalogues,
                    signatures = COSMIC30_subs_signatures[,sigsToUse],
                    useBootstrap = TRUE,
                    nboot = 100,
                    nparallel = 4)
plotFit(subs_fit_res,outdir = "signatureFit/")

#The signature exposures can be found here and correspond to the median
#of the bootstrapped runs followed by false positive filters. See
#?Fit for details
snv_exp <- subs_fit_res$exposures

In this case we used the Fit function with useBootstrap=TRUE, which uses the bootstrap fitting method, and the plotFit function, which provides several plots that can be compared to the expected output plots in the Example01 directory.

Finally, we can apply HRDetect on these two samples. Notice that The HRDetect pipeline allows us to specify the amount of SNV Signature 3 and 8 that we have already estimated above, so we only need to supply the additional files to compute indels classification, rearrangements and the copy number based score HRD-LOH. Please see the documentation in ?HRDetect_pipeline for more details.

#The HRDetect pipeline will compute the HRDetect probability score for
#the samples to be Homologous Recombination Deficient. HRDetect is a
#logistic regression classifier that requires 6 features to compute the
#probability score. These features can be supplied directly in an input
#matrix, or pipeline can compute these features for you if you supply
#the file names. It is possible to supply a mix of features and file
#names.

#Initialise feature matrix
col_hrdetect <- c("del.mh.prop", "SNV3", "SV3", "SV5", "hrd", "SNV8")
input_matrix <- matrix(NA,nrow = length(sample_names),
                       ncol = length(col_hrdetect),
                       dimnames = list(sample_names,col_hrdetect))

#We have already quantified the amount of SNV signatures in the samples,
#so we can supply these via the input matrix
input_matrix[rownames(snv_exp),"SNV3"] <- snv_exp[,"Signature3"]
input_matrix[rownames(snv_exp),"SNV8"] <- snv_exp[,"Signature8"]

#run the HRDetect pipeline, for more information see ?HRDetect_pipeline
res <- HRDetect_pipeline(input_matrix,
                         genome.v = "hg19",
                         signature_type = "COSMICv2",
                         SV_bedpe_files = SV_bedpe_files,
                         Indels_vcf_files = Indels_vcf_files,
                         CNV_tab_files = CNV_tab_files,
                         nparallel = 2)

#save HRDetect scores
writeTable(res$hrdetect_output,file = "HRDetect_res.tsv")

Also in this case, you can compare your output with the expected output in the Example01 directory.

Example 02: signature fit of organ-specific signatures

In this example we illustrate a typical workflow for signature fit using organ-specific signatures. This code can be easily adapted to be used on your data, provided you have formatted the input data as described in the ?function documentation.

This example, along with expected output files, can be found in the examples/Example02/ directory.

We begin by setting variables containing file names. These should be vectors indexed by sample names.

#set directory to this file location

#import the package
library(signature.tools.lib)

#set sample names
sample_names <- c("sample1","sample2")

#set the file names. 
SNV_tab_files <- c("../../tests/testthat/test_hrdetect_1/test_hrdetect_1.snv.simple.txt",
                   "../../tests/testthat/test_hrdetect_2/test_hrdetect_2.snv.simple.txt")

#name the vectors entries with the sample names
names(SNV_tab_files) <- sample_names

We can now load the SNV tab data and build the SNV mutational catalogues.

#load SNV data and convert to SNV mutational catalogues
SNVcat_list <- list()
for (i in 1:length(SNV_tab_files)){
  tmpSNVtab <- read.table(SNV_tab_files[i],sep = "\t",header = TRUE,
                          check.names = FALSE,stringsAsFactors = FALSE)
  #convert to SNV catalogue, see ?tabToSNVcatalogue or ?vcfToSNVcatalogue for details
  res <- tabToSNVcatalogue(subs = tmpSNVtab,genome.v = "hg19")
  colnames(res$catalogue) <- sample_names[i]
  SNVcat_list[[i]] <- res$catalogue
}
#bind the catalogues in one table
SNV_catalogues <- do.call(cbind,SNVcat_list)

Then we can plot the catalogues, in this case using pdf format.

#the catalogues can be plotted as follows
plotSubsSignatures(signature_data_matrix = SNV_catalogues,
                   plot_sum = TRUE,output_file = "SNV_catalogues.pdf")

Than we can perform signature fit using organ specific signatures, here breast cancer signatures. Typing ?getOrganSignatures will show more information, for example which organs are available.

#fit the organ-specific breast cancer signatures using the bootstrap signature fit approach
sigsToUse <- getOrganSignatures("Breast",typemut = "subs")
subs_fit_res <- Fit(catalogues = SNV_catalogues,
                    signatures = sigsToUse,
                    useBootstrap = TRUE,
                    nboot = 100,
                    nparallel = 4)
plotFit(subs_fit_res,outdir = "signatureFit/")

#The signature exposures can be found here and correspond to the median of the bootstrapped
#runs followed by false positive filters. See ?Fit for details
snv_exp <- subs_fit_res$exposures

We can now convert the organ-specific signature exposures into reference signature exposures. This allows us to compare results across different organs, and to perform additional analysis such as HRDetect, where RefSig 3 and RefSig 8 exposures can be used in place of COSMIC signatures 3 and 8 exposures.

#Convert the organ-specific signature exposures into reference signature exposures
snv_exp <- convertExposuresFromOrganToRefSigs(expMatrix = t(snv_exp[,1:(ncol(snv_exp)-1)]),typemut = "subs")

#write the results
writeTable(snv_exp,"RefSigSubsExposures.tsv")

Example 03: signature fit using FitMS

In this example we show how to use the multi-step signature fit function along with the Gini-based exposure filter.

Similarly to Examples 01 and 02, we construct the mutational catalogues and then we just need to specify the organ of interest:

#perform signature fit using a multi-step approach where organ-specific
#common and rare signatures are used
subs_fit_res <- FitMS(catalogues = SNV_catalogues,
                      exposureFilterType = "giniScaledThreshold",
                      useBootstrap = TRUE,
                      organ = "Breast")
plotFitMS(subs_fit_res,outdir = "signatureFit/")

The function FitMS will select automatically the common and rare signatures to use according to the specified organ. As a first step it will fit the common signatures, and as a second step it will attempt to determine the presence of the rare signatures.

In this example we have also specified to use bootstrap and to use the giniScaledThreshold exposure filter method. Results can be plotted with the plotFitMS function.

In addition, it is possible to save all the data in the subs_fit_res object as a JSON file:

writeFitResultsToJSON(fitObj = subs_fit_res,
                      filename = "FitMSresults.json")

A manual for FitMS can be found in the userManuals folder.

Example 04: signature fit using signatureFit_pipeline

Here we provide an example of using the signatureFit_pipeline function introduced in v2.1.0. This function is meant to automate some recurrent tasks in signature analysis, such as catalogues generation and selection of signatures to fit.

Similarly to the examples above, we store the location of the files containing the single nucleotide variants into the SNV_tab_files variable:

#set sample names
sample_names <- c("sample1","sample2")
#set the file names.
SNV_tab_files <- c("../../tests/testthat/test_hrdetect_1/test_hrdetect_1.snv.simple.txt",
                   "../../tests/testthat/test_hrdetect_2/test_hrdetect_2.snv.simple.txt")
#name the vectors entries with the sample names
names(SNV_tab_files) <- sample_names

Then, we simply call the signatureFit_pipeline function:

pipeline_subs_res <- signatureFit_pipeline(SNV_tab_files = SNV_tab_files,
                                           organ = "Breast",genome.v = "hg19",
                                           fit_method = "FitMS",nparallel = 2)

In one function call, the SNV catalogues have been generated, and FitMS has been applied. We can now save the catalogues, as well as saving the annotated mutations and plotting the results from FitMS:

#save catalogues plots
plotSignatures(pipeline_subs_res$catalogues,
               output_file = "SNV_catalogues.pdf",
               ncolumns = 2)
#write annotated SNVs to file
writeTable(pipeline_subs_res$annotated_mutations,"annotated_SNVs.tsv")
#plot the FitMS results
plotFitResults(pipeline_subs_res$fitResults,outdir = "signatureFit/")

The function plotSignatures will infer the type of mutations and select the correct signatures plot function. In this case, plotSignatures will notice that these are SNV catalogues and use the plotSubsSignatures function. Moreover, the function plotFitResults will infer the fit method (Fit or FitMS) and use the appropriate plot function, in this case plotFitMS.

The signatureFit script that can be found in the scripts folder is a command line wrapper for the signatureFit_pipeline function.

Example 05: common and rare signature extraction workflow

In this example, we provide an example of signature extraction workflow where both common and rare signatures are extracted. For more details, please see the rare signatures extraction manual in the userManuals folder.

We being by importing the package and setting an output directory where we will store all the results.

#import the package
library(signature.tools.lib)
# set an output directory
outdir <- "~/Example05/"
dir.create(outdir,showWarnings = F,recursive = T)

We load the catalogue and perform a clustering to see if samples with rare profiles can be identified and excluded from the extraction of common signatures.

# these are simulated data with 100 catalogues (9 common and 2 rare signatures)
catalogues <- readTable("tests/testthat/rareExtraction/SDExample05/catalogues.tsv")
# we begin by exploring the data using clustering
resCl <- cataloguesClustering(catalogues,nclusters = 1:15,
                              outdir = paste0(outdir,"cataloguesClustering/"))

After inspection of the clustering results, we can clearly identify one cluster with just two samples that we can exclude (nclusters=5, remove cluster 5). Then, we perform a signature extraction using NMF on the remaining samples.

# Let's try to remove samples that may have a rare signatures first
nclustersSelection <- "5"
clusters_to_keep <- c(1:4)
samplesSelected <- rownames(resCl$clusters_table)[resCl$clusters_table[,nclustersSelection] %in% clusters_to_keep]
#now extract common signatures
SignatureExtraction(cat = catalogues[,samplesSelected],
                    outFilePath = paste0(outdir,"Extraction/"),
                    nrepeats = 25,nboots = 4,filterBestOfEachBootstrap = T,
                    nparallel = 4,nsig = 8:11,plotResultsFromAllClusteringMethods = F,
                    parallel = T)

We now need to decide how many signatures are present, based on the error and the average silhouette width. In this case, 9 or 10 signatures seem optimal. Let us choose 9 here, though feel free to try 10 as well and see how the results are affected downstream.

# load the optimal signatures (9)
estimated_signatures <- readTable(paste0(outdir,"Extraction/round_1/sig_9/Sigs_plot_extraction_ns9_nboots4.tsv"))

Now we have a set of common signatures and we want to find whether some samples are still not fully explained by these signatures and see if we can extract rare signatures out of them.

# is there any sample that was not fully explained by the signatures we found?
unexplSamples <- unexplainedSamples(outfileRoot = paste0(outdir,"unexplained/Example05"),
                                    catalogues = catalogues,
                                    sigs = estimated_signatures,
                                    nmuts_threshold = 300,
                                    pvalue_threshold = 0.15)

Indeed there are four samples that are not fully explained. As there is some randomness you might have to tune the nmuts_threshold and pvalue_threshold parameters to select them. Use the Example05_UnexplainedCataloguesSelectionPlot.pdf scatter plot in the unexplained folder to guide you.

We cluster the residuals of the four samples, and we identify two clusters of two samples, each of them having very similar residuals. We can use the samples in these clusters to then extract two rare mutational signatures.

# yes, 4 samples, get their residual
significant_residuals <- unexplSamples$all_residuals[,unexplSamples$which_significant]
# now we cluster the residuals
resCl_residuals <- cataloguesClustering(significant_residuals,
                                        nclusters = 1:3,
                                        outdir = paste0(outdir,"residualClustering/"))
# we decide that there are two clusters, having two samples in each
# now we need to extract the rare signatures
resRareSigs <- rareSignatureExtraction(outfileRoot = paste0(outdir,"ExtractionRare/Example05"),
                                       catalogues = catalogues,
                                       residuals = unexplSamples$all_residuals,
                                       unexpl_samples = unexplSamples$unexplSamples,
                                       clusters = resCl_residuals$clusters_table[,"2"],
                                       useclusters = list(c(1),c(2)),
                                       commonSignatures = estimated_signatures,
                                       commonExposures = unexplSamples$exposures)

Finally, we can use all the common and rare signatures, as well as the information about where we extracted the rare signatures from, to estimate the exposures of each signature in each sample.

# now let's get the exposures
resFinalExpo <- finaliseCommonRareSignatureExposures(outfileRoot = paste0(outdir,"ExtractionRare/Example05"),
                                                     catalogues = catalogues,
                                                     commonSigs = estimated_signatures,
                                                     listofsignatures = resRareSigs$listofsignatures,
                                                     listofsamples = resRareSigs$listofsamples,
                                                     nboot = 50,nparallel = 4)

Frequently Asked Questions

Can I use your package on my Whole Exome Sequenced data?

Mostly, no. All algorithms included in this package, from signature extraction, to signature fit, to HRDetect, have been developed and tested using only Whole Genome Sequenced data, and are meant to be used only on Whole Genome Sequenced data. These are some of the reasons why:

WES have 100x less mutations per sample than WGS
Using our bootstrap based fitting on WES is not ideal, because bootstrapping WES data can change the profile a lot
Fitting WGS signatures into WES samples, where the frequency of the trinucleotide contexts are different can also affect the results
You can convert the WGS signatures into WES by adjusting for trinucleotide context frequency, but the result is far from ideal. It would be better to have signatures extracted from WES to fit into WES samples
If you then plan to extract WES signatures, this may also be challenging possibly avoiding bootstrapping here as well, and most likely requiring a very large number of WES samples, compared to how many WGS samples are needed to extract WGS signatures
In the case of HRDetect, the parameters of the classifier have been tuned to Whole Genome data, so it would require retraining using WES data. We did that, and while the performance of a WES HRD classifier was not too bad on breast cancer, we simply discourage to use this for other tissue types. It is just bound to be very unreliable, because the classification would be supported mainly by signature 3 exposures and the number of deletions with microhomology. Fitting signature 3 to WES can be very unreliable, especially in non breast cancer samples, and also the number of indels is usually very low for each WES sample, so one false positive or false negative deletion can make the difference. On top of this, without a standardised data quality control, mutation calling and filtering, there are just too many variables to make this worthwhile, in our opinion. On the other hand, we have plenty of evidence that HRDetect, trained on WGS data using six different mutational signatures, is robust and reliable.

This said, you may still achieve something decent if you limit yourself to breast cancer, or other organs where we know fairly well what signatures to expect. And you should be able to spot most MMR samples and other hypermutated samples, simply because they have a much higher number of SNVs and Indels.

In R/HRDetect.R, the function applyHRDetectDavies2017 seems to have mean/sd values hardcoded for each input feature. This results in input data not being standardised to mean=0/sd=1. Is this intentional? I'm assuming these values come from the results for the 371 samples in the 2017 paper?

Yes it is intentional. In order for the parameters of the linear model to be meaningful the new input data has to be processed in the same way as the training data, which means that the same mean and sd of the training data should be provided. Please check the R documentation of the function for the details of how to input your data.

We have a set of samples from non-BRCA cancers, sequenced on a Novaseq, and aligned/called through our in-house pipeline. Are these differences in sequencing/processing large enough to render the parameters of the HRDetect model meaningless?

Before running HRDetect we make sure that the data is of a good quality and try to reduce artefacts and other false positives. We mostly worked with Sanger cgp pipeline, and also adapted output from other pipelines sometimes with filters to get the same level of specificity, and found that HRDetect is quite robust. What I mean is that rather than retraining HRDetect for different pipelines, we usually prefer to work on the pipeline until we are happy with the calls. I can advise to have a look at samples in the same tumour type from highly curated resources like PCAWG. You can browse some at https://signal.mutationalsignatures.com.

Should I use COSMIC signatures or tissue-specific signatures with my data? Which will give more accurate signature assignment?

In general, we expect tissue-specific signatures to be more accurate, as we have also demonstrated using simulations in Degasperi et al. 2022, Science, Figure S53, though in practice it may depend on the tumour type. Some of the tumour types have more reliable signatures than others (e.g. due to sample size, number of signatures present, similarity between signatures present...). Please double check on SIGNAL that you are happy with the signatures of your cancer type. From version 2.0 of signature.tools.lib, the signature fit algorithm FitMS automatically selects suitable common and rare signatures for fitting if the parameter organ is used. From version 2.1.0, the signatureFit_pipeline also allows the automatic selection of signatures to fit based on tumour types, and also allows to request COSMIC signatures. If COSMIC signatures are used, we advise to select a subset of the COSMIC signatures that are believed to be present in the tumour type to be analysed, rather than attempting to fit all signatures at once.

How do I interpret the Overall Metrics plot that I obtain from SignatureExtraction? What are all these metrics?

During the investigation of signature extraction methods, we have defined multiple useful metrics. Some of these were not reported in Degasperi et al. 2020, Nature Cancer, as we did not find them essential. The metrics are:

norm.Error: this is the normalised average error between the model and the bootstrapped catalogue. This error is always decreasing as the number of signatures extracted increases.
norm.Error (orig. cat.): this is the normalised average error between the model and the original catalogue. This error will eventually increase when the number of signatures used is too large, as the algorithm overfits the bootstrapped catalogues.
Ave.SilWid: this is the average silhouette width, which indicates a good the clustering is. The value will be affected by the clustering algorithm used. The Ave.SilWid will begin decreasing when highly similar signatures are extracted, possibly indicating that one is trying to use more signatures than there are actually present in the data.
mmcs: minimum medoid cosine similarity. This indicates the cosine similarity between the two most similar medoids of the clusters. This is in fact the similarity of the two most similar signatures.
min.Min.WCCS: minimum of the minimum within cluster cosine similarity. This is a goodness of clustering metric. We compute the cosine similarity between the members of each cluster, and for each cluster we report the minimum cosine similarity. Finally we report the minimum of the minimum cosine similarities, which indicates how different the signatures can be in the least homogeneous (most spread) cluster. This value becomes low when at least one cluster is beginning to include very different signatures, possibly indicating complex solutions (see PCA rings in Extended Data Figure 1, in Degasperi et al. 2020, Nature Cancer).
max.Max.BCCS: maximum of the maximum between cluster cosine similarity. This is a goodness of clustering metric. We compute the cosine similarity between the members of different clusters, and for each cluster pairs we report the maximum cosine similarity. Finally we report the maximum of the maximum cosine similarities, which indicates how similar the signatures can be when they belong to different clusters. This value becomes high when two clusters are beginning to include very similar signatures, possibly indicating complex solutions (see PCA rings in Extended Data Figure 1, in Degasperi et al. 2020, Nature Cancer).

While it can be interesting to look at all these metrics, we recommend to mostly use norm.Error (orig. cat.) and Ave.SilWid, using a suitable number of bootstraps and repeats and with the Clustering with Matching (MC) clustering algorithm, as described in Degasperi et al. 2020, Nature Cancer.

What parameters should I use for signature extraction?

It depends on how many samples you have, with more samples requiring more repeats. In general, we advise to use 20 bootstraps, 200 repeats, the clustering with matching algorithm (CM), the KLD objective function (nmfmethod brunet) and RTOL=0.001. The number of repeats (nrepeats) is more important than the number of bootstraps, because at least 100 repeats are necessary to obtain enough results to then select best runs using the RTOL threshold (Extended Data Figure 1 of Degasperi et al. 2020, Nature Cancer). The disadvantage of this method is that this can increase the computation time required significantly if one uses more than 500/1000 samples.

Where do I find the activity (exposures) matrix after running the signature extraction?

We do not report the activity matrix from the extraction, only the signatures. We then use a signature fit procedure (here the function Fit or FitMS) to estimate the activities. We tend to think of the estimation of the activities as a separate procedure, performed holding a given set of a priori signatures fixed. This allows for more playing around with the signature fit, for example excluding clear false positive signatures from the fit.

Which Structural Variants caller and filters should I use?

Our team has been using mostly Manta and Brass. For Manta, we tend to tune false positives using the PR column.

https://github.com/Illumina/manta/tree/master/docs/userGuide

After running Manta you will have to convert your output to bedpe using a vcf to bedpe converter. There are a few out there, you can try this one:

https://github.com/arq5x/lumpy-sv/blob/master/scripts/vcfToBedpe

Nik-Zainal-Group / signature.tools.lib

readme