Nik-Zainal-Group / signature.tools.lib

R package containing useful functions for mutational signature analysis
Other
80 stars 26 forks source link

signature.tools.lib R package

Table of content

Introduction to the package

signature.tools.lib is an R package for mutational signatures analysis. Mutational signatures are patterns of somatic mutations that reveal what mutational processes have been active in a cell. Mutational processes can be caused by exposure to mutagens, such as chemicals present in cigarettes, or defects in DNA repair pathways, such as homologous recombination repair.

The package supports reference genomes hg19, hg38 and mm10. It provides our latest algorithms for signature fit and extraction, as well as various utility functions and the HRDetect pipeline. The list and description of the most important functions is given below.

Versions

2.4.5

2.4.4

2.4.3

2.4.2

2.4.1

2.4.0

2.3.0

2.2.0

2.1.2

2.1.1

2.1.0

2.0.1

2.0

1.0

How to cite us

These are the research articles associated with signature.tools.lib:

More in details, these are the specific signatures and algorithms introduced in each publication:

Systems Requirements

No special hardware is required to run this software. This is an R package so it will work on any computer with R installed. We recommend to use the latest version of R, as some of the R dependencies constantly increase the minimum version of R required for installation.

Download the signature.tools.lib repository:

git clone https://github.com/Nik-Zainal-Group/signature.tools.lib.git
cd signature.tools.lib

You can install signature.tools.lib by entering the R environment from the main directory and typing:

install.packages("devtools")
devtools::install()

The above commands will take care of all the dependencies. In some cases, it may happen that some dependencies cannot be installed automatically. In this case, an error will indicate which package could not be found, then simply install these packages manually. The installation should not take more than a few minutes.

This is the full list of R package dependencies:

    VariantAnnotation,
    BSgenome.Hsapiens.UCSC.hg38,
    BSgenome.Hsapiens.1000genomes.hs37d5,
    BSgenome.Mmusculus.UCSC.mm10,
    BSgenome.Cfamiliaris.UCSC.canFam3,
    SummarizedExperiment,
    BiocGenerics,
    GenomeInfoDb,
    NMF,
    foreach,
    doParallel,
    lpSolve,
    methods,
    cluster,
    stats,
    NNLM,
    nnls,
    GenSA,
    gmp,
    plyr,
    RCircos,
    scales,
    GenomicRanges,
    IRanges,
    BSgenome, 
    readr,
    doRNG,
    combinat,
    limSolve,
    getopt,
    circlize

We have noticed that the NNLM package is frequently unavailable to download automatically during the R installation, so we have changed the installation process to install NNLM from the github repository linxihui/NNLM.

Testing the package

You can test the package by entering the package main directory and typing from the R environment:

devtools::test()

How to use this package

PLEASE NOTE: project-specific file conversions and filtering should be done before and outside the use of this library. This library should only report to the user the data format errors and suggest how to correct them, while it is the responsibility of the user to supply the necessary data formatted correctly with the necessary features and correct column names.

Package documentation

DOCUMENTATION: Documentation for each of the functions below is provided as R documentation, and it is installed along with the R package. The documentation should give detailed explanation of the input data required, such as a list of data frame columns and their explanation. To access the documentation you can use the ?function syntax in R, for each of the functions below. For example, in R or RStudio, type ?HRDetect_pipeline.

Functions provided by the package

Functions for single nucleotide variants:

Functions for rearrangements:

Functions for dinucleotide variants:

Functions for signature extraction:

Functions for signature fit:

Functions for organ-specific signatures and exposures conversion:

Functions for HRD indexes:

Functions for indels classification:

Functions for HRDetect:

Function for data visualisation:

Functions for catalogue simulations and performance evaluation:

Command line scripts

We provide some command line scripts, which can be used instead of writing your own R code. Make sure you have installed signature.tools.lib before attempting to run them. You can find these scripts in the scripts directory.

Currently available scripts are:

You can find user manuals for these command line scripts with detailed explanation of parameters and examples in the userManuals folder.

Examples

Test examples

A good place to look for examples is the tests/testthat/ directory, where for each function of the package we provide a test using test data. These are the tests that are run when running:

devtools::test()

Moreover, examples of typical workflows are given below.

Example 01: signature fit and HRDetect

In this example we illustrate a typical workflow for signature fit and HRDetect using two test samples. This code can be easily adapted to be used on your data, provided you have formatted the input data as described in the ?function documentation.

This example, along with expected output files, can be found in the examples/Example01/ directory.

We begin by setting variables containing file names. These should be vectors indexed by sample names.

#set directory to this file location
#
#import the package
library(signature.tools.lib)

#set sample names
sample_names <- c("sample1","sample2")

#set the file names. 
SNV_tab_files <- c("../../tests/testthat/test_hrdetect_1/test_hrdetect_1.snv.simple.txt",
                   "../../tests/testthat/test_hrdetect_2/test_hrdetect_2.snv.simple.txt")
SV_bedpe_files <- c("../../tests/testthat/test_hrdetect_1/test_hrdetect_1.sv.bedpe",
                    "../../tests/testthat/test_hrdetect_2/test_hrdetect_2.sv.bedpe")
Indels_vcf_files <- c("../../tests/testthat/test_hrdetect_1/test_hrdetect_1.indel.vcf.gz",
                      "../../tests/testthat/test_hrdetect_2/test_hrdetect_2.indel.vcf.gz")
CNV_tab_files <- c("../../tests/testthat/test_hrdetect_1/test_hrdetect_1.cna.txt",
                   "../../tests/testthat/test_hrdetect_2/test_hrdetect_2.cna.txt")

#name the vectors entries with the sample names
names(SNV_tab_files) <- sample_names
names(SV_bedpe_files) <- sample_names
names(Indels_vcf_files) <- sample_names
names(CNV_tab_files) <- sample_names

We can now load the SNV tab data and build the SNV mutational catalogues.

#load SNV data and convert to SNV mutational catalogues
SNVcat_list <- list()
for (i in 1:length(SNV_tab_files)){
  tmpSNVtab <- read.table(SNV_tab_files[i],sep = "\t",
                          header = TRUE,check.names = FALSE,
                          stringsAsFactors = FALSE)
  #convert to SNV catalogue, see ?tabToSNVcatalogue or
  #?vcfToSNVcatalogue for details
  res <- tabToSNVcatalogue(subs = tmpSNVtab,genome.v = "hg19")
  colnames(res$catalogue) <- sample_names[i]
  SNVcat_list[[i]] <- res$catalogue
}
#bind the catalogues in one table
SNV_catalogues <- do.call(cbind,SNVcat_list)

At this point you can plot the mutational catalogues and compare them with the expected output files in the Example01 directory.

#the catalogues can be plotted as follows
plotSubsSignatures(signature_data_matrix = SNV_catalogues,
                   plot_sum = TRUE,output_file = "SNV_catalogues.pdf")

We can now perform signature fit analysis, i.e. identify which mutational signatures are present in the sample. We suggest to use a small a priori set of signatures, for example here we use the signatures identified in breast cancer.

#fit the 12 breast cancer signatures using the bootstrap signature fit approach
sigsToUse <- c(1,2,3,5,6,8,13,17,18,20,26,30)
subs_fit_res <- Fit(catalogues = SNV_catalogues,
                    signatures = COSMIC30_subs_signatures[,sigsToUse],
                    useBootstrap = TRUE,
                    nboot = 100,
                    nparallel = 4)
plotFit(subs_fit_res,outdir = "signatureFit/")

#The signature exposures can be found here and correspond to the median
#of the bootstrapped runs followed by false positive filters. See
#?Fit for details
snv_exp <- subs_fit_res$exposures

In this case we used the Fit function with useBootstrap=TRUE, which uses the bootstrap fitting method, and the plotFit function, which provides several plots that can be compared to the expected output plots in the Example01 directory.

Finally, we can apply HRDetect on these two samples. Notice that The HRDetect pipeline allows us to specify the amount of SNV Signature 3 and 8 that we have already estimated above, so we only need to supply the additional files to compute indels classification, rearrangements and the copy number based score HRD-LOH. Please see the documentation in ?HRDetect_pipeline for more details.

#The HRDetect pipeline will compute the HRDetect probability score for
#the samples to be Homologous Recombination Deficient. HRDetect is a
#logistic regression classifier that requires 6 features to compute the
#probability score. These features can be supplied directly in an input
#matrix, or pipeline can compute these features for you if you supply
#the file names. It is possible to supply a mix of features and file
#names.

#Initialise feature matrix
col_hrdetect <- c("del.mh.prop", "SNV3", "SV3", "SV5", "hrd", "SNV8")
input_matrix <- matrix(NA,nrow = length(sample_names),
                       ncol = length(col_hrdetect),
                       dimnames = list(sample_names,col_hrdetect))

#We have already quantified the amount of SNV signatures in the samples,
#so we can supply these via the input matrix
input_matrix[rownames(snv_exp),"SNV3"] <- snv_exp[,"Signature3"]
input_matrix[rownames(snv_exp),"SNV8"] <- snv_exp[,"Signature8"]

#run the HRDetect pipeline, for more information see ?HRDetect_pipeline
res <- HRDetect_pipeline(input_matrix,
                         genome.v = "hg19",
                         signature_type = "COSMICv2",
                         SV_bedpe_files = SV_bedpe_files,
                         Indels_vcf_files = Indels_vcf_files,
                         CNV_tab_files = CNV_tab_files,
                         nparallel = 2)

#save HRDetect scores
writeTable(res$hrdetect_output,file = "HRDetect_res.tsv")

Also in this case, you can compare your output with the expected output in the Example01 directory.

Example 02: signature fit of organ-specific signatures

In this example we illustrate a typical workflow for signature fit using organ-specific signatures. This code can be easily adapted to be used on your data, provided you have formatted the input data as described in the ?function documentation.

This example, along with expected output files, can be found in the examples/Example02/ directory.

We begin by setting variables containing file names. These should be vectors indexed by sample names.

#set directory to this file location

#import the package
library(signature.tools.lib)

#set sample names
sample_names <- c("sample1","sample2")

#set the file names. 
SNV_tab_files <- c("../../tests/testthat/test_hrdetect_1/test_hrdetect_1.snv.simple.txt",
                   "../../tests/testthat/test_hrdetect_2/test_hrdetect_2.snv.simple.txt")

#name the vectors entries with the sample names
names(SNV_tab_files) <- sample_names

We can now load the SNV tab data and build the SNV mutational catalogues.

#load SNV data and convert to SNV mutational catalogues
SNVcat_list <- list()
for (i in 1:length(SNV_tab_files)){
  tmpSNVtab <- read.table(SNV_tab_files[i],sep = "\t",header = TRUE,
                          check.names = FALSE,stringsAsFactors = FALSE)
  #convert to SNV catalogue, see ?tabToSNVcatalogue or ?vcfToSNVcatalogue for details
  res <- tabToSNVcatalogue(subs = tmpSNVtab,genome.v = "hg19")
  colnames(res$catalogue) <- sample_names[i]
  SNVcat_list[[i]] <- res$catalogue
}
#bind the catalogues in one table
SNV_catalogues <- do.call(cbind,SNVcat_list)

Then we can plot the catalogues, in this case using pdf format.

#the catalogues can be plotted as follows
plotSubsSignatures(signature_data_matrix = SNV_catalogues,
                   plot_sum = TRUE,output_file = "SNV_catalogues.pdf")

Than we can perform signature fit using organ specific signatures, here breast cancer signatures. Typing ?getOrganSignatures will show more information, for example which organs are available.

#fit the organ-specific breast cancer signatures using the bootstrap signature fit approach
sigsToUse <- getOrganSignatures("Breast",typemut = "subs")
subs_fit_res <- Fit(catalogues = SNV_catalogues,
                    signatures = sigsToUse,
                    useBootstrap = TRUE,
                    nboot = 100,
                    nparallel = 4)
plotFit(subs_fit_res,outdir = "signatureFit/")

#The signature exposures can be found here and correspond to the median of the bootstrapped
#runs followed by false positive filters. See ?Fit for details
snv_exp <- subs_fit_res$exposures

We can now convert the organ-specific signature exposures into reference signature exposures. This allows us to compare results across different organs, and to perform additional analysis such as HRDetect, where RefSig 3 and RefSig 8 exposures can be used in place of COSMIC signatures 3 and 8 exposures.

#Convert the organ-specific signature exposures into reference signature exposures
snv_exp <- convertExposuresFromOrganToRefSigs(expMatrix = t(snv_exp[,1:(ncol(snv_exp)-1)]),typemut = "subs")

#write the results
writeTable(snv_exp,"RefSigSubsExposures.tsv")

Example 03: signature fit using FitMS

In this example we show how to use the multi-step signature fit function along with the Gini-based exposure filter.

Similarly to Examples 01 and 02, we construct the mutational catalogues and then we just need to specify the organ of interest:

#perform signature fit using a multi-step approach where organ-specific
#common and rare signatures are used
subs_fit_res <- FitMS(catalogues = SNV_catalogues,
                      exposureFilterType = "giniScaledThreshold",
                      useBootstrap = TRUE,
                      organ = "Breast")
plotFitMS(subs_fit_res,outdir = "signatureFit/")

The function FitMS will select automatically the common and rare signatures to use according to the specified organ. As a first step it will fit the common signatures, and as a second step it will attempt to determine the presence of the rare signatures.

In this example we have also specified to use bootstrap and to use the giniScaledThreshold exposure filter method. Results can be plotted with the plotFitMS function.

In addition, it is possible to save all the data in the subs_fit_res object as a JSON file:

writeFitResultsToJSON(fitObj = subs_fit_res,
                      filename = "FitMSresults.json")

A manual for FitMS can be found in the userManuals folder.

Example 04: signature fit using signatureFit_pipeline

Here we provide an example of using the signatureFit_pipeline function introduced in v2.1.0. This function is meant to automate some recurrent tasks in signature analysis, such as catalogues generation and selection of signatures to fit.

Similarly to the examples above, we store the location of the files containing the single nucleotide variants into the SNV_tab_files variable:

#set sample names
sample_names <- c("sample1","sample2")
#set the file names.
SNV_tab_files <- c("../../tests/testthat/test_hrdetect_1/test_hrdetect_1.snv.simple.txt",
                   "../../tests/testthat/test_hrdetect_2/test_hrdetect_2.snv.simple.txt")
#name the vectors entries with the sample names
names(SNV_tab_files) <- sample_names

Then, we simply call the signatureFit_pipeline function:

pipeline_subs_res <- signatureFit_pipeline(SNV_tab_files = SNV_tab_files,
                                           organ = "Breast",genome.v = "hg19",
                                           fit_method = "FitMS",nparallel = 2)

In one function call, the SNV catalogues have been generated, and FitMS has been applied. We can now save the catalogues, as well as saving the annotated mutations and plotting the results from FitMS:

#save catalogues plots
plotSignatures(pipeline_subs_res$catalogues,
               output_file = "SNV_catalogues.pdf",
               ncolumns = 2)
#write annotated SNVs to file
writeTable(pipeline_subs_res$annotated_mutations,"annotated_SNVs.tsv")
#plot the FitMS results
plotFitResults(pipeline_subs_res$fitResults,outdir = "signatureFit/")

The function plotSignatures will infer the type of mutations and select the correct signatures plot function. In this case, plotSignatures will notice that these are SNV catalogues and use the plotSubsSignatures function. Moreover, the function plotFitResults will infer the fit method (Fit or FitMS) and use the appropriate plot function, in this case plotFitMS.

The signatureFit script that can be found in the scripts folder is a command line wrapper for the signatureFit_pipeline function.

Example 05: common and rare signature extraction workflow

In this example, we provide an example of signature extraction workflow where both common and rare signatures are extracted. For more details, please see the rare signatures extraction manual in the userManuals folder.

We being by importing the package and setting an output directory where we will store all the results.

#import the package
library(signature.tools.lib)
# set an output directory
outdir <- "~/Example05/"
dir.create(outdir,showWarnings = F,recursive = T)

We load the catalogue and perform a clustering to see if samples with rare profiles can be identified and excluded from the extraction of common signatures.

# these are simulated data with 100 catalogues (9 common and 2 rare signatures)
catalogues <- readTable("tests/testthat/rareExtraction/SDExample05/catalogues.tsv")
# we begin by exploring the data using clustering
resCl <- cataloguesClustering(catalogues,nclusters = 1:15,
                              outdir = paste0(outdir,"cataloguesClustering/"))

After inspection of the clustering results, we can clearly identify one cluster with just two samples that we can exclude (nclusters=5, remove cluster 5). Then, we perform a signature extraction using NMF on the remaining samples.

# Let's try to remove samples that may have a rare signatures first
nclustersSelection <- "5"
clusters_to_keep <- c(1:4)
samplesSelected <- rownames(resCl$clusters_table)[resCl$clusters_table[,nclustersSelection] %in% clusters_to_keep]
#now extract common signatures
SignatureExtraction(cat = catalogues[,samplesSelected],
                    outFilePath = paste0(outdir,"Extraction/"),
                    nrepeats = 25,nboots = 4,filterBestOfEachBootstrap = T,
                    nparallel = 4,nsig = 8:11,plotResultsFromAllClusteringMethods = F,
                    parallel = T)

We now need to decide how many signatures are present, based on the error and the average silhouette width. In this case, 9 or 10 signatures seem optimal. Let us choose 9 here, though feel free to try 10 as well and see how the results are affected downstream.

# load the optimal signatures (9)
estimated_signatures <- readTable(paste0(outdir,"Extraction/round_1/sig_9/Sigs_plot_extraction_ns9_nboots4.tsv"))

Now we have a set of common signatures and we want to find whether some samples are still not fully explained by these signatures and see if we can extract rare signatures out of them.

# is there any sample that was not fully explained by the signatures we found?
unexplSamples <- unexplainedSamples(outfileRoot = paste0(outdir,"unexplained/Example05"),
                                    catalogues = catalogues,
                                    sigs = estimated_signatures,
                                    nmuts_threshold = 300,
                                    pvalue_threshold = 0.15)

Indeed there are four samples that are not fully explained. As there is some randomness you might have to tune the nmuts_threshold and pvalue_threshold parameters to select them. Use the Example05_UnexplainedCataloguesSelectionPlot.pdf scatter plot in the unexplained folder to guide you.

We cluster the residuals of the four samples, and we identify two clusters of two samples, each of them having very similar residuals. We can use the samples in these clusters to then extract two rare mutational signatures.

# yes, 4 samples, get their residual
significant_residuals <- unexplSamples$all_residuals[,unexplSamples$which_significant]
# now we cluster the residuals
resCl_residuals <- cataloguesClustering(significant_residuals,
                                        nclusters = 1:3,
                                        outdir = paste0(outdir,"residualClustering/"))
# we decide that there are two clusters, having two samples in each
# now we need to extract the rare signatures
resRareSigs <- rareSignatureExtraction(outfileRoot = paste0(outdir,"ExtractionRare/Example05"),
                                       catalogues = catalogues,
                                       residuals = unexplSamples$all_residuals,
                                       unexpl_samples = unexplSamples$unexplSamples,
                                       clusters = resCl_residuals$clusters_table[,"2"],
                                       useclusters = list(c(1),c(2)),
                                       commonSignatures = estimated_signatures,
                                       commonExposures = unexplSamples$exposures)

Finally, we can use all the common and rare signatures, as well as the information about where we extracted the rare signatures from, to estimate the exposures of each signature in each sample.

# now let's get the exposures
resFinalExpo <- finaliseCommonRareSignatureExposures(outfileRoot = paste0(outdir,"ExtractionRare/Example05"),
                                                     catalogues = catalogues,
                                                     commonSigs = estimated_signatures,
                                                     listofsignatures = resRareSigs$listofsignatures,
                                                     listofsamples = resRareSigs$listofsamples,
                                                     nboot = 50,nparallel = 4)

Frequently Asked Questions

Can I use your package on my Whole Exome Sequenced data?

Mostly, no. All algorithms included in this package, from signature extraction, to signature fit, to HRDetect, have been developed and tested using only Whole Genome Sequenced data, and are meant to be used only on Whole Genome Sequenced data. These are some of the reasons why:

This said, you may still achieve something decent if you limit yourself to breast cancer, or other organs where we know fairly well what signatures to expect. And you should be able to spot most MMR samples and other hypermutated samples, simply because they have a much higher number of SNVs and Indels.

In R/HRDetect.R, the function applyHRDetectDavies2017 seems to have mean/sd values hardcoded for each input feature. This results in input data not being standardised to mean=0/sd=1. Is this intentional? I'm assuming these values come from the results for the 371 samples in the 2017 paper?

Yes it is intentional. In order for the parameters of the linear model to be meaningful the new input data has to be processed in the same way as the training data, which means that the same mean and sd of the training data should be provided. Please check the R documentation of the function for the details of how to input your data.

We have a set of samples from non-BRCA cancers, sequenced on a Novaseq, and aligned/called through our in-house pipeline. Are these differences in sequencing/processing large enough to render the parameters of the HRDetect model meaningless?

Before running HRDetect we make sure that the data is of a good quality and try to reduce artefacts and other false positives. We mostly worked with Sanger cgp pipeline, and also adapted output from other pipelines sometimes with filters to get the same level of specificity, and found that HRDetect is quite robust. What I mean is that rather than retraining HRDetect for different pipelines, we usually prefer to work on the pipeline until we are happy with the calls. I can advise to have a look at samples in the same tumour type from highly curated resources like PCAWG. You can browse some at https://signal.mutationalsignatures.com.

Should I use COSMIC signatures or tissue-specific signatures with my data? Which will give more accurate signature assignment?

In general, we expect tissue-specific signatures to be more accurate, as we have also demonstrated using simulations in Degasperi et al. 2022, Science, Figure S53, though in practice it may depend on the tumour type. Some of the tumour types have more reliable signatures than others (e.g. due to sample size, number of signatures present, similarity between signatures present...). Please double check on SIGNAL that you are happy with the signatures of your cancer type. From version 2.0 of signature.tools.lib, the signature fit algorithm FitMS automatically selects suitable common and rare signatures for fitting if the parameter organ is used. From version 2.1.0, the signatureFit_pipeline also allows the automatic selection of signatures to fit based on tumour types, and also allows to request COSMIC signatures. If COSMIC signatures are used, we advise to select a subset of the COSMIC signatures that are believed to be present in the tumour type to be analysed, rather than attempting to fit all signatures at once.

How do I interpret the Overall Metrics plot that I obtain from SignatureExtraction? What are all these metrics?

During the investigation of signature extraction methods, we have defined multiple useful metrics. Some of these were not reported in Degasperi et al. 2020, Nature Cancer, as we did not find them essential. The metrics are:

While it can be interesting to look at all these metrics, we recommend to mostly use norm.Error (orig. cat.) and Ave.SilWid, using a suitable number of bootstraps and repeats and with the Clustering with Matching (MC) clustering algorithm, as described in Degasperi et al. 2020, Nature Cancer.

What parameters should I use for signature extraction?

It depends on how many samples you have, with more samples requiring more repeats. In general, we advise to use 20 bootstraps, 200 repeats, the clustering with matching algorithm (CM), the KLD objective function (nmfmethod brunet) and RTOL=0.001. The number of repeats (nrepeats) is more important than the number of bootstraps, because at least 100 repeats are necessary to obtain enough results to then select best runs using the RTOL threshold (Extended Data Figure 1 of Degasperi et al. 2020, Nature Cancer). The disadvantage of this method is that this can increase the computation time required significantly if one uses more than 500/1000 samples.

Where do I find the activity (exposures) matrix after running the signature extraction?

We do not report the activity matrix from the extraction, only the signatures. We then use a signature fit procedure (here the function Fit or FitMS) to estimate the activities. We tend to think of the estimation of the activities as a separate procedure, performed holding a given set of a priori signatures fixed. This allows for more playing around with the signature fit, for example excluding clear false positive signatures from the fit.

Which Structural Variants caller and filters should I use?

Our team has been using mostly Manta and Brass. For Manta, we tend to tune false positives using the PR column.

https://github.com/Illumina/manta/tree/master/docs/userGuide

After running Manta you will have to convert your output to bedpe using a vcf to bedpe converter. There are a few out there, you can try this one:

https://github.com/arq5x/lumpy-sv/blob/master/scripts/vcfToBedpe