IARCbioinfo / needlestack

Multi-sample somatic variant caller
GNU General Public License v3.0
49 stars 15 forks source link
bam-files docker negative-binomial-regression nextflow ngs pipeline regression samtools somatic-mutations variants

A multi-sample somatic variant caller

Join the chat at https://gitter.im/iarcbioinfo/needlestack CircleCI Docker Hub

A preprint describing needlestack is available on biorxiv: https://www.biorxiv.org/content/10.1101/639377v1

Contact: follm@iarc.fr

Jump over here for a quick start!

Needlestack development is support by the US National Cancer Institute (grant number R21CA175979) and the French Institut national du cancer.

Workflow representation

WARNING

Since the last release, we have harmonized our pipelines and the following option names changed:

Description

Needlestack is an ultra-sensitive multi-sample variant caller for Next Generation Sequencing (NGS) data. It is based on the idea that analysing several samples together can help estimate the distribution of sequencing errors to accurately identify variants. It has been initially developed for somatic variant calling using very deep NGS data from circulating free DNA, but is also applicable to lower coverage data like Whole Exome Sequencing (WES) or even Whole Genome Sequencing (WGS). It is a highly scalable and reproducible pipeline thanks to the use of nextflow and Docker/Singularity container technologies.

Here is a summary of the method:

Note that needlestack is not performing a re-assembly of reads to determine haplotypes as some other modern variant callers are doing (Mutect2, Strelka2 etc.). For this reason we recommend using ABRA2, a realigner for NGS data, before running needlestack. We also host a nextflow pipeline for ABRA2.

The detection of a variant can be visualized graphically. Here are two examples of a variant detected, one from a series of samples with a coverage of 120X (left) and one with a very high coverage (50000X, right). Each dot represents a sequenced sample colored according to its Phred scale q-value. The black regression line shows the estimated sequencing-error rate along with the 99% confidence interval (black dotted lines) containing samples. Colored-dotted lines correspond to the limits of regions defined for different significance q-value thresholds. In both case one sample appears as outlier from the regression (in red), and is therefore classified as carrying the given mutation. Note that the position on the left has a high error rate, while the position on the right has a very low error rate, allowing the detection of a variant with a tiny allelic fraction (0.025%). Example of an alignment plot

Dependencies

Needlestack works under most Linux distributions and Apple OS X.

  1. This pipeline is based on nextflow. As we host several nextflow pipelines, we have centralized the common information in the IARC-nf repository. Please read it carefully as it contains essential information for the installation, basic usage and configuration of nextflow and our pipelines.

  2. External software:

    • bedtools
    • samtools
    • R with Rscript
    • Compile the file mpileup2readcounts.cc located here
    • Add the previous tools to your path (executables are assumed to be respectively called samtools, Rscript and mpileup2readcounts)
    • Optionally, the bioconductor packages required for the alignments plots (see the Detailed description section)

You can avoid installing all the external software by using Docker or Singularity. See the IARC-nf repository for more information. Warning: the docker/singularity containers don't include the bioconductor packages needed to plot alignments.

To use needlestack without nextflow, in addition to the previous tools, download the files in this bin directory and add them to your path. See the Usage section to run needlestack without nextflow.

Input

Type Description
BAM files BAM files (called *.bam) grouped in a single folder or described in a text file (one BAM path per line) along with their index files (called *.bam.bai or *.bai). A minimum of 20 BAM files is recommended.
fasta file A reference fasta file (eventually compressed with bgzip) along with its faidx index (and *.gzi faidx index if compressed).
bed file Optional input file. Otherwise the variant calling is performed on the whole reference provided.

A sample dataset is available here for testing.

Parameters

Type --help to get the full list of options. All parameters are prefixed with a double-dash like in --help.

Name Example value Description
input_bams BAM/ or bams.txt Folder containing the BAM files
ref fasta_name.fa Reference fasta file
output_vcf vcf_name.vcf Name of the output VCF file generated by needlestack
Name Default value Description
min_dp * 30 Minimum median coverage to consider a site. In addition, at least 10 samples have to be covered by min_dp.
min_ao * 3 Minimum number of non-ref reads in at least one sample to consider a site
min_af * 0 Minimum allelic fraction in at least one sample to consider a site
nsplit 1 Split the bed file in nsplit pieces and run in parallel
min_qval 50 qvalue threshold in Phred scale to consider a variant
sb_type SOR Strand bias measure, either SOR, RVSB or FS
sb_snv 100 or 1000 Strand bias threshold for SNVs. The default of 100 (1000 if sb_type is FS) means no filter
sb_indel 100 or 1000 Strand bias threshold for indels. The default of 100 (1000 if sb_type is FS) means no filter
map_qual 0 Min mapping quality (passed to samtools)
base_qual 13 Min base quality (passed to samtools)
max_dp 50000 Downsample coverage per sample (passed to samtools)
plots SOMATIC if tn_pairs provided, ALL if not Creates pdf plots of regressions in the output. To remove plots set --plots to NONE (See Plot options paragraph).
genome_release - Reference genome for the annotions on the alignments plots. Examples: Hsapiens.UCSC.hg19, Hsapiens.UCSC.hg18, Hsapiens.UCSC.hg38, Mmusculus.UCSC.mm10. The right terminology is described below (Plot options paragraph). WARNING: this option is mandatory if the --do_alignments flag is used
output_folder --bam_folder Output folder, by default equals to the input bam folder
bed - BED file containing a list of regions (or positions) where needlestack will perform variant calling
region - A region in format CHR:START-END where calling should be done
tn_pairs - A tab-delimited file containing two columns (normal and tumor sample names) for each sample in line. This enables matched tumor/normal pair calling features (see below)
sigma_normal 0.1 Sigma parameter for negative binomial modeling of expected germline allelic fraction. We strongly recommend not to change this parameter unless you really know what it means
min_af_extra_rob 0.2 Minimum allelic fraction to exclude a sample at a position for extra-robust regression
min_prop_extra_rob 0.1 Minimum proportion of samples having an allelic fraction to be excluded from extra-robust regression
max_prop_extra_rob 0.8 Maximum proportion of samples having an allelic fraction to be excluded from extra-robust regression
power_min_af Allelic fraction used to classify genotypes as 0/0 or ./. depending of the power to detect a variant at this fraction (see below)
input_vcf A VCF file (from GATK) where calling should be done. Needlestack will extract DP and AO from this VCF (DP and AD fields) and annotate it with phred q-value score (FORMAT/QVAL field), error rate (INFO/ERR) and over-dispersion sigma parameter (INFO/SIG). WARNING: by default, only works with split (coreutils) version > 8.13

* Caution: min_dp, min_ao and min_af are position-based filters, which means that needlestack might still identify a variant with an allelic fraction lower than min_af in a sample if at the same position, another sample has an allelic fraction larger than min_af.

By default, if neither --bed nor --region are provided, needlestack runs on the whole reference genome provided, building a bed file from fasta index. If --bed and --region are both provided, it runs on the region only.

Flags are parameters without value.

Name Description
help Display help
all_SNVs Output all SNVs, even when no variant is detected
extra_robust_gl Perform extra-robust regression (useful for common germline SNPs, see below)
no_labels No label for the outliers on regression plots
no_indels Do not perform variant calling for insertions and deletions
no_contours Do not plot q-values contours (for q-value threshold={10,30,50,70,100} by default) and do not plot minimum detectable allelic fraction as a function function of coverage
use_file_name Use the bam file names as sample names. By default the sample name is extracted from the bam file SM tag
do_alignments Add alignments plots to the pdf plots of regressions

Usage

  1. Optionally download a sample dataset.

    git clone --depth=1 https://github.com/IARCbioinfo/data_test
  2. Run the pipeline using nextflow (here using docker).

    cd data_test
    nextflow run iarcbioinfo/needlestack -with-docker \
                     --bed BED/TP53_all.bed \
                     --input_bams BAM/BAM_multiple/ \
                     --ref REF/17.fasta \
                     --output_vcf all_variants.vcf 

    You will find a VCF file called all_variants.vcf in the BAM_multiple/ folder once done.

    The first time it will take more time as the pipeline will be downloaded from github and the docker container from dockerhub.

    Official releases can be found here. There is a corresponding official docker container for each release and one can run a particular version using (for example for v0.3):

    nextflow run iarcbioinfo/needlestack -r v0.3 -with-docker \
                --bed BED/TP53_all.bed --bam_folder BAM/BAM_multiple/ --fasta_ref REF/17.fasta --out_vcf all_variants.vcf

    It is also possible to run the pipeline without nextflow (not recommended) using the needlestack.sh. Here on the example dataset downloaded above:

    cd data_test/
    needlestack.sh --region=17:7572814-7573814 --bam_folder=BAM/BAM_multiple --ref=REF/17.fasta --output_vcf=all_variants.vcf

Output

Type Description
VCF file File containing all the variants detected. It contains a header of meta-information and data lines containing information about each variant position
PDF files For each variant detected a pdf is generated, this pdf file contains the regression plots and alignments plots if the --do_alignments flag is used

Detailed description

Germline, somatic, matched Tumor-Normal pairs calling and contamination

When using matched tumor/normal, Needlestack can classify variants (VCF FORMAT/STATUS field) according to the following table: Status Table

For this, one need to provide a tab-delimited file containing two columns with normal and tumor sample names using the --tn_pairs option. The first line of this file is a header with TUMOR and NORMAL keywords. When one normal or one tumor is missing, one can write NA. In this mode, the parameter power_min_af defines the allelic fraction in the tumor one is trying to detect to classify genotypes as ./. or 0/0 depending on the power to detect this allelic fraction. Variants found as somatic in a tumor, but germline in another sample of the series will be flagged as POSSIBLE_CONTAMINATION. We found this particularly important, as needlestack is very sensitive to low allelic fractions, to filter out contamination among samples, coming for example from pooled exome capture.

In other cases (when there is no --tn_pairs parameter defined), genotypes are defined as ./. or 0/0 assuming one is looking for allelic fractions expected for germline variants (negative binomial distribution centered at 0.5 with over-dispersion parameter sigma=sigma_normal, with sigma_normal=0.1 by default). If you are looking for somatic variants without matched-normal and assuming you are interesting to correctly distinguish ./. and 0/0genotypes, you can set the power_min_af parameter to the lowest allelic fraction of somatic variants you are interested with (and your coverage allows you to find). Note that this is by far not the most common situation, and that in most cases you don't have to worry about the power_min_af parameter.

Plot options

Two options to consider:

  1. --plots:

    • SOMATIC: produce pdf regression plots (like in the description above) only for somatic variants (option not available in the annotation mode). Default value when using matched tumor/normal.
    • ALL: produce pdf regression plots for all variants. Default value when not using matched tumor/normal or when using annotation mode.
    • NONE: remove pdf regression plots from the output.
  2. --do_alignments: To add the alignments plots after the regression plots. If this flag is used, the name of the reference genome (--genome_release option) needs to be provided to choose the correct annotation (See --genome_release option below).

The following bioconductor packages need to be installed for plotting the alignments:

Example: If the hg19 release of the human genome is used, the following packages should be installed: Gviz, TxDb.Hsapiens.UCSC.hg19.knownGene (hg18 and hg38 UCSC version can also be used) and org.Hs.eg.db

These packages exist for other organisms than Human but have not been tested. One can for example generate the alignments plot for mouse data by installing TxDb.Mmusculus.UCSC.mm10.knownGene (mm9 UCSC version can also be used) and org.Mm.eg.db. For the other organisms the packages need to have the same nomenclature as the ones listed above.

The --genome_release option needs to be provided and corresponds to the TxDb annotation package name without its prefix and suffix. For the hg19 release of the human genome, one needs to set --genome_release to Hsapiens.UCSC.hg19. Note that the packages chosen for the annotations are compatible with the UCSC notations since most of the Gviz fonctionalities can handle these notations. The reference genome used for the BAM alignments can be based on GENCODE, UCSC or ENSEMBL genome varieties.

Warning: the docker/singularity containers don't include the bioconductor packages needed to plot alignments.

Example of an alignment plot

The alignment plot represents from top to bottom:

Common variants

Needlestack is designed to identify rare variants (i.e. only a few samples in your set of samples have a particular variant), because of the robust regression used. Therefore, common SNPs (>10%) or strong somatic hotspots will be missed. The optional extra_robust_gl can overcome partially this issue for common germline mutations: it first discard high allelic fraction (>20%, assuming these are likely true variants) before fitting the regression model when between 10% and 50% of sample have such a high allelic fraction. A flag is written in the VCF INFO/WARN field when this happened (EXTRA_ROBUST_GL). Additionally, when an other allele than the reference allele is the most common, it is taken as the reference allele and a flag is also written in the VCF (INFO/WARN=INV_REF).

Strand bias

For conventional variant callers, GATK WES/WGS recommended values for SOR strand bias are SOR < 4 for SNVs and < 10 for indels. We haven't found this particularly useful, but a more detailed evaluation is necessary. For amplicon based targeted sequencing, RVSB>0.85 seems to reduce some errors. There is no hard filter by default as this is easy to do afterward using bcftools filter command.

Contributions

Name Email Description
Matthieu Foll* follm@iarc.fr Lead developer, contact for support
Tiffany Delhomme* delhommet@students.iarc.fr Main developer, contact for support
Nicolas Alcala alcalan@fellows.iarc.fr tumor/normal pair mode, power calculation
Aurelie Gabriel gabriela@students.iarc.fr Alignment plot, nextflow pipeline, mpileup2readcounts, bash version

References

Needlestack has been used in these papers: