AstraZeneca-NGS / VarDictJava

VarDict Java port
MIT License
129 stars 55 forks source link

Bioconda European Galaxy server

This is the Final Version of VarDict. No longer maintained.

VarDictJava

Introduction

VarDictJava is a variant discovery program written in Java and Perl. It is a Java port of VarDict variant caller.

The original Perl VarDict is a sensitive variant caller for both single and paired sample variant calling from BAM files. VarDict implements several novel features such as amplicon bias aware variant calling from targeted sequencing experiments, rescue of long indels by realigning bwa soft clipped reads and better scalability than many other Java based variant callers. The Java port is around 10x faster than the original Perl implementation.

Please cite VarDict:

Lai Z, Markovets A, Ahdesmaki M, Chapman B, Hofmann O, McEwen R, Johnson J, Dougherty B, Barrett JC, and Dry JR. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res. 2016, pii: gkw227.

The link to is article can be accessed through: https://academic.oup.com/nar/article/44/11/e108/2468301?searchresult=1

Original coded by Zhongwu Lai 2014.

VarDictJava can run in single sample (see Single sample mode section), paired sample (see Paired variant calling section), or amplicon bias aware modes. As input, VarDictJava takes reference genomes in FASTA format, aligned reads in BAM format, and target regions in BED format.

Requirements

  1. JDK 1.8 or later
  2. R language (uses /usr/bin/env R)
  3. Perl (uses /usr/bin/env perl)
  4. Internet connection to download dependencies using gradle.

To see the help page for the program, run

 <path_to_vardict_folder>/build/install/VarDict/bin/VarDict -H.

Getting started

Getting source code

The VarDictJava source code is located at https://github.com/AstraZeneca-NGS/VarDictJava.

To load the project, execute the following command:

git clone --recursive https://github.com/AstraZeneca-NGS/VarDictJava.git

Note that the original VarDict project is placed in this repository as a submodule and its contents can be found in the sub-directory VarDict in VarDictJava working folder. So when you use teststrandbias.R and var2vcf_valid.pl. (see details and examples below), you have to add prefix VarDict: VarDict/teststrandbias.R and VarDict/var2vcf_valid.pl.

Compiling

The project uses Gradle and already includes a gradlew script.

To build the project, in the root folder of the project, run the following command:

./gradlew clean installDist

Clean will remove all old files from build folder.

To generate Javadoc, in the build/docs/javadoc folder, run the following command. If you want to save content of build folder as it is (for example after building the project), run it without clean option:

./gradlew clean javadoc

To generate release version in the build/distributions folder as tar or zip archive, run the following command:

./gradlew distZip

or

./gradlew distTar

Distribution Package Structure

When the build command completes successfully, the build/install/VarDict folder contains the distribution package.

The distribution package has the following structure:

You can move the distribution package (the content of the build/install/VarDict folder) to any convenient location.

Generated zip and tar releases will also contain scripts from VarDict Perl repository in bin/ directory (teststrandbias.R, testsomatic.R, var2vcf_valid.pl, var2vcf_paired.pl).

You can add VarDictJava on PATH by adding this line to .bashrc:

export PATH=/path/to/VarDict/bin:$PATH

After that you can run VarDict by Vardict command instead of full path to <path_to_vardict_folder>/build/install/VarDict/bin/VarDict.

Third-Party Libraries

Currently, the project uses the following third-party libraries:

Single sample mode

To run VarDictJava in single sample mode, use a BAM file specified without the | symbol and perform Steps 3 and 4 (see the Program workflow section) using teststrandbias.R and var2vcf_valid.pl. The following is an example command to run in single sample mode with BED file.
You have to set options -c, -S, -E, -g using number of columns in your BED file for chromosome, start, end and gene of region respectively:

AF_THR="0.01" # minimum allele frequency
<path_to_vardict_folder>/build/install/VarDict/bin/VarDict -G /path/to/hg19.fa -f $AF_THR -N sample_name -b /path/to/my.bam -c 1 -S 2 -E 3 -g 4 /path/to/my.bed | VarDict/teststrandbias.R | VarDict/var2vcf_valid.pl -N sample_name -E -f $AF_THR > vars.vcf

VarDictJava can also be invoked without a BED file if the region is specified in the command line with -R option. The following is an example command to run VarDictJava for a region (chromosome 7, position from 55270300 to 55270348, EGFR gene) with -R option:

<path_to_vardict_folder>/build/install/VarDict/bin/VarDict  -G /path/to/hg19.fa -f 0.001 -N sample_name -b /path/to/sample.bam -R  chr7:55270300-55270348:EGFR | VarDict/teststrandbias.R | VarDict/var2vcf_valid.pl -N sample_name -E -f 0.001 > vars.vcf

In single sample mode, output columns contain a description and statistical info for variants in the single sample. See section Output Columns for list of columns in the output.

Paired variant calling

To run paired variant calling, use BAM files specified as BAM1|BAM2 and perform Steps 3 and 4 (see the Program Workflow section) using testsomatic.R and var2vcf_paired.pl.

In this mode, the number of statistics columns in the output is doubled: one set of columns is for the first sample, the other - for second sample.

The following is an example command to run in paired mode.
You have to set options -c, -S, -E, -g using number of columns in your bed file for chromosome, start, end and gene of region respectively:

AF_THR="0.01" # minimum allele frequency
<path_to_vardict_folder>/build/install/VarDict/bin/VarDict -G /path/to/hg19.fa -f $AF_THR -N tumor_sample_name -b "/path/to/tumor.bam|/path/to/normal.bam" -c 1 -S 2 -E 3 -g 4 /path/to/my.bed | VarDict/testsomatic.R | VarDict/var2vcf_paired.pl -N "tumor_sample_name|normal_sample_name" -f $AF_THR

Amplicon based calling

This mode is active if the BED file uses 8-column format and the -R option is not specified.

In this mode, only the first list of BAM files is used even if the files are specified as BAM1|BAM2 - like for paired variant calling.

For each segment, the BED file specifies the list of positions as start and end positions (columns 7 and 8 of the BED file). The Amplicon based calling mode outputs a record for every position between start and end that has any variant other than the reference one (all positions with the -p option). For any of these positions, VarDict in amplicon based calling mode outputs the following:

For this running mode, the -a option (default: 10:0.95) specifies the criteria of discarding reads that are too far away from segments. A read is skipped if its start and end are more than 10 positions away from the segment ends and the overlap fraction between the read and the segment is less than 0.95.

Running Tests

Integration testing

The list of integration test cases is stored in files in testdata/intergationtestcases directory. To run all integration tests, the command is:

./gradlew test --tests com.astrazeneca.vardict.integrationtests.IntegrationTest 

The results of the tests can be viewed in the build/reports/tests/index.html file.

User extension of testcases

Each file in testdata/intergationtestcases directory represents a test case with input data and expected output

1. Create a txt file in testdata/intergationtestcases folder.

The file contains testcase input (of format described in Test cases file format) in the first line and expected output in the remaining file part.

2. Extend or create thin-FASTA in testdata/fastas folder.

3. Run tests.

Test cases file format

Each input file represents one test case input description. In the input file the first line consists of the following fields separated by , symbol:

Required fields:

Optional fields:

Parameters field:

Example of first line of input file:

Amplicon,hg19.fa,Colo829-18_S3-sort.bam,chr1,933866,934466,933866,934466,-a 10:0.95 -D
Somatic,hg19.fa,Colo829-18_S3-sort.bam|Colo829-19_S4-sort.bam,chr1,755917,756517
Simple,hg19.fa,Colo829-18_S3-sort.bam,chr1,9922,10122,-p
Test coverage Report

To build test coverage report run the following command:

./gradlew test jacocoTestReport 

Then HTML report could be found in build/reports/jacoco/test/html/index.html

Thin-FASTA Format

Thin fasta is needed to store only needed for tests regions of real references to decrease disk usage. Each thin-FASTA file is .csv file, each line of which represent part of reference data with information of:

thin-FASTA example:

chr1,1,15,ATGCCCCCCCCCAAA
chr1,200,205,GCCGA
chr2,10,12,AC

Note: VarDict expands given regions by 1200bp to left and right (plus given value by -x option).

Program Workflow

The main workflow

The VarDictJava program follows the workflow:

  1. Get regions of interest from a BED file or the command line.

  2. For each segment:

    1. Find all variants for this segment in mapped reads:
      1. Optionally skip duplicated reads, low mapping-quality reads, and reads having a large number of mismatches.
      2. Skip unmapped reads.
      3. Skip a read if it does not overlap with the segment.
      4. Preprocess the CIGAR string for each read (section CIGAR Preprocessing).
      5. For each position, create a variant. If a variant is already present, adjust its count using the adjCnt function.
      6. Combine SNVs into MNV or SNV with indel to complex if variants are located closer than -X option (default <=2 bases) and have good base qualities.
    2. Find structural variants (optionally can be disabled by option -U).
    3. Realign some of the variants using realignment of insertions, deletions, large insertions, and large deletions using unaligned parts of reads (soft-clipped ends). This step is optional and can be disabled using the -k 0 switch.
    4. Apply variant filtering rules (hard filters) defined in Variant filtering.
    5. Assign a type to each variant.
    6. Output variants in an intermediate internal format (tabular). Columns of the table are described in the Output Columns section.

      Note: To perform Steps 1 and 2, use Java VarDict.

  3. Perform a statistical test for strand bias using an R script.
    Note: Use R scripts teststrandbias.R or testsomatic.R for this step.

  4. Transform the intermediate tabular format to VCF. Output the variants with filtering and statistical data.
    Note: Use the Perl scripts var2vcf_valid.pl or var2vcf_paired.pl for this step. Be aware that var2vcf_valid.pl or var2vcf_paired.pl by default will output the variant with the highest AF on position: if few variants start at one position, only the highest will be added in VCF. To output all variants use -A option with these perl scripts.

CIGAR Preprocessing (Initial Realignment)

Read alignment is specified in a BAM file as a CIGAR string. VarDict modifies this string (and alignment) in the following special cases:

Variants

Simple variants (SNV, simple insertions, and deletions) are constructed in the following way:

Structural Variants are looked for after simple variants. VarDict supported DUP, INV and DEL structural variants.

Variant Description String

The description string encodes a variant for VarDict internal use.

The following table describes Variant description string encoding:

String Description
[ATGC] for SNPs
+[ATGC]+ for insertions
-[0-9]+ for deletions
...#[ATGC]+ for insertion/deletion variants followed by a short matched sequence
...^[ATGC]+ something followed by an insertion
...^[0-9]+ something followed by a deletion
...&[ATGC]+ for insertion/deletion variants followed by a matched sequence

Variant Filtering

A variant appears in the output if it satisfies the following criteria (in this order). If variant doesn't fit criteria on the step, it will be filtered out and the next steps won't be checked (except for the step 8, read the explanation below):

  1. Frequency of the variant exceeds the threshold set by the -f option (default = 1%).
  2. The minimum number of high-quality reads supporting variant is larger than the threshold set by the -r option (default = 2).
  3. The mean position of the variant in reads is larger than the value set by the -P option (default = 5).
  4. The mean base quality (phred score) for the variant is larger than the threshold set by the -q option (default = 22.5).
  5. Variant frequency is more than 25% or reference allele does not have much better mapping quality than the variant.
  6. Deletion variants are not located in the regions where the reference genome is missing.
  7. The ratio of high-quality reads to low-quality reads is larger than the threshold specified by -o option (default=1.5).
  8. Variant frequency exceeds 30%. If so, next steps won't be checked and variant considered as "good". Otherwise, other steps will be also checked.
  9. Mean mapping quality exceeds the threshold set by the -O option (default: no filtering)
  10. In the case of an MSI region, the variant size is less than 12 nucleotides for the non-monomer MSI or 15 for the monomer MSI. Variant frequency is more than 10% for the non-monomer MSI (or set by --nmfreq option) and 25% for the monomer MSI (or set by --mfreq option).
  11. Variant has not "2;1" bias or variant frequency more than 20%. If both conditions aren't met, then variant mustn't be SNV and any of variants refallele or varallele lengths must be more than 3 nucleotides.

Bias flag explanation

Bias flag can take values [0-2];[0-2] (i.e. "0;2", "2;1" and separator can be another in paired and single VCF). The first value refers to reads that support the reference allele, and the second to reads that support the variant allele.

0 - small total count of reads (less than 12 for the sum of forward and reverse reads) 1 - strand bias 2 - no strand bias

Variant classification in paired(somatic) analysis

In paired analysis, VarDict will classify each variant into the following types that are propagated into STATUS info tag after var2vcf_paired.pl script.

When both samples have coverage for the variant:

When only one sample has coverage for the variant:

These are only rough classification. You need to examine the p-value (after testsomatic.R script) to determine whether or not it's significant.

Program Options

VarDictJava options

Important var2vcf_valid.pl options

The full list of options in VarDictPerl var2vcf_valid.pl -h

Important var2vcf_paired.pl options

The full list of options in VarDictPerl var2vcf_paired.pl -h

Output columns

Simple mode:

  1. Sample - sample name
  2. Gene - gene name from a BED file
  3. Chr - chromosome name
  4. Start - start position of the variation
  5. End - end position of the variation
  6. Ref - reference sequence
  7. Alt - variant sequence
  8. Depth (DP) - total coverage
  9. AltDepth (VD) - variant coverage
  10. RefFwdReads (REFBIAS) - reference forward strand coverage
  11. RefRevReads (REFBIAS) - reference reverse strand coverage
  12. AltFwdReads (VARBIAS) - variant forward strand coverage
  13. AltRevReads (VARBIAS) - variant reverse strand coverage
  14. Genotype - genotype description string
  15. AF - allele frequency
  16. Bias - strand bias flag
  17. PMean - mean position in read
  18. PStd - flag for read position standard deviation (1 if the variant is covered by at least 2 read segments with different positions, otherwise 0).
  19. QMean - mean base quality
  20. QStd - flag for base quality standard deviation
  21. MAPQ - mapping quality
  22. QRATIO (SN) - ratio of high quality reads to low-quality reads
  23. HIFREQ (HIAF) - variant frequency for high-quality reads
  24. EXTRAFR (ADJAF) - Adjusted AF for indels due to local realignment
  25. SHIFT3 - No. of bases to be shifted to 3 prime for deletions due to alternative alignment
  26. MSI - MicroSatellite. > 1 indicates MSI
  27. MSILEN - MicroSatellite unit length in bp
  28. NM - average number of mismatches for reads containing the variant
  29. HICNT - number of high-quality reads with the variant
  30. HICOV - position coverage by high quality reads
  31. 5pFlankSeq (LSEQ) - neighboring reference sequence to 5' end
  32. 3pFlankSeq (RSEQ) - neighboring reference sequence to 3' end
  33. SEGMENT:CHR_START_END - position description
  34. VARTYPE - variant type
  35. DUPRATE - duplication rate in fraction
  36. SV splits-pairs-clusters: Splits (SPLITREAD) - No. of split reads supporting SV, Pairs (SPANPAIR) - No. of pairs supporting SV, Clusters - No. of clusters supporting SV
  37. CRISPR - only in crispr mode - how close to a CRISPR site is the variant

Amplicon mode

In amplicon mode columns from #35 are changed to:
(35) GoodVarCount (GDAMP) - number of good variants on amplicon
(36) TotalVarCount (TLAMP) - number of good and bad variants on amplicon
(37) Nocov (NCAMP) - number of variants on amplicon that has depth less than 1/50 of the max depth (they will be considered not working and thus not used).
(38) Ampflag - if there are different good variants on different amplicons, it will be 1.

Somatic mode

In somatic mode we have information from both samples:

  1. Sample - sample name
  2. Gene - gene name from a BED file
  3. Chr - chromosome name
  4. Start - start position of the variation
  5. End - end position of the variation
  6. Ref - reference sequence
  7. Alt - variant sequence
    Fields from first sample:
  8. Depth (DP) - total coverage
  9. AltDepth (VD) - variant coverage
  10. RefFwdReads (REFBIAS) - reference forward strand coverage
  11. RefRevReads (REFBIAS) - reference reverse strand coverage
  12. AltFwdReads (VARBIAS) - variant forward strand coverage
  13. AltRevReads (VARBIAS) - variant reverse strand coverage
  14. Genotype - genotype description string
  15. AF - allele frequency
  16. Bias - strand bias flag
  17. PMean - mean position in read
  18. PStd - flag for read position standard deviation
  19. QMean - mean base quality
  20. QStd - flag for base quality standard deviation
  21. MAPQ - mapping quality
  22. QRATIO (SN) - ratio of high quality reads to low-quality reads
  23. HIFREQ (HIAF) - variant frequency for high-quality reads
  24. EXTRAFR (ADJAF) - Adjusted AF for indels due to local realignment
  25. NM - average number of mismatches for reads containing the variant
    Fields from second sample:
  26. Depth - total coverage
  27. AltDepth - variant coverage
  28. RefFwdReads (REFBIAS) - reference forward strand coverage
  29. RefRevReads (REFBIAS) - reference reverse strand coverage
  30. AltFwdReads (VARBIAS) - variant forward strand coverage
  31. AltRevReads (VARBIAS) - variant reverse strand coverage
  32. Genotype - genotype description string
  33. AF - allele frequency
  34. Bias - strand bias flag
  35. PMean - mean position in read
  36. PStd - flag for read position standard deviation
  37. QMean - mean base quality
  38. QStd - flag for base quality standard deviation
  39. MAPQ - mapping quality
  40. QRATIO (SN) - ratio of high quality reads to low-quality reads
  41. HIFREQ (HIAF) - variant frequency for high-quality reads
  42. EXTRAFR (ADJAF) - Adjusted AF for indels due to local realignment
  43. NM - average number of mismatches for reads containing the variant
    Common fields:
  44. SHIFT3 - No. of bases to be shifted to 3 prime for deletions due to alternative alignment
  45. MSI - MicroSatellite. > 1 indicates MSI
  46. MSILEN - MicroSatellite unit length in bp
  47. 5pFlankSeq (LSEQ) - neighboring reference sequence to 5' end
  48. 3pFlankSeq (RSEQ) - neighboring reference sequence to 3' end
  49. SEGMENT:CHR_START_END - position description
  50. VarLabel - variant label due to type: StrongLOH, StrongSomatic...
  51. VARTYPE - variant type
  52. DUPRATE1 - duplication rate in fraction from first sample
  53. SV_info1 - Splits - No. of split reads supporting SV, Pairs - No. of pairs supporting SV, Clusters - No. of clusters supporting SV from first sample
  54. DUPRATE2 - duplication rate in fraction from second sample
  55. SV_info2: Splits - No. of split reads supporting SV, Pairs - No. of pairs supporting SV, Clusters - No. of clusters supporting SV from second sample

Input Files

BED File – Regions

VarDict uses 2 types of BED files for specifying regions of interest: 8-column and all others. The 8-column file format is used for targeted DNA deep sequencing analysis (amplicon based calling), amplicon analysis will try to start if BED with 8 columns was provided. Otherwise you can start single and paired sample analysis by providing options -c, -S, -E, -g with number of columns for chromosome, start, end, gene of the region respectively.

All lines starting with #, browser, and track in a BED file are skipped. The column delimiter can be specified as the -d option (the default value is a tab “\t“).

The 8-column amplicon BED file format involves the following data:

For example 4-column BED file format involves the following data and VarDict must be start with -c 1 -S 2 -E 3 -g 4 to recognize it:

FASTA File - Reference Genome

The reference genome in FASTA format is read using HTSJDK library. For every invocation of the VarDict pipeline (usually 1 for a region in a BED file) and for every BAM file, a part of the reference genome is extracted from the FASTA file. In some cases of Structural Variants finding the reference can be reread in other regions.

Region of FASTA extends and this extension can be regulated via the REFEXT variable (option -Y INT, default 1200 bp).

Errors and warnings

Information about some of the errors and their causes is located in wiki

License

The code is freely available under the MIT license.

Contributors

Java port of VarDict implemented based on the original Perl version (Zhongwu Lai) by: