OLF-Bioinformatics / VariantDetective

Identify short variants and structural variants from raw sequencing data or genomic sequences
MIT License

VariantDetective

This program is designed to identify short variants and structural variants. Variants can be identified from genomic sequences (FASTA) or from combinations of short and/or long reads (FASTQ). If genomic sequences are provided as input, long reads will be simulated to detect variants.

This tool makes use of other open-source variant callers and creates consensus sets in order to validate a variant. Summary files for short variants and structural variants are generated outlining the different types of variants found in the sample.

Author

Phil Charron <phil.charron@inspection.gc.ca>

Installation

All software and tools used by VariantDetective are listed in spec-file.txt. VariantDetective can be installed either with pip, after creating a conda environment that provides its dependencies, or directly with conda.

Installation from Source

VariantDetective can be installed from source using the following method.

# Download VariantDetective repository
git clone https://github.com/OLF-Bioinformatics/VariantDetective.git
cd VariantDetective
# Create conda environment for tools
conda create -n variantdetective -y && conda activate variantdetective
# Install specific versions of tools
conda install -n variantdetective --file spec-file.txt
# Install VariantDetective
pip install -e .

Conda Installation

conda create -n vd -y
conda activate vd
conda install -c bioconda -c conda-forge -c charronp variantdetective

Test Data

After installing VariantDetective, you can verify that it works by running it on the test data. This step confirms that the installation completed correctly and that the tool behaves as expected.

From within the VariantDetective directory, the test can be run using the following command:

variantdetective all_variants -r testdata/testdata_ref.fasta -g testdata/testdata_mut.fasta -o testdata/test

Once the tool is done, check the output. Compare the testdata/test/snp_indel/snp_final.vcf file with testdata/testdata_snp.vcf and the testdata/test/structural_variant/combined_sv.vcf file with testdata/testdata_sv.vcf.
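
One way to make this comparison is to ignore the VCF header lines, which may record run-specific details, and compare only the variant sites. The sketch below assumes that matching CHROM, POS, REF and ALT columns is a sufficient check, and that an exact match with the reference files is expected; `vcf_sites` and `same_sites` are hypothetical helper names, not part of VariantDetective.

```shell
# Hypothetical helpers, not part of VariantDetective.

# Print CHROM, POS, REF and ALT for every record in a VCF file,
# skipping header lines (those starting with '#') and sorting the result.
vcf_sites() {
    grep -v '^#' "$1" | cut -f1,2,4,5 | sort
}

# Exit status 0 when both files list the same sites, 1 when they differ,
# 2 when either file cannot be read.
same_sites() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    [ "$(vcf_sites "$1")" = "$(vcf_sites "$2")" ]
}

# Example, run from within the VariantDetective directory after the test finishes:
#   same_sites testdata/test/snp_indel/snp_final.vcf testdata/testdata_snp.vcf && echo "SNPs match"
#   same_sites testdata/test/structural_variant/combined_sv.vcf testdata/testdata_sv.vcf && echo "SVs match"
```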

Quick Usage

Find SNPs/indels and structural variants from an assembled genome (FASTA)

variantdetective all_variants -r REFERENCE.fasta -g SAMPLE.fasta

Find SNPs/indels and structural variants from raw reads (FASTQ)

variantdetective all_variants -r REFERENCE.fasta -1 SHORT_READ_1.fastq -2 SHORT_READ_2.fastq -l LONG_READ.fastq

Find SNPs/indels from an assembled genome (FASTA)

variantdetective snp_indel -r REFERENCE.fasta -g SAMPLE.fasta

Find SNPs/indels from raw reads (FASTQ)

variantdetective snp_indel -r REFERENCE.fasta -1 SHORT_READ_1.fastq -2 SHORT_READ_2.fastq 

Find structural variants from an assembled genome (FASTA)

variantdetective structural_variant -r REFERENCE.fasta -g SAMPLE.fasta

Find structural variants from raw reads (FASTQ)

variantdetective structural_variant -r REFERENCE.fasta -l LONG_READ.fastq

Combine SNP VCF files predicted by other tools and keep the consensus set of variants reported by at least 2 callers

variantdetective combine_variants --snp_vcf TOOL1.vcf TOOL2.vcf TOOL3.vcf --snp_consensus 2

Combine SV VCF files predicted by other tools and keep the consensus set of variants reported by at least 2 callers

variantdetective combine_variants --sv_vcf TOOL1.vcf TOOL2.vcf TOOL3.vcf --sv_consensus 2

List of Commands

| Command | Description |
| --- | --- |
| variantdetective all_variants | Identify structural variants (SV) from long reads (FASTQ) and SNPs/indels from short reads (FASTQ), or both types of variants from genome sequence (FASTA). If genome sequence (FASTA) is provided, reads will be simulated to predict SV, SNPs and indels. |
| variantdetective structural_variant | Identify structural variants (SV) from long reads (FASTQ) or genome sequence (FASTA). If genome sequence (FASTA) is provided, reads will be simulated to predict SVs. |
| variantdetective snp_indel | Identify SNPs/indels from short reads (FASTQ) or genome sequence (FASTA). If genome sequence (FASTA) is provided instead, reads will be simulated to predict SNPs and indels. |
| variantdetective combine_variants | Combine SNPs/indels VCF files or SV VCF files predicted from other tools. |

Variant Callers

VariantDetective makes use of published open-source variant callers and creates consensus sets in order to validate a variant.

Short Variant Callers

Intersections of VCF files are created using the VCFtools vcf-isec tool. The final VCF output consensus file containing variants found in at least 2 variant callers (default) is created using the BCFtools concat tool.
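
The "at least N callers" rule can be illustrated with standard Unix tools. This is only a sketch of the idea, not the pipeline's method: VariantDetective itself relies on vcf-isec and BCFtools concat as described above. The helper name `consensus_sites` is hypothetical, and treating two calls as the same variant when CHROM, POS, REF and ALT all match is an assumption.

```shell
# Hypothetical helper illustrating a caller-count consensus; not VariantDetective's code.
# Usage: consensus_sites MIN_CALLERS FILE1.vcf FILE2.vcf ...
# Prints CHROM, POS, REF and ALT (tab-separated) for every site that appears
# in at least MIN_CALLERS of the given uncompressed VCF files.
consensus_sites() {
    min=$1
    shift
    for vcf in "$@"; do
        # sort -u keeps one line per site per file, so each caller counts once.
        grep -v '^#' "$vcf" | cut -f1,2,4,5 | sort -u
    done | sort | uniq -c |
        awk -v min="$min" -v OFS='\t' '$1 >= min { print $2, $3, $4, $5 }'
}

# Example:
#   consensus_sites 2 TOOL1.vcf TOOL2.vcf TOOL3.vcf
```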

Structural Variant Callers

The consensus VCF file is created using the SURVIVOR merge tool. Parameters for merging structural variants are a maximum allowed distance of 1 kbp between breakpoints and calls supported by at least 3 variant callers (default) where they agree on both type and strand.
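
The merge parameters described above map onto the positional arguments of SURVIVOR merge. As a hedged sketch, the helper below only prints a command line rather than running it. The helper name and the list file callers.txt are hypothetical, the minimum SV size of 25 is taken from the --minlen_sv default on the assumption that it is passed through, and the exact call VariantDetective makes internally may differ.

```shell
# Hypothetical helper: print (do not run) a SURVIVOR merge command line.
# Usage: survivor_merge_cmd VCF_LIST MIN_CALLERS MIN_SV_LEN OUT_VCF
survivor_merge_cmd() {
    vcf_list=$1     # text file listing one caller VCF path per line
    min_callers=$2  # minimum number of supporting callers
    min_len=$3      # minimum SV size to take into account
    out_vcf=$4      # merged output VCF
    # Positional arguments of SURVIVOR merge, in order:
    #   list file, max distance between breakpoints (1000 bp = 1 kbp),
    #   min supporting callers, require same type (1 = yes),
    #   require same strand (1 = yes), legacy distance-estimation flag (0),
    #   min SV size, output file
    echo "SURVIVOR merge $vcf_list 1000 $min_callers 1 1 0 $min_len $out_vcf"
}

survivor_merge_cmd callers.txt 3 25 combined_sv.vcf
# prints: SURVIVOR merge callers.txt 1000 3 1 1 0 25 combined_sv.vcf
```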

Long Read Simulator

When a genomic FASTA file is provided as query input, long reads are simulated in order to detect variants. The long read simulation tool is adapted from Badread, a tool that creates simulated reads. It has been modified to create perfectly matching reads to the reference file and to allow multi-threading to speed up the process.
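
To get a feel for how much data is simulated, the relationship between read depth, genome size and read length can be worked through with a little arithmetic. The sketch below assumes that an absolute quantity such as 250M counts bases, as it does in Badread; whether VariantDetective interprets it identically is not stated here. The 5 Mbp genome size and both helper names are hypothetical.

```shell
# Hypothetical helpers for estimating the size of a simulated read set.

# Total number of sequence characters in an uncompressed FASTA file
# (header lines starting with '>' are skipped).
genome_length() {
    grep -v '^>' "$1" | tr -d '\n\r' | wc -c | tr -d ' '
}

# Usage: simulated_read_budget GENOME_BP DEPTH MEAN_READ_LEN
# Prints the total bases needed for the requested depth and the
# approximate number of reads, rounded up.
simulated_read_budget() {
    total_bases=$(( $1 * $2 ))
    reads=$(( (total_bases + $3 - 1) / $3 ))
    echo "$total_bases $reads"
}

# A 5 Mbp genome at 50x depth with a mean read length of 15000 bases:
simulated_read_budget 5000000 50 15000
# prints: 250000000 16667
```

Under these assumptions, 50x depth on a 5 Mbp genome corresponds to 250M bases, so the two ways of expressing the quantity describe the same amount of data.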

Parameters

All input files can be uncompressed (.fasta/.fastq) or gzipped (.fasta.gz/.fastq.gz)

| Options | Available Command | Description | Default |
| --- | --- | --- | --- |
| -r FASTA | all_variants, structural_variant, snp_indel | Path to reference genome in FASTA. Required | - |
| -g FASTA | all_variants, structural_variant, snp_indel | Path to query genomic FASTA file. Can't be combined with -1, -2 or -l | - |
| -1 FASTQ, --short1 FASTQ | all_variants, snp_indel | Path to pair 1 of short reads FASTQ file. Must always be combined with -2. If running all_variants, must be combined with -l | - |
| -2 FASTQ, --short2 FASTQ | all_variants, snp_indel | Path to pair 2 of short reads FASTQ file. Must always be combined with -1. If running all_variants, must be combined with -l | - |
| -l FASTQ, --long FASTQ | all_variants, structural_variant | Path to long reads FASTQ file. If running all_variants, must be combined with -1 and -2 | - |
| --readcov READCOV | all_variants, structural_variant, snp_indel | If using -g as input, define the absolute amount of simulated reads (e.g. 250M) or relative simulated read depth (e.g. 50x) | 50x |
| --readlen MEAN,STDEV | all_variants, structural_variant, snp_indel | If using -g as input, define the mean length and standard deviation of simulated reads | 15000,13000 |
| --mincov_snp MINCOV_SNP | all_variants, snp_indel | Minimum number of reads required to call SNP/Indel | 2 |
| --minqual_snp MINQUAL_SNP | all_variants, snp_indel | Minimum quality of SNP/Indel to be filtered out | 20 |
| --assembler {bwa,minimap2} | all_variants, snp_indel | Choose which assembler (bwa or minimap2) to use when using paired-end short reads | bwa |
| --snp_consensus SNP_CONSENSUS | all_variants, snp_indel | Specifies the minimum number of tools required to detect an SNP or Indel to include it in the consensus list | 2 |
| --mincov_sv MINCOV_SV | all_variants, structural_variant | Minimum number of reads required to call SV | 2 |
| --minlen_sv MINLEN_SV | all_variants, structural_variant | Minimum length of SV to be detected | 25 |
| --minqual_sv MINQUAL_SV | all_variants, structural_variant | Minimum quality of SV to be filtered out from SVIM | 15 |
| --sv_consensus SV_CONSENSUS | all_variants, structural_variant | Specifies the minimum number of tools required to detect an SV to include it in the consensus list | 3 |
| -o OUT, --out OUT | all_variants, structural_variant, snp_indel | Output directory. Will be created if it does not exist | ./ |
| -t THREADS, --threads THREADS | all_variants, structural_variant, snp_indel | Number of threads used for job | 1 |
| -h, --help | all_variants, structural_variant, snp_indel | Show help message and exit | - |
| -v, --version | all_variants, structural_variant, snp_indel | Show program version number and exit | - |
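
The input-combination rules listed above can be checked before launching a long run. The function below is a minimal sketch of such a pre-flight check; `check_inputs` is a hypothetical helper, not part of VariantDetective, and it encodes only the rules stated in the descriptions of -g, -1, -2 and -l. The FASTQ file names in the example are placeholders.

```shell
# Hypothetical pre-flight check mirroring the stated input rules.
# Usage: check_inputs COMMAND GENOME SHORT1 SHORT2 LONG
# Pass an empty string ("") for any option that is not supplied.
check_inputs() {
    cmd=$1; genome=$2; short1=$3; short2=$4; long=$5

    # -g cannot be combined with -1, -2 or -l
    if [ -n "$genome" ] && [ -n "$short1$short2$long" ]; then
        echo "-g cannot be combined with -1, -2 or -l" >&2
        return 1
    fi

    # -1 and -2 must always be given together
    if [ -n "$short1" ] && [ -z "$short2" ]; then
        echo "-1 must be combined with -2" >&2
        return 1
    fi
    if [ -z "$short1" ] && [ -n "$short2" ]; then
        echo "-2 must be combined with -1" >&2
        return 1
    fi

    # all_variants run on reads needs the short pair and the long reads together
    if [ "$cmd" = "all_variants" ] && [ -z "$genome" ]; then
        if [ -z "$short1" ] || [ -z "$long" ]; then
            echo "all_variants needs -1, -2 and -l together" >&2
            return 1
        fi
    fi
    return 0
}

check_inputs all_variants "" R1.fastq R2.fastq LONG.fastq && echo "inputs look consistent"
```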

Outputs

All input files will be copied to the output folder. Within the output folder, directories containing the structural_variant and snp_indel results will be created.

Output files - snp_indel directory

| Output | Description |
| --- | --- |
| snp_final.vcf | Variants that were found in at least 2 variant callers in VCF format |
| snp_final.csv | Variants that were found in at least 2 variant callers in CSV format |
| snp_final.tab | Variants that were found in at least 2 variant callers in TSV format |
| snp_final_summary.txt | Summary of different short variant types found in snp_final files |
| freebayes.haplotypecaller.clair3.vcf.gz | Variants in common between all variant callers |
| freebayes.clair3.vcf.gz | Variants in common between Freebayes and Clair3 |
| freebayes.haplotypecaller.vcf.gz | Variants in common between Freebayes and HaplotypeCaller |
| haplotypecaller.clair3.vcf.gz | Variants in common between HaplotypeCaller and Clair3 |
| clair3.unique.vcf.gz | Variants only found by Clair3 |
| freebayes.unique.vcf.gz | Variants only found by Freebayes |
| haplotypecaller.unique.vcf.gz | Variants only found by HaplotypeCaller |
| alignment.mm.rg.sorted.bam | Alignment in BAM format |
| alignment.mm.rg.sorted.bam.bai | Index file of alignments |
| clair3/ | Directory containing files related to Clair3 variant calling |
| freebayes/ | Directory containing files related to Freebayes variant calling |
| haplotypecaller/ | Directory containing files related to HaplotypeCaller variant calling |
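
For a quick look at a short-variant VCF such as snp_final.vcf, the records can be tallied by type directly from the REF and ALT columns. This is a rough sketch and does not reproduce the format of snp_final_summary.txt; `count_short_variants` is a hypothetical helper, and the classification rule (single-base REF and ALT counts as a SNP, differing lengths count as an indel, anything else as other) is an assumption.

```shell
# Hypothetical helper: tally short variants in an uncompressed VCF by type.
# Usage: count_short_variants FILE.vcf
count_short_variants() {
    grep -v '^#' "$1" | awk -F'\t' '
        {
            n = split($5, alt, ",")   # one entry per ALT allele
            for (i = 1; i <= n; i++) {
                if (length($4) == 1 && length(alt[i]) == 1) snp++
                else if (length($4) != length(alt[i]))      indel++
                else                                        other++
            }
        }
        END { printf "SNP\t%d\nINDEL\t%d\nOTHER\t%d\n", snp, indel, other }'
}

# Example:
#   count_short_variants snp_final.vcf
```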

Output files - structural_variant directory

| Output | Description |
| --- | --- |
| combined_sv.vcf | Variants that were found in at least the number of variant callers set by --sv_consensus (default 3), in VCF format |
| combined_sv.csv | Variants that were found in at least the number of variant callers set by --sv_consensus (default 3), in CSV format |
| combined_sv.tab | Variants that were found in at least the number of variant callers set by --sv_consensus (default 3), in TSV format |
| combined_sv_summary.txt | Summary of different structural variant types found in combined_sv files |
| alignment.mm.sorted.bam | Alignment in BAM format |
| alignment.mm.sorted.bam.bai | Index file of alignments |
| cutesv/ | Directory containing files related to cuteSV variant calling |
| nanosv/ | Directory containing files related to NanoSV variant calling |
| nanovar/ | Directory containing files related to NanoVar variant calling |
| svim/ | Directory containing files related to SVIM variant calling |

Reporting Issues

If you have any issues installing or running VariantDetective, or would like a new feature added to the tool, please open an issue here on GitHub.

Citing VariantDetective

The manuscript describing this tool is available at https://doi.org/10.1093/bioinformatics/btae066.

The tool should be cited as follows:

Philippe Charron, Mingsong Kang, "VariantDetective: An Accurate All-in-One Pipeline for Detecting Consensus Bacterial SNPs and SVs," Bioinformatics, Vol. 40, No. 2, February 2024, btae066, https://doi.org/10.1093/bioinformatics/btae066.