This program is designed to identify short variants and structural variants. Variants can be identified from genomic sequences (FASTA) or from combinations of short and/or long reads (FASTQ). If genomic sequences are provided as input, long reads will be simulated to detect variants.
This tool makes use of other open-source variant callers and creates consensus sets in order to validate a variant. Summary files for short variants and structural variants are generated outlining the different types of variants found in the sample.
Phil Charron \<phil.charron@inspection.gc.ca>
All software and tools used by VariantDetective are listed in spec-file.txt. VariantDetective can be installed with pip, after creating a conda environment that provides these tools, or directly with conda.
VariantDetective can be installed from source using the following method.
# Download VariantDetective repository
git clone https://github.com/OLF-Bioinformatics/VariantDetective.git
cd VariantDetective
# Create conda environment for tools
conda create -n variantdetective -y && conda activate variantdetective
# Install specific versions of tools
conda install -n variantdetective --file spec-file.txt
# Install VariantDetective
pip install -e .
VariantDetective can also be installed with conda using the following method.
conda create -n vd -y
conda activate vd
conda install -c bioconda -c conda-forge -c charronp variantdetective
After installing VariantDetective, you can verify that it works by running it on the test data. This step confirms that the installation completed correctly and that the tool behaves as expected.
From within the VariantDetective directory, the test can be run using the following command:
variantdetective all_variants -r testdata/testdata_ref.fasta -g testdata/testdata_mut.fasta -o testdata/test
Once the tool is done, check the output. Compare the testdata/test/snp_indel/snp_final.vcf file with testdata/testdata_snp.vcf and the testdata/test/structural_variant/combined_sv.vcf file with testdata/testdata_sv.vcf.
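The comparison above can be done by eye or with a diff tool. As a minimal sketch, assuming that comparing only the CHROM, POS, REF and ALT columns of uncompressed VCF files is enough, the check could be scripted as follows. The function names `vcf_records` and `compare_vcfs` are hypothetical and are not part of VariantDetective.

```python
def vcf_records(path):
    """Return the set of (CHROM, POS, REF, ALT) tuples in an uncompressed VCF."""
    records = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            records.add((chrom, int(pos), ref, alt))
    return records


def compare_vcfs(observed, expected):
    """Return (missing, unexpected) record sets between two VCF files.

    missing:    records in `expected` that are absent from `observed`
    unexpected: records in `observed` that are absent from `expected`
    """
    obs = vcf_records(observed)
    exp = vcf_records(expected)
    return exp - obs, obs - exp
```

Calling `compare_vcfs("testdata/test/snp_indel/snp_final.vcf", "testdata/testdata_snp.vcf")` returns two sets; if both are empty, the two files contain the same calls under this simplified comparison. Structural variant breakpoints may need a tolerance rather than an exact match, which this sketch does not handle.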
Find SNPs/indels and structural variants from an assembled genome (FASTA)
variantdetective all_variants -r REFERENCE.fasta -g SAMPLE.fasta
Find SNPs/indels and structural variants from raw reads (FASTQ)
variantdetective all_variants -r REFERENCE.fasta -1 SHORT_READ_1.fastq -2 SHORT_READ_2.fastq -l LONG_READ.fastq
Find SNPs/indels from an assembled genome (FASTA)
variantdetective snp_indel -r REFERENCE.fasta -g SAMPLE.fasta
Find SNPs/indels from raw reads (FASTQ)
variantdetective snp_indel -r REFERENCE.fasta -1 SHORT_READ_1.fastq -2 SHORT_READ_2.fastq
Find structural variants from an assembled genome (FASTA)
variantdetective structural_variant -r REFERENCE.fasta -g SAMPLE.fasta
Find structural variants from raw reads (FASTQ)
variantdetective structural_variant -r REFERENCE.fasta -l LONG_READ.fastq
Combine SNP VCF files predicted by other tools and keep the variants reported by at least 2 callers
variantdetective combine_variants --snp_vcf TOOL1.vcf TOOL2.vcf TOOL3.vcf --snp_consensus 2
Combine SV VCF files predicted by other tools and keep the variants reported by at least 2 callers
variantdetective combine_variants --sv_vcf TOOL1.vcf TOOL2.vcf TOOL3.vcf --sv_consensus 2
| Command | Description |
|---|---|
| variantdetective all_variants | Identify structural variants (SV) from long reads (FASTQ) and SNPs/indels from short reads (FASTQ), or both types of variants from genome sequence (FASTA). If genome sequence (FASTA) is provided, reads will be simulated to predict SV, SNPs and indels. |
| variantdetective structural_variant | Identify structural variants (SV) from long reads (FASTQ) or genome sequence (FASTA). If genome sequence (FASTA) is provided, reads will be simulated to predict SVs. |
| variantdetective snp_indel | Identify SNPs/indels from short reads (FASTQ) or genome sequence (FASTA). If genome sequence (FASTA) is provided instead, reads will be simulated to predict SNPs and indels. |
| variantdetective combine_variants | Combine SNPs/indels VCF files or SV VCF files predicted from other tools. |
VariantDetective makes use of published open-source variant callers and creates consensus sets in order to validate a variant.
For SNPs and indels, intersections of VCF files are created using the VCFtools vcf-isec tool. The final consensus VCF file, containing variants found by at least 2 variant callers (default), is created using the BCFtools concat tool.
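The consensus rule for short variants can be illustrated in a few lines of Python. This is only a sketch of the underlying logic, assuming each variant is identified by its (CHROM, POS, REF, ALT) tuple; the pipeline itself relies on vcf-isec and BCFtools concat as described above, and the function name `consensus_variants` and the example calls are hypothetical.

```python
from collections import Counter


def consensus_variants(call_sets, min_callers=2):
    """Keep variants reported by at least `min_callers` of the given call sets."""
    counts = Counter()
    for calls in call_sets:
        counts.update(set(calls))  # each caller counts once per variant
    return sorted(v for v, n in counts.items() if n >= min_callers)


# Made-up calls from three callers, keyed by (CHROM, POS, REF, ALT).
freebayes = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")}
haplotypecaller = {("chr1", 100, "A", "G"), ("chr1", 400, "G", "GA")}
clair3 = {("chr1", 100, "A", "G"), ("chr1", 400, "G", "GA"), ("chr1", 900, "T", "C")}

consensus = consensus_variants([freebayes, haplotypecaller, clair3], min_callers=2)
print(consensus)
# [('chr1', 100, 'A', 'G'), ('chr1', 400, 'G', 'GA')]
```

Raising `min_callers` to 3 in this example keeps only the variant at position 100, which is the one all three call sets share.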
For structural variants, the consensus VCF file is created using the SURVIVOR merge tool. Structural variants are merged when their breakpoints lie within 1 kbp of each other and the calls agree on both type and strand; a merged call is kept when it is supported by at least 3 variant callers (default).
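To make the merging criteria concrete, here is a deliberately simplified sketch. It is not SURVIVOR's algorithm: it greedily groups calls that share chromosome, type and strand and whose start and end both lie within `max_dist` of the first call in a group, then keeps groups supported by enough distinct callers. The `SVCall` type, the function name and the example coordinates are all hypothetical.

```python
from typing import NamedTuple


class SVCall(NamedTuple):
    caller: str
    chrom: str
    start: int
    end: int
    svtype: str
    strand: str


def merge_sv_calls(calls, max_dist=1000, min_callers=3):
    """Greedy grouping of SV calls; returns groups supported by >= min_callers callers."""
    clusters = []
    ordered = sorted(calls, key=lambda c: (c.chrom, c.svtype, c.strand, c.start, c.end))
    for call in ordered:
        for cluster in clusters:
            anchor = cluster[0]  # compare against the first call of the group
            same_kind = (call.chrom, call.svtype, call.strand) == (
                anchor.chrom, anchor.svtype, anchor.strand)
            if (same_kind
                    and abs(call.start - anchor.start) <= max_dist
                    and abs(call.end - anchor.end) <= max_dist):
                cluster.append(call)
                break
        else:
            clusters.append([call])  # no compatible group found, start a new one
    return [c for c in clusters if len({x.caller for x in c}) >= min_callers]


calls = [
    SVCall("cutesv", "chr1", 5000, 7000, "DEL", "+-"),
    SVCall("svim", "chr1", 5100, 7050, "DEL", "+-"),
    SVCall("nanovar", "chr1", 4950, 6900, "DEL", "+-"),
    SVCall("nanosv", "chr1", 9000, 11000, "DEL", "+-"),   # too far away to merge
    SVCall("cutesv", "chr1", 20000, 20001, "INS", "+-"),
    SVCall("svim", "chr1", 20010, 20011, "INS", "+-"),    # only 2 callers agree
]

merged = merge_sv_calls(calls)
```

With the default thresholds, only the deletion near position 5000 survives, because it is the only event reported by three different callers within 1 kbp.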
When a genomic FASTA file is provided as query input, long reads are simulated in order to detect variants. The long read simulation tool is adapted from Badread, a tool that creates simulated reads. It has been modified to create perfectly matching reads to the reference file and to allow multi-threading to speed up the process.
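The idea of error-free read simulation can be sketched as follows. This is an illustration only and not the modified Badread code: it assumes read lengths are drawn from a gamma distribution parameterised by a mean and standard deviation, samples from the forward strand of a linear sequence, and stops once the requested depth is reached. The function name and parameters are hypothetical.

```python
import random


def simulate_perfect_reads(sequence, depth=50.0, mean_len=15000, stdev_len=13000, seed=0):
    """Sample error-free substrings of `sequence` until `depth`-fold coverage is reached."""
    rng = random.Random(seed)
    shape = (mean_len / stdev_len) ** 2   # gamma shape k = mean^2 / stdev^2
    scale = stdev_len ** 2 / mean_len     # gamma scale theta = stdev^2 / mean
    target_bases = depth * len(sequence)
    reads, total = [], 0
    while total < target_bases:
        # Clamp the sampled length so a read never exceeds the sequence.
        length = max(1, min(len(sequence), round(rng.gammavariate(shape, scale))))
        start = rng.randint(0, len(sequence) - length)
        read = sequence[start:start + length]
        reads.append(read)
        total += len(read)
    return reads


# Small made-up genome so the example runs quickly.
_rng = random.Random(1)
genome = "".join(_rng.choice("ACGT") for _ in range(2000))
reads = simulate_perfect_reads(genome, depth=5, mean_len=300, stdev_len=200)
```

Every simulated read is an exact substring of the input, so any difference found when these reads are aligned to the reference comes from the sample sequence rather than from sequencing error.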
All input files can be uncompressed (.fasta/.fastq) or gzipped (.fasta.gz/.fastq.gz).
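If you pre-process inputs yourself before handing them to the tool, plain and gzipped files can be read through one code path. The sketch below assumes detection by the gzip magic bytes rather than by file extension; `open_maybe_gzip` and `count_fasta_records` are hypothetical helper names, not functions provided by VariantDetective.

```python
import gzip


def open_maybe_gzip(path):
    """Open a plain or gzipped text file for reading, based on its first two bytes."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"  # gzip magic number
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def count_fasta_records(path):
    """Count sequences in a FASTA file, compressed or not."""
    with open_maybe_gzip(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Checking the magic bytes keeps the helper working for files whose extension does not reflect their compression.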
| Options | Available Command | Description | Default |
|---|---|---|---|
| -r FASTA | all_variants, structural_variant, snp_indel | Path to reference genome in FASTA. Required | - |
| -g FASTA | all_variants, structural_variant, snp_indel | Path to query genomic FASTA file. Can't be combined with -1, -2 or -l | - |
| -1 FASTQ, --short1 FASTQ | all_variants, snp_indel | Path to pair 1 of short reads FASTQ file. Must always be combined with -2. If running all_variants, must be combined with -l | - |
| -2 FASTQ, --short2 FASTQ | all_variants, snp_indel | Path to pair 2 of short reads FASTQ file. Must always be combined with -1. If running all_variants, must be combined with -l | - |
| -l FASTQ, --long FASTQ | all_variants, structural_variant | Path to long reads FASTQ file. If running all_variants, must be combined with -1 and -2 | - |
| --readcov READCOV | all_variants, structural_variant, snp_indel | If using -g as input, define the absolute amount of simulated reads (e.g. 250M) or relative simulated read depth (e.g. 50x) | 50x |
| --readlen MEAN,STDEV | all_variants, structural_variant, snp_indel | If using -g as input, define the mean length and standard deviation of simulated reads | 15000,13000 |
| --mincov_snp MINCOV_SNP | all_variants, snp_indel | Minimum number of reads required to call SNP/Indel | 2 |
| --minqual_snp MINQUAL_SNP | all_variants, snp_indel | Minimum quality of SNP/Indel; calls below this value are filtered out | 20 |
| --assembler {bwa,minimap2} | all_variants, snp_indel | Choose which aligner (bwa or minimap2) to use with paired-end short reads | bwa |
| --snp_consensus SNP_CONSENSUS | all_variants, snp_indel | Minimum number of tools that must detect an SNP or Indel for it to be included in the consensus list | 2 |
| --mincov_sv MINCOV_SV | all_variants, structural_variant | Minimum number of reads required to call SV | 2 |
| --minlen_sv MINLEN_SV | all_variants, structural_variant | Minimum length of SV to be detected | 25 |
| --minqual_sv MINQUAL_SV | all_variants, structural_variant | Minimum quality of SV from SVIM; calls below this value are filtered out | 15 |
| --sv_consensus SV_CONSENSUS | all_variants, structural_variant | Minimum number of tools that must detect an SV for it to be included in the consensus list | 3 |
| -o OUT, --out OUT | all_variants, structural_variant, snp_indel | Output directory. Will be created if it does not exist | ./ |
| -t THREADS, --threads THREADS | all_variants, structural_variant, snp_indel | Number of threads used for job | 1 |
| -h, --help | all_variants, structural_variant, snp_indel | Show help message and exit | - |
| -v, --version | all_variants, structural_variant, snp_indel | Show program version number and exit | - |
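The input rules in the table (which of -g, -1, -2 and -l may be combined for each command) can be checked before launching a run. The following is a sketch written from the table alone and is not the tool's own argument validation; `check_inputs` and its error messages are hypothetical, and the required -r option is not covered.

```python
def check_inputs(command, genome=None, short1=None, short2=None, long_reads=None):
    """Return a list of problems with the input combination (empty if valid)."""
    problems = []
    reads = [short1, short2, long_reads]
    if genome and any(reads):
        problems.append("-g cannot be combined with -1, -2 or -l")
    if bool(short1) != bool(short2):
        problems.append("-1 and -2 must be given together")
    if command == "all_variants" and not genome:
        if not (short1 and short2 and long_reads):
            problems.append("all_variants needs -1, -2 and -l together")
    if command == "snp_indel" and long_reads:
        problems.append("-l is not available for snp_indel")
    if command == "structural_variant" and (short1 or short2):
        problems.append("-1/-2 are not available for structural_variant")
    if not genome and not any(reads):
        problems.append("provide either -g or reads")
    return problems


# Valid: an assembled genome on its own.
print(check_inputs("all_variants", genome="SAMPLE.fasta"))
# []

# Invalid: all_variants with short reads but no long reads.
print(check_inputs("all_variants", short1="R1.fastq", short2="R2.fastq"))
# ['all_variants needs -1, -2 and -l together']
```

Returning a list of messages rather than raising on the first problem lets a wrapper script report every issue with a sample sheet in one pass.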
All input files will be copied to the output folder. Within the output folder, directories containing the structural_variant and snp_indel results will be created.
snp_indel directory

| Output | Description |
|---|---|
| snp_final.vcf | Variants that were found in at least 2 variant callers in VCF format |
| snp_final.csv | Variants that were found in at least 2 variant callers in CSV format |
| snp_final.tab | Variants that were found in at least 2 variant callers in TSV format |
| snp_final_summary.txt | Summary of different short variant types found in snp_final files |
| freebayes.haplotypecaller.clair3.vcf.gz | Variants in common between all variant callers |
| freebayes.clair3.vcf.gz | Variants in common between Freebayes and Clair3 |
| freebayes.haplotypecaller.vcf.gz | Variants in common between Freebayes and HaplotypeCaller |
| haplotypecaller.clair3.vcf.gz | Variants in common between HaplotypeCaller and Clair3 |
| clair3.unique.vcf.gz | Variants only found by Clair3 |
| freebayes.unique.vcf.gz | Variants only found by Freebayes |
| haplotypecaller.unique.vcf.gz | Variants only found by HaplotypeCaller |
| alignment.mm.rg.sorted.bam | Alignment in BAM format |
| alignment.mm.rg.sorted.bam.bai | Index file of alignments |
| clair3/ | Directory containing files related to Clair3 variant calling |
| freebayes/ | Directory containing files related to Freebayes variant calling |
| haplotypecaller/ | Directory containing files related to HaplotypeCaller variant calling |
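A per-type tally similar in spirit to the short variant summary can be derived from the REF and ALT columns of a VCF. The categories below (SNP, MNP, insertion, deletion) are an assumption made for this sketch and may not match the contents of snp_final_summary.txt; the function names are hypothetical and multi-allelic ALT fields are not handled.

```python
from collections import Counter


def classify_short_variant(ref, alt):
    """Classify a single-ALT variant by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"  # same-length substitution of several bases
    return "insertion" if len(alt) > len(ref) else "deletion"


def summarise(records):
    """Count variant types for an iterable of (REF, ALT) pairs."""
    return Counter(classify_short_variant(ref, alt) for ref, alt in records)


summary = summarise([("A", "G"), ("C", "CT"), ("GTA", "G"), ("AT", "GC")])
```

In this example each of the four categories is counted once.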
structural_variant directory

| Output | Description |
|---|---|
| combined_sv.vcf | Variants that were found by at least the number of variant callers set with --sv_consensus (default 3), in VCF format |
| combined_sv.csv | Variants that were found by at least the number of variant callers set with --sv_consensus (default 3), in CSV format |
| combined_sv.tab | Variants that were found by at least the number of variant callers set with --sv_consensus (default 3), in TSV format |
| combined_sv_summary.txt | Summary of different structural variant types found in combined_sv files |
| alignment.mm.sorted.bam | Alignment in BAM format |
| alignment.mm.sorted.bam.bai | Index file of alignments |
| cutesv/ | Directory containing files related to cuteSV variant calling |
| nanosv/ | Directory containing files related to NanoSV variant calling |
| nanovar/ | Directory containing files related to NanoVar variant calling |
| svim/ | Directory containing files related to SVIM variant calling |
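For a quick look at which structural variant types a VCF contains, the SVTYPE key of the INFO column can be tallied. This sketch assumes the file records the type in INFO as `SVTYPE=...`, as the VCF specification defines for structural variants; whether combined_sv_summary.txt is produced this way is not stated here, and `count_sv_types` is a hypothetical name.

```python
from collections import Counter


def count_sv_types(vcf_lines):
    """Count SVTYPE values over the data lines of a VCF."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        info = line.rstrip("\n").split("\t")[7]  # INFO is the 8th column
        fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
        counts[fields.get("SVTYPE", "unknown")] += 1
    return counts


# Made-up records for illustration.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t5000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=7000\n",
    "chr1\t20000\t.\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=350\n",
    "chr1\t42000\t.\tN\t<DEL>\t.\tPASS\tPRECISE;SVTYPE=DEL;END=42800\n",
]
sv_counts = count_sv_types(example)
```

The function accepts any iterable of lines, so an open file handle for combined_sv.vcf can be passed directly.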
If you have any issues installing or running VariantDetective, or would like a new feature added to the tool, please open an issue here on GitHub.
The manuscript describing this tool is available here.
The tool should be cited as follows:
Philippe Charron, Mingsong Kang, "VariantDetective: An Accurate All-in-One Pipeline for Detecting Consensus Bacterial SNPs and SVs," Bioinformatics, Vol. 40, No. 2, February 2024, btae066, https://doi.org/10.1093/bioinformatics/btae066.