This PR implements tests to ensure that only suitable VCF files run through the pipeline, adds normalization and filtering steps, and addresses issue #12.
VCF tests
It uses a combination of gatk, bcftools and a custom Python script to perform these tests.
The vcftests subworkflow checks for the following criteria, which are also summarized in docs/output.md:
| criteria | log-level | description | tool |
| --- | --- | --- | --- |
| VCF file format | ERROR | Checks if the general structure of the VCF file adheres to the VCF file format. | GATK4 ValidateVariants |
| uncompressed or bgzip compressed | ERROR | Checks if the file is either uncompressed or bgzip compressed. Gzipped files with a .vcf.gz ending cause an error during indexing. | bcftools index |
| single-sample VCF | ERROR | Multi-sample VCF files are not supported. | bcftools stats |
| "chr" prefix in CHROM column | ERROR | Checks if every entry in the CHROM column of the VCF file has the "chr" prefix. | python script |
| matching to reference genome | ERROR | Checks if the provided VCF file matches the provided FASTA reference genome. In particular, this can differentiate between GRCh37 and GRCh38. If left-alignment of indels is activated, bcftools norm also checks the reference genome. | GATK4 ValidateVariants |
| only passed filters | WARNING | Checks if the FILTER column contains entries other than "PASS" or ".". NOTE: These can be removed with the filter_pass parameter in the vcfproc module. | python script |
| no-ALT entries | WARNING | Genomic VCF files (gVCFs) are supported but can dramatically increase the runtime of VEP. | bcftools stats |
| no multiallelic sites | WARNING | Checks if the VCF file contains multiallelic variants and gives a warning. NOTE: These will be automatically split with bcftools norm in the vcfproc module. | bcftools stats |
| contains other variants than SNVs and InDels | WARNING | Checks if the VCF file contains variant types other than SNVs and InDels. | bcftools stats |
| previous VEP annotation present | WARNING | Checks if a previous VEP annotation is present by checking for VEP in the header and whether the INFO column already contains a CSQ key. | |
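To make the record-level checks more concrete, here is a minimal sketch of how the "chr" prefix, FILTER and previous-VEP checks could look in plain Python. It is not the script added in this PR: the function name check_vcf_lines, the message texts and the choice to warn when either the header or the INFO column points to VEP are assumptions for illustration only.

```python
from typing import Dict, Iterable, List


def check_vcf_lines(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Return ERROR and WARNING messages for three of the criteria listed above.

    Hypothetical helper: names and messages are illustrative assumptions.
    """
    missing_chr = False      # any CHROM value without the "chr" prefix
    non_pass_filter = False  # any FILTER value other than "PASS" or "."
    vep_in_header = False    # "VEP" mentioned in a "##" meta line
    csq_in_info = False      # INFO column already carries a CSQ key

    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("##"):
            if "VEP" in line:
                vep_in_header = True
            continue
        if line.startswith("#"):
            continue  # the "#CHROM ..." column header line
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # malformed record; format validation is left to other tools
        chrom, filt, info = fields[0], fields[6], fields[7]
        if not chrom.startswith("chr"):
            missing_chr = True
        if filt not in ("PASS", "."):
            non_pass_filter = True
        # INFO is a ";"-separated list of KEY or KEY=VALUE entries
        if any(entry.split("=", 1)[0] == "CSQ" for entry in info.split(";")):
            csq_in_info = True

    report: Dict[str, List[str]] = {"ERROR": [], "WARNING": []}
    if missing_chr:
        report["ERROR"].append('CHROM column contains entries without "chr" prefix')
    if non_pass_filter:
        report["WARNING"].append('FILTER column contains entries other than "PASS" or "."')
    if vep_in_header or csq_in_info:  # assumption: either signal triggers the warning
        report["WARNING"].append("previous VEP annotation seems to be present")
    return report
```

The function takes any iterable of text lines, so it can be fed an open file handle for an uncompressed VCF or the output of gzip.open(path, "rt") for a bgzip-compressed one.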
I also added indexing of VCF and FASTA files and the creation of a sequence dictionary for the reference FASTA file.
VCF filter and norm
I reorganized this into a separate subworkflow vcfproc. It comprises two parts:
The first includes bcftools view for optional filtering of variants based on the FILTER column entry.
The other includes bcftools norm and performs splitting of multi-allelic into biallelic variants, which is required by vembrane.
I also added two parameters to the pipeline:
--filter_vcf: If null, no filtering is applied. If set to STRING, only FILTER column entries passing STRING will be included (e.g. PASS). Default is null.
--left-align-indels: If true, perform left-alignment of InDels using bcftools norm. When this is enabled, bcftools norm also performs an additional reference genome check.
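The effect of the two parameters on the bcftools calls can be sketched roughly as follows. This is an illustration, not the module code from this PR: the function name build_vcfproc_commands and the output file names are hypothetical, and the commands are only assembled as argument lists, not executed. The bcftools options used (view --apply-filters, norm --multiallelics -any, norm --fasta-ref) are standard bcftools flags.

```python
from typing import List, Optional


def build_vcfproc_commands(
    vcf: str,
    fasta: str,
    filter_vcf: Optional[str] = None,
    left_align_indels: bool = False,
) -> List[List[str]]:
    """Assemble bcftools argument lists for the filter and norm steps.

    Hypothetical helper; output file names are placeholders.
    """
    commands: List[List[str]] = []
    current = vcf

    # Part 1: optional filtering on the FILTER column (skipped when filter_vcf is None)
    if filter_vcf is not None:
        filtered = "filtered.vcf.gz"
        commands.append([
            "bcftools", "view",
            "--apply-filters", filter_vcf,
            "--output-type", "z",
            "--output", filtered,
            current,
        ])
        current = filtered

    # Part 2: always split multiallelic sites into biallelic records
    norm = ["bcftools", "norm", "--multiallelics", "-any"]
    if left_align_indels:
        # Passing the reference enables left-alignment; bcftools norm then
        # also compares REF alleles against the FASTA file
        norm += ["--fasta-ref", fasta]
    norm += ["--output-type", "z", "--output", "normalized.vcf.gz", current]
    commands.append(norm)
    return commands
```

Each returned list could be passed to subprocess.run(cmd, check=True); with the defaults only the bcftools norm split step is produced.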
Missing features
Testing and filtering by target regions: This is best implemented once the TMB-branch is merged, as this also includes the bedfile parameters.
Implement a filter for removing non-variant entries in gVCF files. This speeds up the annotation process. These entries could optionally be added back to the final output without being annotated, which keeps the runtime low.
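One possible shape for such a gVCF filter, as a sketch only: it assumes that non-variant records can be recognized by an ALT column consisting solely of ".", "<NON_REF>" or "<*>", which matches common gVCF conventions but may not cover every caller. The function name drop_nonvariant_records is hypothetical.

```python
from typing import Iterable, Iterator

# Assumption: ALT values that mark a record as carrying no real alternate allele
NON_VARIANT_ALTS = {".", "<NON_REF>", "<*>"}


def drop_nonvariant_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and records that have at least one real ALT allele."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep all header lines unchanged
            continue
        fields = line.rstrip("\r\n").split("\t")
        alts = fields[4].split(",") if len(fields) > 4 else []
        if alts and all(alt in NON_VARIANT_ALTS for alt in alts):
            continue  # reference block or no-ALT entry: skip it
        yield line
```

Records such as "G,<NON_REF>" are kept, because they still carry a real alternate allele next to the gVCF placeholder.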