Add variant qc

This PR adds tests to ensure that only suitable VCF files run through the pipeline, adds normalization and filtering steps, and addresses issue #12.

VCF tests

It uses a combination of GATK, bcftools and a custom Python script to perform these tests. The vcftests subworkflow checks the following criteria, which are also summarized in docs/output.md:

| criteria | log-level | description | tool |
| --- | --- | --- | --- |
| VCF file format | ERROR | Checks if the general structure of the VCF file adheres to the VCF file format. | `GATK4 ValidateVariants` |
| uncompressed or bgzip compressed | ERROR | Checks if the file is either uncompressed or bgzip compressed. Gzipped files with a `.vcf.gz` ending give an error during indexing. | `bcftools index` |
| single-sample VCF | ERROR | Multi-sample VCF files are not supported. | `bcftools stats` |
| "chr" prefix in CHROM column | ERROR | Checks if each entry in the CHROM column of the VCF file carries the "chr" prefix. | `python script` |
| matching to reference genome | ERROR | Checks if the provided VCF file matches the provided FASTA reference genome. In particular, this can distinguish GRCh37 from GRCh38. If left-alignment of indels is activated, `bcftools norm` also checks the reference genome. | `GATK4 ValidateVariants` |
| only passed filters | WARNING | Checks if the FILTER column contains entries other than "PASS" or ".". NOTE: These can be removed with the `filter_pass` parameter in the vcfproc module. | `python script` |
| no-ALT entries | WARNING | Genomic VCF files (gVCFs) are supported but can dramatically increase the runtime of VEP. | `bcftools stats` |
| no multiallelic sites | WARNING | Checks if the VCF file contains multiallelic variants and gives a warning. NOTE: These will be automatically split with `bcftools norm` in the vcfproc module. | `bcftools stats` |
| contains other variants than SNVs and InDels | WARNING | Checks if the VCF file contains other variant types. | `bcftools stats` |
| previous VEP annotation present | WARNING | Checks if a previous VEP annotation is present by checking for VEP in the header and whether the INFO column already contains a CSQ key. | |
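
To illustrate the three text-based checks in the table ("chr" prefix, FILTER entries, previous VEP annotation), here is a minimal sketch. It is not the script shipped in this PR: the function name `check_vcf_text` and the returned dictionary layout are hypothetical, it reads an uncompressed VCF text stream only, and it assumes a VEP warning should fire if either the header mentions VEP or any INFO field carries a CSQ key.

```python
import io


def check_vcf_text(handle):
    """Scan an uncompressed VCF text stream and collect errors and warnings.

    Hypothetical helper for illustration; not the script used in the pipeline.
    """
    findings = {"errors": [], "warnings": []}
    missing_chr = False
    non_pass = set()
    vep_in_header = False
    csq_in_info = False

    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # Meta-information lines start with "##"; look for a VEP mention there.
            if line.startswith("##") and "VEP" in line:
                vep_in_header = True
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # malformed record; format validation is left to other tools
        chrom, filt, info = fields[0], fields[6], fields[7]
        if not chrom.startswith("chr"):
            missing_chr = True
        # FILTER may hold several semicolon-separated values.
        for value in filt.split(";"):
            if value not in ("PASS", "."):
                non_pass.add(value)
        # INFO is a semicolon-separated list of KEY or KEY=VALUE entries.
        keys = {entry.split("=", 1)[0] for entry in info.split(";")}
        if "CSQ" in keys:
            csq_in_info = True

    if missing_chr:
        findings["errors"].append('CHROM entries without "chr" prefix found')
    if non_pass:
        findings["warnings"].append(
            "FILTER entries other than PASS or '.': " + ", ".join(sorted(non_pass))
        )
    # Assumption: either signal alone is enough to warn.
    if vep_in_header or csq_in_info:
        findings["warnings"].append("previous VEP annotation detected")
    return findings


if __name__ == "__main__":
    demo = (
        "##fileformat=VCFv4.2\n"
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
        "chr1\t100\t.\tA\tG\t50\tPASS\tDP=10\n"
    )
    print(check_vcf_text(io.StringIO(demo)))
```

Errors and warnings are collected separately so that a caller can abort on the former and only log the latter, mirroring the log-level column above.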

I also added indexing of VCF and FASTA files and creating a sequence dictionary for the reference FASTA file.
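
As a rough sketch of what these preparation steps could look like on the command line, the helper below only assembles the argument lists and does not run anything. The table above names `bcftools index` for VCF files; using `samtools faidx` for the FASTA index and `gatk CreateSequenceDictionary` for the dictionary is an assumption here, and the function name `build_index_commands` is hypothetical.

```python
def build_index_commands(vcf_gz, fasta):
    """Return argument lists for indexing a bgzipped VCF and preparing a FASTA.

    Hypothetical helper; the tool choice for the FASTA steps is an assumption.
    """
    return [
        # Tabix-style (.tbi) index for the bgzip-compressed VCF.
        ["bcftools", "index", "--tbi", vcf_gz],
        # FASTA index (.fai); assumed tool.
        ["samtools", "faidx", fasta],
        # Sequence dictionary (.dict) for GATK; assumed tool.
        ["gatk", "CreateSequenceDictionary", "-R", fasta],
    ]


if __name__ == "__main__":
    for cmd in build_index_commands("sample.vcf.gz", "genome.fa"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` if the tools are installed.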

VCF filter and norm

I reorganized this into a separate subworkflow, vcfproc. It comprises two parts. The first uses bcftools view to optionally filter variants based on their FILTER column entry. The second uses bcftools norm to split multi-allelic variants into biallelic ones, which vembrane requires. I also added two parameters to the pipeline:
--filter_vcf: If null, no filtering is applied. If set to STRING, only variants whose FILTER column entry matches STRING (e.g. PASS) are included. Default is null.
--left-align-indels: If true, left-align indels using bcftools norm. When enabled, bcftools norm also performs an additional reference genome check.
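
To make the interplay of the two parameters concrete, here is a minimal sketch of how they might translate into bcftools calls. It only builds the argument lists; the function name `build_vcfproc_commands` and the output file names are hypothetical, and the actual module may pass different options.

```python
def build_vcfproc_commands(vcf_in, fasta, filter_vcf=None, left_align_indels=False):
    """Return bcftools argument lists for optional filtering and normalization.

    Hypothetical helper; output file names are placeholders.
    """
    commands = []
    current = vcf_in

    # Part 1: optional filtering on the FILTER column (skipped when filter_vcf is None).
    if filter_vcf is not None:
        filtered = "filtered.vcf.gz"
        commands.append(
            ["bcftools", "view", "-f", filter_vcf, "-Oz", "-o", filtered, current]
        )
        current = filtered

    # Part 2: always split multi-allelic records into biallelic ones.
    norm = ["bcftools", "norm", "-m", "-any"]
    if left_align_indels:
        # Supplying the reference enables left-alignment and the reference check.
        norm += ["-f", fasta]
    norm += ["-Oz", "-o", "normalized.vcf.gz", current]
    commands.append(norm)
    return commands


if __name__ == "__main__":
    for cmd in build_vcfproc_commands(
        "sample.vcf.gz", "genome.fa", filter_vcf="PASS", left_align_indels=True
    ):
        print(" ".join(cmd))
```

With the defaults, only the `bcftools norm` split is produced, which matches the description that filtering and left-alignment are both opt-in.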

Missing features

Testing and filtering by target regions: This is best implemented once the TMB-branch is merged, as this also includes the bedfile parameters.
Implement a filter for removing non-variant entries in gVCF files. This speeds up the annotation process. These entries could optionally be added back to the final output without annotation to improve runtime.

cio-abcd / variantinterpretation

Add variant qc #27
