cio-abcd / variantinterpretation

Collaborative Interpretation-Pipeline workflow based on nf-core pipeline structure
MIT License

Add variant qc #27

Closed sci-kai closed 9 months ago

sci-kai commented 11 months ago

This PR adds tests that ensure only suitable VCF files run through the pipeline, along with normalization and filtering steps, and addresses issue #12.

VCF tests

It uses a combination of GATK, bcftools and a custom Python script to perform these tests. The vcftests subworkflow checks the following criteria, which are also summarized in docs/output.md:

| criteria | log-level | description | tool |
| --- | --- | --- | --- |
| VCF file format | ERROR | Checks whether the general structure of the VCF file adheres to the VCF file format. | GATK4 ValidateVariants |
| uncompressed or bgzip compressed | ERROR | Checks whether the file is either uncompressed or bgzip compressed. Gzipped files with a .vcf.gz ending give an error during indexing. | bcftools index |
| single-sample VCF | ERROR | Multi-sample VCF files are not supported. | bcftools stats |
| "chr" prefix in CHROM column | ERROR | Checks whether every entry in the CHROM column of the VCF file carries the "chr" prefix. | python script |
| matching to reference genome | ERROR | Checks whether the provided VCF file matches the provided FASTA reference genome. In particular, this can differentiate between GRCh37 and GRCh38. If left-alignment of indels is activated, bcftools norm also checks the reference genome. | GATK4 ValidateVariants |
| only passed filters | WARNING | Checks whether the FILTER column contains entries other than "PASS" or ".". NOTE: These can be removed with the filter_pass parameter in the vcfproc module. | python script |
| no-ALT entries | WARNING | Genomic VCF files (gVCFs) are supported but can dramatically increase the runtime of VEP. | bcftools stats |
| no multiallelic sites | WARNING | Checks whether the VCF file contains multiallelic variants and gives a warning. NOTE: These will be automatically split with bcftools norm in the vcfproc module. | bcftools stats |
| contains other variants than SNVs and InDels | WARNING | Checks whether the VCF file contains variants other than SNVs and InDels. | bcftools stats |
| previous VEP annotation present | WARNING | Checks whether a previous VEP annotation is present by looking for VEP in the header and for a CSQ key in the INFO column. | |
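To illustrate the record-level checks listed above (the "chr" prefix, non-PASS FILTER entries and a previous VEP annotation), here is a minimal Python sketch. It is not the script from this PR: the function name `check_vcf_lines`, the message wording and the choice to warn when either VEP indicator is found are assumptions made for illustration only.

```python
from typing import Iterable, List, Tuple


def check_vcf_lines(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (errors, warnings) for uncompressed VCF text lines."""
    errors: List[str] = []
    warnings: List[str] = []
    chroms_without_prefix = set()
    non_pass_filters = set()
    vep_in_header = False
    csq_in_info = False

    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # Meta-information lines start with "##"; look for a VEP mention there.
            if line.startswith("##") and "VEP" in line:
                vep_in_header = True
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            # Structural validation is left to a dedicated tool; skip short records here.
            continue
        chrom, filter_field, info = fields[0], fields[6], fields[7]
        if not chrom.startswith("chr"):
            chroms_without_prefix.add(chrom)
        # FILTER may hold several semicolon-separated entries.
        for entry in filter_field.split(";"):
            if entry not in ("PASS", "."):
                non_pass_filters.add(entry)
        # INFO is a semicolon-separated list of KEY or KEY=VALUE items.
        info_keys = {item.split("=", 1)[0] for item in info.split(";")}
        if "CSQ" in info_keys:
            csq_in_info = True

    if chroms_without_prefix:
        errors.append(
            'CHROM entries without "chr" prefix: '
            + ", ".join(sorted(chroms_without_prefix))
        )
    if non_pass_filters:
        warnings.append(
            "FILTER entries other than PASS or '.': "
            + ", ".join(sorted(non_pass_filters))
        )
    # Assumption: warn if either indicator of a previous VEP run is found.
    if vep_in_header or csq_in_info:
        warnings.append("Previous VEP annotation appears to be present")
    return errors, warnings


# Small made-up example with one record lacking the prefix and a non-PASS filter.
EXAMPLE = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=10",
    "2\t200\t.\tC\tT\t50\tLowQual\tDP=5",
]
errors, warnings = check_vcf_lines(EXAMPLE)
```

Collecting offending values in sets keeps the report short even for large files, since each problematic contig or filter name is listed only once.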

VCF filter and norm

I reorganized this into a separate subworkflow, vcfproc. It consists of two parts. The first uses bcftools view to optionally filter variants based on the FILTER column entry. The second uses bcftools norm to split multiallelic variants into biallelic ones, which is required by vembrane.

I also added two parameters to the pipeline:

- --filter_vcf: If null, no filtering is applied. If set to STRING, only variants whose FILTER column entry matches STRING are included (e.g. PASS). Default is null.
- --left-align-indels: If true, left-alignment of indels is performed with bcftools norm. When this is enabled, bcftools norm also performs an additional reference genome check.
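As a rough illustration of how the two parameters could map onto bcftools calls, the sketch below builds the argument lists in Python. The helper names `build_view_cmd` and `build_norm_cmd`, the file names and the output options are hypothetical, and the exact flags used inside the vcfproc subworkflow are not shown in this PR description; the sketch only relies on the documented bcftools options `view --apply-filters`, `norm --multiallelics` and `norm --fasta-ref`.

```python
from typing import List, Optional


def build_view_cmd(vcf: str, out: str, filter_vcf: Optional[str] = None) -> Optional[List[str]]:
    """Hypothetical helper: bcftools view call for FILTER-based filtering, or None to skip."""
    if filter_vcf is None:
        # Mirrors the default of --filter_vcf: no filtering is applied.
        return None
    return [
        "bcftools", "view",
        "--apply-filters", filter_vcf,  # keep only records whose FILTER matches, e.g. PASS
        "--output-type", "z",
        "--output", out,
        vcf,
    ]


def build_norm_cmd(
    vcf: str,
    out: str,
    fasta: Optional[str] = None,
    left_align_indels: bool = False,
) -> List[str]:
    """Hypothetical helper: bcftools norm call that always splits multiallelic sites."""
    cmd = ["bcftools", "norm", "--multiallelics", "-any"]  # split into biallelic records
    if left_align_indels:
        if fasta is None:
            raise ValueError("left-alignment requires a FASTA reference")
        # Supplying the reference enables left-alignment and the reference check.
        cmd += ["--fasta-ref", fasta]
    cmd += ["--output-type", "z", "--output", out, vcf]
    return cmd


# Example with made-up file names.
view_cmd = build_view_cmd("sample.vcf.gz", "sample.filtered.vcf.gz", filter_vcf="PASS")
norm_cmd = build_norm_cmd(
    "sample.filtered.vcf.gz", "sample.norm.vcf.gz",
    fasta="genome.fa", left_align_indels=True,
)
```

Each list could be passed to `subprocess.run(cmd, check=True)` on a system where bcftools is installed; in the pipeline itself these steps run as Nextflow processes rather than through Python.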

Missing features