As discussed in our meeting on 30 March 2023, we should start thinking about implementing quality control and filtering routines for the input VCF files. This issue can serve as a thread to collect ideas for future implementations.
Some ideas:
Check whether the structure of the VCF file is valid according to the VCF version 4.2 specification. As a first step, check for the standard columns and the VCF header.
Check whether the input is a GVCF that also contains non-mutated regions. These could slow down the VEP annotation process dramatically, so we could add an option to annotate only positions where the REF and ALT alleles differ, while still passing all VCF entries to the reporting module.
Check the FILTER column for flags. Warn if the user provides unfiltered VCF files, and add options to filter on expressions in the FILTER column.
Check the INFO field for VEP-based CSQ strings. If the input files are already annotated with VEP or another program, the user should be able to configure the pipeline to either (A) start after VEP annotation or (B) rerun the annotation with VEP. Option A would need additional checks of the existing annotation to ensure compatibility with downstream processes.
Optionally check whether the VCF file covers a supplied target region file. Include optional filtering to those target regions before VEP annotation.
Check which reference genome was used for the input. If it does not match the reference configured for VEP annotation, exit with an error.
Optional variant-caller-specific checks. For example, some variant callers report different metrics in the FORMAT fields, which could be handled differently.
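To make a few of these ideas more concrete, here is a rough sketch of what a first-pass check could look like. It covers only the header and standard columns, non-variant GVCF records, the FILTER column, the CSQ header, and the `##reference` line. The function name `inspect_vcf`, the report keys, and the set of symbolic ALT alleles treated as non-variant (`<NON_REF>` as written by GATK, `<*>` as written by bcftools) are assumptions for illustration, not something we have agreed on. It reads plain uncompressed text and is not a full VCF 4.2 validator.

```python
import io

STANDARD_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]
# Symbolic ALT alleles that mark non-variant gVCF records (assumption:
# GATK-style <NON_REF> and bcftools-style <*>; other callers may differ).
NON_VARIANT_ALTS = {".", "<NON_REF>", "<*>"}


def inspect_vcf(handle):
    """Collect basic QC facts from an uncompressed VCF text stream."""
    report = {
        "problems": [],
        "has_csq": False,
        "reference": None,
        "n_records": 0,
        "n_non_variant": 0,
        "filter_values": set(),
    }
    seen_column_line = False
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if line_number == 1 and not line.startswith("##fileformat=VCFv4"):
            report["problems"].append("first line is not a ##fileformat=VCFv4.x line")
        if line.startswith("##"):
            # Meta-information lines: look for a CSQ INFO definition and the reference.
            if line.startswith("##INFO=<ID=CSQ,"):
                report["has_csq"] = True
            elif line.startswith("##reference="):
                report["reference"] = line.split("=", 1)[1]
            continue
        if line.startswith("#"):
            # Column header line: the first 8 columns are fixed.
            seen_column_line = True
            if line.split("\t")[:8] != STANDARD_COLUMNS:
                report["problems"].append("column header line lacks the 8 standard columns")
            continue
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            report["problems"].append(f"line {line_number}: fewer than 8 columns")
            continue
        report["n_records"] += 1
        # A record is non-variant if every ALT allele is a placeholder.
        if set(fields[4].split(",")) <= NON_VARIANT_ALTS:
            report["n_non_variant"] += 1
        report["filter_values"].update(fields[6].split(";"))
    if not seen_column_line:
        report["problems"].append("missing #CHROM column header line")
    return report


# Small made-up example: one reference block and one variant site.
example = "\n".join([
    "##fileformat=VCFv4.2",
    "##reference=GRCh38",
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\t<NON_REF>\t.\t.\tEND=200",
    "chr1\t201\t.\tG\tT,<NON_REF>\t50\t.\tDP=30",
]) + "\n"
report = inspect_vcf(io.StringIO(example))
if report["filter_values"] <= {"."}:
    print("warning: no FILTER flags set, input looks unfiltered")
```

The report dictionary could then drive the pipeline decisions listed above, for example skipping VEP when `has_csq` is true, or comparing `reference` against the configured VEP assembly. In practice, an existing parser such as pysam or cyvcf2 would probably be a more robust basis than hand-rolled line splitting.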