TRON-Bioinformatics / covigator-ngs-pipeline

A Nextflow pipeline for NGS variant calling on SARS-CoV-2. From FASTQ files to normalized and annotated VCF files from GATK, BCFtools, LoFreq and iVar.
MIT License

Produced VCFs are claimed to be malformed by IGV #57

Open priesgo opened 10 months ago

priesgo commented 10 months ago

When trying to load a VCF in IGV it gives the following error message:

The provided VCF file is malformed at approximately line number 69: The VCF specification does not allow for whitespace in the INFO field. Offending field value was "DP=29;AF=0.103448;SB=0;DP4=13,13,1,2;INDEL;HRUN=5;ANN=C|frameshift_variant|HIGH|ORF1ab|gene-GU280_gp01|transcript|TRANSCRIPT_gene-GU280_gp01|protein_coding|1/1|c.10122delT|p.S3376fs|10122/21290|10122/21290|3374/7095||WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS;LOF=(ORF1ab|gene-GU280_gp01|1|1.00);CONS_HMM_SARS_COV_2=0.57215;CONS_HMM_SARBECOVIRUS=0.57215;CONS_HMM_VERTEBRATE_COV=0;PFAM_NAME=Peptidase_C30_CoV;PFAM_DESCRIPTION=Peptidase C30,coronavirus;vafator_af=0.103448;vafator_ac=3;vafator_dp=29",

Apparently, the PFAM_DESCRIPTION field contains whitespace. A possible solution would affect both the pipeline and the processor. The pipeline would need to generate valid VCF, for instance by replacing whitespace with underscores. The processor would then need to convert the underscores back into whitespace when loading the data into the database. One possible problem with this approach is that there may be other underscores in INFO fields that we don't want to replace with whitespace.
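To make the underscore idea concrete, here is a minimal Python sketch. It is not code from the pipeline or the processor. The function names `sanitize_info` and `restore_info_value` are hypothetical, and it assumes the substitution is limited to a single named INFO key, so that underscores in other fields (for example `PFAM_NAME=Peptidase_C30_CoV`) are left alone.

```python
import re


def sanitize_info(info: str, key: str = "PFAM_DESCRIPTION") -> str:
    """Replace whitespace with underscores in the value of one INFO key.

    Hypothetical helper: only the field named `key` is modified; flags
    without a value (e.g. INDEL) and all other fields pass through unchanged.
    """
    fields = []
    for field in info.split(";"):
        name, sep, value = field.partition("=")
        if sep and name == key:
            value = re.sub(r"\s", "_", value)
        fields.append(name + sep + value)
    return ";".join(fields)


def restore_info_value(value: str) -> str:
    """Reverse step for the processor: underscores back to spaces.

    Assumption: the original value contained no underscores of its own,
    otherwise the round trip is lossy.
    """
    return value.replace("_", " ")


info = "DP=29;INDEL;PFAM_NAME=Peptidase_C30_CoV;PFAM_DESCRIPTION=Peptidase C30,coronavirus;vafator_dp=29"
clean = sanitize_info(info)
```

Restricting the reverse step to the same key on the processor side avoids touching underscores in other INFO fields, but it does not help if the description itself ever contains an underscore.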

priesgo commented 10 months ago

Three options at least:

priesgo commented 10 months ago

Fourth option: remove the Pfam long description altogether if it is not used in the dashboard.
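As a rough illustration of this fourth option, the sketch below drops one key from an INFO string. The function name `drop_info_key` is hypothetical and this is not how the pipeline is implemented; it only shows the string-level effect on a single record.

```python
def drop_info_key(info: str, key: str = "PFAM_DESCRIPTION") -> str:
    """Return the INFO string without the field named `key`.

    Hypothetical helper. If no fields remain, "." is returned, which is
    the VCF placeholder for an empty INFO column.
    """
    kept = [f for f in info.split(";") if f.partition("=")[0] != key]
    return ";".join(kept) if kept else "."


info = "DP=29;INDEL;PFAM_NAME=Peptidase_C30_CoV;PFAM_DESCRIPTION=Peptidase C30,coronavirus;vafator_dp=29"
stripped = drop_info_key(info)
```

In practice the matching `##INFO` header line would presumably need to go as well; an existing tool such as `bcftools annotate -x INFO/PFAM_DESCRIPTION` may be a simpler way to do both at once.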