brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License
357 stars 55 forks source link

Malformed VCF being generated -- malformed headers and whitespace #136

Closed damioresegun closed 3 years ago

damioresegun commented 3 years ago

Hi, Following the example here: https://brentp.github.io/vcfanno/examples/gff_sv/gff_sv/; I managed to carry out annotation of my vcf file using a gff file for the same isolate. Conf and lua files used are attached. In the conf file, multiple "names" were added because the gff file identifies genes using different terms based on the sources used within it. Command used was: vcfanno -p 32 -ends -lua gff_sv.lua gff_sv.conf isolatename.vcf > Annotatedisolate.vcf Everything worked well, I saw the annotations present and was about to validate by visualising them on IGV and got an error:

"Error loading /home/doresegu/scratch/private/PublicationAnal/AnnotateVCF/sks047Assem.vcf: Unable to parse header with error: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file, for input source: /home/doresegu/scratch/private/PublicationAnal/AnnotateVCF/sks047Assem.vcf"

I tried another file and got another error: "Error loading /home/doresegu/scratch/private/PublicationAnal/AnnotateVCF/sks047ONT.vcf: The provided VCF file is malformed at approximately line number 70: The VCF specification does not allow for whitespace in the INFO field. Offending field value was "PRECISE;SVTYPE=DEL;SVLEN=-47;END=69669;CIPOS=-26,26;CILEN=-9,9;RE=22;RNAMES=e67442f0-666c-4c58-a4dc-2d8a56948018,12e39d4c-e6fa-4aef-97d7-b665468a2d04,28e77da8-681d-4cca-bb7d-7719bac5a260,4e01192e-9864-4415-ac76-21db47b8ec8b,50cb02b2-7936-49d4-8d95-2029319d3642,99a31c50-fbb9-49b3-957b-5e7894d5f5cf,9c3d4539-8700-477b-b42d-0a8beb401751,9e24b0e8-cea9-4d60-9f22-fb57ac762ad6,d1cd70ee-0671-4cb4-b6fd-28c3939377ee,d512485e-6333-4746-b2b6-cf491c20bf11,df31bab4-5db3-4aac-8e9b-29dfa3cb3e19,e1066498-9cde-4932-a5c4-e67249147249,722f10c2-2d0b-4f00-acf3-fac7fc67adf8,82efd438-dab8-4449-b9c5-bcc6b31bca34,f1bf235f-0f7e-427a-bb92-f73733fd3b3d,82c87593-a1b5-4bac-ad78-7d7c5bd9975b,9b3967b2-8ee8-4d61-8c58-d9c3063ff39d,86b137d3-3b5e-4811-913a-14051bf1a8c6,4b3ac4f8-352a-41ff-bff8-c9d9593b059e,c613309d-3df8-4936-826a-95dea6351ca6,83c87a56-bdb3-4ff7-bbfd-0e6b455a60d1,4af1bd20-ae3c-4f52-9738-225b94915840;STRAND=+-;Gene_name=PKA1H1_STAND_CTG000068;protein_match=;polypeptide_product=term%3Dchitinase%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:PKNH_0100800.1,Dbxref=UniProt:Q76EB4,EC:3.2.1.14,Ontology_term=GO:0004553,GO:0005975;left_Gene_name=PKA1H1_STAND_CTG000068;left_protein_match=;left_polypeptide_product=term%3Dchitinase%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:PKNH_0100800.1,Dbxref=UniProt:Q76EB4,EC:3.2.1.14,Ontology_term=GO:0004553,GO:0005975;right_Gene_name=PKA1H1_STAND_CTG000068;right_protein_match=;right_polypeptide_product=term%3Dchitinase%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:PKNH_0100800.1,Dbxref=UniProt:Q76EB4,EC:3.2.1.14,Ontology_term=GO:0004553,GO:0005975", for input source: /home/doresegu/scratch/private/PublicationAnal/AnnotateVCF/sks047ONT.vcf"

So I used vcf-validator (from vcftools) to check and I got this type of error: seems like it supports the IGV error for the same isolate "Broken VCF header, no column names? at /mnt/shared/scratch/doresegu/apps/conda/envs/vcfanno/lib/site_perl/5.26.2/Vcf.pm line 172, <__ANONIO__> line 2. Vcf::throw(Vcf4_1=HASH(0x558de1de9320), "Broken VCF header, no column names?") called at /mnt/shared/scratch/doresegu/apps/conda/envs/vcfanno/lib/site_perl/5.26.2/Vcf.pm line 867 VcfReader::_read_column_names(Vcf4_1=HASH(0x558de1de9320)) called at /mnt/shared/scratch/doresegu/apps/conda/envs/vcfanno/lib/site_perl/5.26.2/Vcf.pm line 602 VcfReader::parse_header(Vcf4_1=HASH(0x558de1de9320)) called at /mnt/shared/scratch/doresegu/apps/conda/envs/vcfanno/lib/site_perl/5.26.2/Vcf.pm line 2557 VcfReader::run_validation(Vcf4_1=HASH(0x558de1de9320)) called at /mnt/shared/scratch/doresegu/apps/conda/envs/vcfanno/bin/vcf-validator line 60 main::do_validation(HASH(0x558de1818220)) called at /mnt/shared/scratch/doresegu/apps/conda/envs/vcfanno/bin/vcf-validator line 14"

**I check with the untouched query file and it loaded onto IGV fine. I also attempted to find a vcf fixing tool but can't seem to find anything. Do you have an idea of why the whitespace is being added and why its saying that the CHROM column is missing even though it is present? The header and first 20 lines of the VCF are also attached.

Hopefully I have given you enough to troubleshoot? Please and thanks**

gff_sv.lua.txt gff_sv.conf.txt portion.vcf.txt Cultured_sorted.gff3.gz

If you have encountered an error, please include:

brentp commented 3 years ago

I'm not sure what's the problem I did this:

$ vcfanno -lua gff_sv.lua.txt gff_sv.conf.txt  portion.vcf.txt  > o.vcf

=============================================
vcfanno version 0.3.2 [built with go1.13.1]

see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:115: found 3 sources from 1 files
vcfanno.go:156: falling back to non-bgzip
vcfanno.go:248: annotated 30 variants in 0.03 seconds (1157.7 / second)
(base) brentp@lorax:~/Downloads$ bcftools view o.vcf > /dev/null 
[W::vcf_parse] Contig 'PKA1H1_STAND_01' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'PKA1H1_STAND_02' is not defined in the header. (Quick workaround: index the file with tabix.)
(base) brentp@lorax:~/Downloads$ grep CHROM o.vcf 
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ../../sks047VsPkA1H1/Real50_Sks047VsPkA1H1.Assemblytics_structural_variants.bed

you are pulling fields from the GFF that have spaces that might be causing the problem. If you want to remove those, you can add that to your lua code.

damioresegun commented 3 years ago

Yeah I think its because there are spaces in the field of the GFF. Is there a way to remove/substitute the whitespace using lua?

brentp commented 3 years ago

I'd just use gsub: https://stackoverflow.com/a/10460780

damioresegun commented 3 years ago

Thanks! I'll give it a go and hopefully I get it. I've never used lua before so this will be somewhat of trial and error

brentp commented 3 years ago

I think you can just do:

keys[#keys+1] = k:gsub("%s+", "")
damioresegun commented 3 years ago

Haha I see! Thanks for this. I actually did: name = info:sub(e + 1) name = string.gsub(name, "%s+", "") s, e = string.find(name, ";")

This seemed to work.

I tried your suggestion and got the same results. I'll stick with yours though.

Thank you for the very rapid response. Especially on a friday! Closing this now