Unable to load vcf file

cyrusmallon commented 1 year ago

Hello,

I've am trying to load a vcf file that was produced using breseq (v 0.37.1), but it fails to load with the function readVcf().

Here is the beginning of the vcf file that breseq produces (full file as .txt here original.txt) , which I have tried to load unsuccessfuly with readVcf():

##fileformat=VCFv4.1
##fileDate=20230205
##source=breseq_GD2VCF_converterter
##contig=<ID=NC_000913,length=4639675>
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AD,Number=1,Type=Float,Description="Allele Depth (avg read count)">
##INFO=<ID=DP,Number=1,Type=Float,Description="Total Depth (avg read count)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO 
NC_000913   23389   .   C   A   458.0   PASS    AF=1.0000;AD=123;DP=123
NC_000913   59663   .   G   T   353.4   PASS    AF=1.0000;AD=94;DP=94
NC_000913   129835  .   G   T   529.3   PASS    AF=1.0000;AD=142;DP=143
NC_000913   161588  .   C   A   475.5   PASS    AF=1.0000;AD=126;DP=127
NC_000913   177780  .   C   A   407.5   PASS    AF=1.0000;AD=107;DP=107
NC_000913   192607  .   G   T   465.1   PASS    AF=1.0000;AD=124;DP=124
NC_000913   192734  .   G   T   565.5   PASS    AF=1.0000;AD=148;DP=148

This is the error message I receive:

[E::bcf_hdr_parse_sample_line] Could not parse the "#CHROM.." line, either FORMAT is missing or spaces are present instead of tabs:
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO 

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'seqinfo': no 'header' line "#CHROM POS ID..."?

After receiving this error message, I simply used awk to create a new FORMAT column and populated that column with the string 'GT' for genotype. I also made sure everything was tab separated. Here is what the beginning of the vcf file looks like (full file as .txt here newfile1.txt):

##fileformat=VCFv4.1
##fileDate=20230205
##source=breseq_GD2VCF_converterter
##contig=<ID=NC_000913,length=4639675>
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AD,Number=1,Type=Float,Description="Allele Depth (avg read count)">
##INFO=<ID=DP,Number=1,Type=Float,Description="Total Depth (avg read count)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
NC_000913   23389   .   C   A   458.0   PASS    AF=1.0000;AD=123;DP=123 GT
NC_000913   59663   .   G   T   353.4   PASS    AF=1.0000;AD=94;DP=94   GT
NC_000913   129835  .   G   T   529.3   PASS    AF=1.0000;AD=142;DP=143 GT
NC_000913   161588  .   C   A   475.5   PASS    AF=1.0000;AD=126;DP=127 GT

But again I get the same error:

[E::bcf_hdr_parse_sample_line] Could not parse the "#CHROM.." line, either FORMAT is missing or spaces are present instead of tabs:
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument '

Then I saw this post, https://github.com/gerstung-lab/deepSNV/issues/17, which says that the FORMAT column needs to be followed by another colum. Therefore, I simply added a column I just called "EXTRA" and populated it with the text "COL". Here's what the beginning of the vcf looks like (full file as .txt here newfile2.txt):

##fileformat=VCFv4.1
##fileDate=20230205
##source=breseq_GD2VCF_converterter
##contig=<ID=NC_000913,length=4639675>
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AD,Number=1,Type=Float,Description="Allele Depth (avg read count)">
##INFO=<ID=DP,Number=1,Type=Float,Description="Total Depth (avg read count)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  EXTRA
NC_000913   23389   .   C   A   458.0   PASS    AF=1.0000;AD=123;DP=123 GT  COL
NC_000913   59663   .   G   T   353.4   PASS    AF=1.0000;AD=94;DP=94   GT  COL
NC_000913   129835  .   G   T   529.3   PASS    AF=1.0000;AD=142;DP=143 GT  COL
NC_000913   161588  .   C   A   475.5   PASS    AF=1.0000;AD=126;DP=127 GT  COL
NC_000913   177780  .   C   A   407.5   PASS    AF=1.0000;AD=107;DP=107 GT  COL
NC_000913   192607  .   G   T   465.1   PASS    AF=1.0000;AD=124;DP=124 GT  COL
NC_000913   192734  .   G   T   565.5   PASS    AF=1.0000;AD=148;DP=148 GT  COL

But again I get the same error:

[E::bcf_hdr_parse_sample_line] Could not parse the "#CHROM.." line, either FORMAT is missing or spaces are present instead of tabs:
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO 

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'seqinfo': no 'header' line "#CHROM POS ID..."?

Does anyone know why I can't upload these vcf files with readVcf()? Does something stick out in the format of the file? Or is there possibly some bug with readVcf() ?

Thank you.

vjcitn commented 1 year ago

you did a lot of work. I think the problem is a stray blank character at the end of the #CHROM line. Once I remove that readVcf works.

vjcitn commented 1 year ago

thanks for the thorough report.

vjcitn commented 1 year ago

How did I find it out? I ran the example of readVcf, then did a readLines on the example file and on your excerpt of the problematic file. I noticed that the #CHROM lines differed as noted. A sad lack of resilience in the parsing infrastructure.

cyrusmallon commented 1 year ago

Thank you so much for your quick reply and solution! I didn't know about the readLines() function, but I just ran it and see the same space you did! After taking the space away vcf now loads.

I will close the issue now.

Thanks again!

sandra-biology commented 1 year ago

you did a lot of work. I think the problem is a stray blank character at the end of the #CHROM line. Once I remove that readVcf works.

Hi, could you clarify what you mean by a stray blank character at the end of the #CHROM line? I'm having my own problems with bcftools not parsing header and think this might be my solution.

cyrusmallon commented 1 year ago

Hi,

If you're working with R you can use the readLines() command to look at all input lines of your vcf file, line by line (same thing as row by row). Your output should be something like this:

#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER

Notice that this is a tab separated file and there are no spaces anywhere, only the \t denoting the tab separation.

If you happen to see something like this:

#CHROM\tPOS\tID\tREF\tALT\tQUAL\t FILTER

Where there is a weird space somewhere, then it may be difficult to parse your vcf file. Also take note of the error messages. If I remember correctly, if there is an "INFO" column, then it must be proceeded by some columns with metadata (something to that effect). Also, if you look up the current vcf guidelines, the formats of vcf files are standardized. So perhaps your vcf file is somehow different than the standard format and for that reason it cannot be parsed/loaded.

cyrusmallon commented 1 year ago

I see you're working in python. In python you can use readlines()

Bioconductor / VariantAnnotation

Unable to load vcf file #69