cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
283 stars 49 forks source link

allel.read_vcf RuntimeError: VCF file is missing mandatory header line ("#CHROM...") #376

Open nathanvranken opened 2 years ago

nathanvranken commented 2 years ago

When trying to read some data from a vcf file of simulated data using msprime, I keep running into the same error.

Here is the head of the vcf file (20220112_msprime_length10000000.0_split100000.0_gene_flow100.0_chr2.vcf.gz):

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##source=tskit 0.2.3
##contig=<ID=2,length=10000000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  S.0     S.1     S.2     S.3     S.4     I.0     I.1     I.2     I.3     I.4     M.0     M.1     M.2     M.3     M.4     C.0     C.1     C.2     C.3     C.4
2       785     .       0       1       .       PASS    .       GT      0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     1|1     1|1     1|1     1|1     1|1
2       1002    .       0       1       .       PASS    .       GT      0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0

When running the following code: allel.read_vcf(vcf, fields=['calldata/GT'], samples=None, alt_number=1, region='2:0-10000', tabix='tabix')

I consistently get the following error: RuntimeError: VCF file is missing mandatory header line ("#CHROM...")

Does anyone have an idea what might be causing this error?

hardingnj commented 2 years ago

Hi @nathanvranken ,

I'm able to copy and paste that into a text file, bgzip, index then load using allel.read_vcf without any problem.

Things to try:

mckeowr1 commented 2 years ago

I ran into the same issue and was able to resolve it by re-zipping with bcftools view $vcf -output-type z --output-file $new. I did notice that if I did not create a new .tbi and .cbi index it gave me the same error. It'd be interesting to know when the index files are used.

alimanfoo commented 2 years ago

Hi @nathanvranken, @mckeowr1, this may possibly be caused by having an index file (.tbi) which is older than the VCF file. In that case, tabix issues a warning to stderr, and then proceeds. However, currently scikit-allel sends stderr to stdout, so this warning gets parsed as if it was within the VCF file, and interrupts the usual parsing of headers. The offending line is here.

A workaround is to touch the index file so it's newer than the VCF.