brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License

vcfanno gzip IO related errors (race condition with multiple threads to bgzf.Reader) #64

Closed. chapmanb closed this issue 7 years ago

chapmanb commented 7 years ago

Brent; We've incorporated vcfanno into bcbio with a ton of success. It's been awesome to have general flexibility for annotation. Now that we're starting to test at scale we've been seeing intermittent issues with reading VCF files. Judging from the error messages these appear to be IO-related, and they aren't reproducible: the files themselves are fine and just re-running the same command works.

I've been trying to collect error cases, and the errors show up after this startup output:

vcfanno version 0.2.4 [built with go1.8]

vcfanno.go:115: found 1 sources from 1 files
vcfanno.go:143: using 2 worker threads to decompress query file
api.go:670: vcfanno: using ~2 workers per file

We then see errors and a failed command with these errors:

parallel.go:151: gzip: invalid header

or

parallel.go:151: short buffer

I know this is not a great report but I don't have much more to go on from my side. Do you know if there are ways we could make vcfanno more resilient to IO/read issues? Thanks for any pointers or ideas on how to tackle this.

brentp commented 7 years ago

How frequently do these occur? I have an idea about what to change. If I send a binary, could you test it and get a good idea of whether it's been resolved?

chapmanb commented 7 years ago

It's pretty infrequent and only under high load but we have a couple of ongoing projects where it happens more regularly (100s of samples on AWS EBS volumes). I can also just pull in a new version for bioconda and push to see if we still see them intermittently. Sorry to not have a reproducible case or anything useful. Thanks for thinking about this.

brentp commented 7 years ago

can you show the conf file you're using?

I added some decoration to the error messages. I'll make a new release ASAP, and then you should have more context on the error so I can dig further.

chapmanb commented 7 years ago

Brent; Brilliant, thanks for doing that. The configuration file where we've seen this most often is not very complex, just annotating with dbSNP:

[[annotation]]
file="/path/to/dbsnp.vcf.gz"
fields=["ID"]
names=["rs_ids"]
ops=["concat"]

Thanks again.

brentp commented 7 years ago

Hi Brad, I made a new release that has better error messages. Can you give it a try? That will help me to narrow it down.

https://github.com/brentp/vcfanno/releases/tag/v0.2.5

Also, if you can get a reasonably reproducible error, then you could try the vcfanno_linux64_race binary which would give more info. I'm assuming there's some sort of race condition going on but haven't been able to track it down. Running under the race binary will be > 10X slower, so don't use that in production.

brentp commented 7 years ago

scratch that. I just found the race. I'll remove that release and fix.

brentp commented 7 years ago

Brad, that is fixed in this release: https://github.com/brentp/vcfanno/releases/tag/v0.2.6

I'll leave this issue open, since we should be able to have parallel decompression. I couldn't track down the cause, so for now each file uses single-threaded decompression (vcfanno will still run chunks in parallel).

chapmanb commented 7 years ago

Brent; Thanks for identifying the underlying issue and saving the back and forth. I also bumped the bioconda package for this and will let you know if we spot anything else at all. Awesome work spotting the underlying issue so quickly.

brentp commented 7 years ago

This has also been fixed upstream in biogo/hts/bgzf, so the next release will restore the multiple decompression threads per annotation.