chapmanb / bcbio.variation

Toolkit to analyze genomic variation data, built on the GATK with Clojure

Does not recognize "R" format in VCF header #33

Closed by huguesfontenelle 6 years ago

huguesfontenelle commented 6 years ago

The VCF 4.2 spec allows the value R for the Number attribute of a FORMAT header line. In the header it is defined as:

This input string crashes bcbio.variation though:

Caused by: java.lang.NumberFormatException: For input string: "R"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:492)
    at java.lang.Integer.valueOf(Integer.java:582)
    at htsjdk.variant.vcf.VCFCompoundHeaderLine.<init>(VCFCompoundHeaderLine.java:171)
    at htsjdk.variant.vcf.VCFFormatHeaderLine.<init>(VCFFormatHeaderLine.java:49)
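
For context, the trace shows the Number attribute being passed straight to an integer parse, which cannot handle a symbolic value. Below is a minimal Python sketch of a parser that checks for the symbolic values defined in VCF 4.2 before falling back to an integer. It is not htsjdk's code; `parse_number` and the description strings are hypothetical and only illustrate the idea.

```python
def parse_number(value):
    """Interpret the Number attribute of a VCF INFO/FORMAT header line.

    Hypothetical helper: returns a description string for the symbolic
    values defined in VCF 4.2, or an int for a fixed count.
    """
    symbolic = {
        ".": "unknown or unbounded number of values",
        "A": "one value per alternate allele",
        "R": "one value per allele, including the reference",
        "G": "one value per possible genotype",
    }
    if value in symbolic:
        return symbolic[value]
    # Anything else must be a plain integer count; int() raises ValueError
    # otherwise, which mirrors the NumberFormatException in the trace above.
    return int(value)
```

A parser that skips the symbolic check and calls the integer conversion directly fails on "R" in the same way the trace does.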

chapmanb commented 6 years ago

Hugues; Sorry about the issue. bcbio.variation uses an older version of GATK which is not up to date with all of the new 4.2-isms. What functionality in bcbio.variation are you using? We've migrated away from some of these tools to new approaches, so I might be able to suggest a workaround for what you're doing. Thanks much.

huguesfontenelle commented 6 years ago

Hi Brad, I'm using variant-compare in a regression test. I am aware that older GATK versions (i.e. the 3.3 that we're still using) do not support it, because a coworker just patched the output of the variant caller with something along the lines of sed -i 's/ID=AD,Number=\./ID=AD,Number=R/g' $VCF. That change broke the regression tests, which use bcbio.variation. Since I wanted to release my pipeline, I reversed the sed substitution. Dirty tricks, you see.
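
The reverse substitution described above could also be sketched in Python, touching only header lines so that data records are never altered. This is an illustration under assumptions, not the script actually used: `downgrade_ad_number` is a hypothetical name, and like the sed command it matches any header line containing ID=AD, whether INFO or FORMAT.

```python
import re

# Matches the AD definition after it has been patched to Number=R.
_AD_NUMBER_R = re.compile(r"ID=AD,Number=R")


def downgrade_ad_number(lines):
    """Yield VCF lines with ID=AD,Number=R rewritten to ID=AD,Number=.

    Hypothetical helper: only meta-information lines (starting with "##")
    are rewritten; the column header and data records pass through as-is.
    """
    for line in lines:
        if line.startswith("##"):
            line = _AD_NUMBER_R.sub("ID=AD,Number=.", line)
        yield line
```

It takes any iterable of lines, for example an open file handle on an uncompressed VCF, and the result can be written to a new file before running the comparison.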

It makes sense that bcbio.variation is compatible with the GATK it was designed to support. I guess that I'll let it be for the time being, but soon we'll be moving to GATK4. Which comparison tool do you suggest then? Thank you!

chapmanb commented 6 years ago

Hugues; Thanks for the background. I've used that sed trick as well, so I understand completely.

For variant comparisons, I'd recommend using either rtg vcfeval (https://github.com/RealTimeGenomics/rtg-tools) or hap.py (https://github.com/Illumina/hap.py) which do a much better job than bcbio.variation. They do local resolution of haplotypes for better comparisons in tricky regions with multiple variants. We've moved to using those over what was implemented in bcbio.variation.

Hope this works for what you need. Thanks again for the discussion.

huguesfontenelle commented 6 years ago

Excellent. Thank you!