brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License

INFO fields with multiple values #7

Closed sigven closed 9 years ago

sigven commented 9 years ago

Hi,

I just played with your tool, great work:) Looking at the result from a test I did, annotating ~ 100,000 variants against 6-7 other VCF files, there were a few things that caught my attention:

1) If my annotation file had an INFO field with multiple values (i.e. "Number=.", where the values for each variant are comma-separated), I could not figure out which operation would retrieve the complete set of values. I tried 'uniq' and 'concat', but either way vcfanno seems to join the values with the pipe character ('|'). Would it be possible to get the same comma-separated values as are present in the annotation VCF file?

2) Would you consider adding the meta-information lines concerning the INFO fields of interest that you specify in the configuration file in the result VCF?

brentp commented 9 years ago

I'll have a look at 1. It's difficult to decide what to do there generally, but I agree that concat should keep the commas if possible.

For 2. There should be new header lines for each annotation that you add. Did you have something else in mind or is that not working for you?

sigven commented 9 years ago

Thanks. With respect to 2) I am afraid that did not work for me (vcfanno version 0.0.7 [built with go1.5beta3]), I see only my query header lines in the output, not the annotation headers.

brentp commented 9 years ago

ok. 2) is fixed on master (but it's currently not buildable without dev branch of some dependencies)

For 1), I'm looking into special-casing concat for multiple-value (comma-separated) fields. For operations like mean/max etc., it would have to pull out the numbers. But I'm still thinking about the best way to do this.

brentp commented 9 years ago

@sigven this should be resolved. I'm chasing down a few more things before 0.8 release, but if you're on 64 bit linux and could give this a try, that would be very helpful; here is the executable:

https://www.chpc.utah.edu/~u6000771/vcfanno_08

sigven commented 9 years ago

@brentp I made a test now, used ClinVar as a query VCF, and ran annotation against ExAC and DoCM.

A couple of notes:

1) I receive an error when I get to chromosome 8: index: no reference. I can't work out whether this is an error on my side or not.

2) I would really like the INFO headers coming from my annotations to be kept as they are; currently I lose the informative Description, as it is changed to "calculated by concat of overlapping values in field" etc.

3) INFO tags with multiple values (i.e. Number=.) are comma-separated in the original annotation VCFs. In the output produced by vcfanno_08, they appear in brackets separated by spaces, e.g.

in annotation source VCF: DOCM_DISEASE=chronic_myeloid_leukemia,acute_myeloid_leukemia;DOCM_PMIDS=23634996, 23656643

in output VCF from vcfanno_08: DOCM_DISEASE=[chronic_myeloid_leukemia acute_myeloid_leukemia];DOCM_PMIDS=[23634996 23656643]
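The two renderings above differ only in how a list of values is turned back into a string. The bracketed, space-separated form resembles a default list/slice printout, whereas VCF expects commas between the values of a multi-valued INFO field. A minimal Python sketch of the comma-joined form follows; it is not vcfanno's code, and `format_info_value` and `format_info` are hypothetical helper names used only for illustration.

```python
def format_info_value(values):
    """Render one INFO value; lists and tuples are joined with commas."""
    if isinstance(values, (list, tuple)):
        return ",".join(str(v) for v in values)
    return str(values)


def format_info(fields):
    """Render a mapping of INFO key -> value(s) as a VCF INFO string."""
    return ";".join(
        f"{key}={format_info_value(value)}" for key, value in fields.items()
    )


if __name__ == "__main__":
    # Values taken from the example in this thread.
    info = {
        "DOCM_DISEASE": ["chronic_myeloid_leukemia", "acute_myeloid_leukemia"],
        "DOCM_PMIDS": [23634996, 23656643],
    }
    print(format_info(info))
```

Printing `str(values)` directly for a list would reintroduce bracketed output, which is why the list case is handled explicitly.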

brentp commented 9 years ago

I'm trying to write a test for this. Can you share your conf file?

brentp commented 9 years ago

OK. I found what you mean. I've made that part of the code less fragile and fixed the issue you describe.

For the header, what you're requesting is different than what I had in mind, but I'll have a look.
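For readers who need the original Description preserved, one possible workaround is to rewrite the header lines outside vcfanno. The sketch below copies an annotation file's `##INFO` line under a new ID while keeping Number, Type and Description unchanged. It is an illustration only: `rename_info_header` is a hypothetical helper, the example line is made up, and the pattern assumes the common key order ID, Number, Type, Description with nothing after Description.

```python
import re

# Assumes the key order ID, Number, Type, Description and no trailing keys.
INFO_RE = re.compile(
    r'^##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="(.*)">$'
)


def rename_info_header(line, new_id):
    """Return the INFO header line with its ID replaced and all else kept."""
    match = INFO_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognised INFO header line: {line!r}")
    _old_id, number, value_type, description = match.groups()
    return (
        f'##INFO=<ID={new_id},Number={number},Type={value_type},'
        f'Description="{description}">'
    )


if __name__ == "__main__":
    # Hypothetical annotation header, not copied from a real file.
    source = '##INFO=<ID=DISEASE,Number=.,Type=String,Description="Disease name, free text">'
    print(rename_info_header(source, "DOCM_DISEASE"))
```

This only makes sense for operations that pass values through unchanged; for operations such as mean or max the original Number and Type would no longer describe the output.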

brentp commented 9 years ago

An executable that fixes the multiple values problem is here: https://www.chpc.utah.edu/~u6000771/vcfanno_08a1

sigven commented 9 years ago

OK, just tested vcfanno_08a1. The use of comma instead of brackets now works. Thanks!

The index error still puzzles me, though. Do you believe that's an error wrt. my query VCF? Variant that fails is chr8:g.1712049C>T (first variant on chromosome 8 in my query VCF, which from what I can judge is a valid variant).

I understand that the header issue is not as straightforward as I imagined, taking all the various operations you offer into consideration. I can probably make a workaround so that it suits my needs.

brentp commented 9 years ago

I temporarily overlooked your index error. It could be that chr8 does not exist in one of your annotation files. Obviously, it should fail on that. I'll have a fix in a few hours.

The h

brentp commented 9 years ago

... I mean "should not fail on that" ...
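The intended behaviour (a query on a chromosome that is absent from an annotation file returns no hits instead of failing) can be sketched with a toy in-memory index. This is not how vcfanno or tabix is implemented; `IntervalIndex` is a hypothetical stand-in, and intervals are treated as half-open.

```python
from collections import defaultdict


class IntervalIndex:
    """Toy per-chromosome interval index (stand-in for a real tabix index)."""

    def __init__(self, records):
        # records: iterable of (chrom, start, end, value), half-open intervals
        self._by_chrom = defaultdict(list)
        for chrom, start, end, value in records:
            self._by_chrom[chrom].append((start, end, value))

    def query(self, chrom, start, end):
        """Return values overlapping [start, end); empty list if chrom is unknown."""
        hits = []
        # .get() returns [] for a missing chromosome without raising.
        for s, e, value in self._by_chrom.get(chrom, []):
            if s < end and start < e:
                hits.append(value)
        return hits


if __name__ == "__main__":
    index = IntervalIndex([("7", 100, 200, "a")])
    print(index.query("7", 150, 160))
    print(index.query("8", 1712048, 1712049))  # chromosome not in the index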

brentp commented 9 years ago

updated binary here: http://home.chpc.utah.edu/~u6000771/vcfanno_08a2

brentp commented 9 years ago

that fixes the index not found problem, still thinking about the header.

brentp commented 9 years ago

did you have a chance to check http://home.chpc.utah.edu/~u6000771/vcfanno_08a2 ? I'd like to release 0.8, but it has a lot of new changes so it'd be good to get your feedback.

sigven commented 9 years ago

Hmm.. after the header gets printed, I am getting an error which is not too informative: 2015/10/21 08:33:21 gzip: invalid header

I suspect one of my VCF files has a problem, but it's hard to assess what is wrong.

brentp commented 9 years ago

with the new version of vcfanno, everything has to be bgzipped and tabixed. I'll look into the message.
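A quick way to see which input is triggering "gzip: invalid header" is to inspect the first bytes of each file. The Python sketch below distinguishes BGZF (the format written by bgzip) from plain gzip and from uncompressed data by looking for the gzip magic bytes and the `BC` extra subfield that BGZF blocks carry. `classify_compression` is a hypothetical helper, not part of vcfanno, and it checks only the first block.

```python
import struct


def classify_compression(path):
    """Return 'bgzf', 'gzip', or 'other' based on the first block header."""
    with open(path, "rb") as handle:
        header = handle.read(12)
        # gzip members start with the magic bytes 0x1f 0x8b.
        if len(header) < 12 or header[:2] != b"\x1f\x8b":
            return "other"
        flags = header[3]
        if not flags & 0x04:  # FEXTRA flag unset: no extra field, so not BGZF
            return "gzip"
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)

    # Walk the extra subfields looking for SI1='B' (66), SI2='C' (67), SLEN=2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if si1 == 66 and si2 == 67 and slen == 2:
            return "bgzf"
        pos += 4 + slen
    return "gzip"


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, classify_compression(name))
```

Files reported as 'gzip' or 'other' would need to be recompressed with bgzip and indexed with tabix before use; the check says nothing about whether the matching index file exists.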

brentp commented 9 years ago

I just tagged a new release here: https://github.com/brentp/vcfanno/releases/tag/v0.0.8 that has a better error message for the case you describe.

sigven commented 9 years ago

great work @brentp ! Works very well now.

Now I am only having trouble with one VCF file (runtime error). I suspect that it has to do with the size of an INFO tag value (these often exceed 100 characters). Is there a limitation for this in your vcf reader?

panic: runtime error: index out of range

goroutine 25 [running]:
github.com/brentp/vcfgo.(*Reader).Parse(0x1985ac20, 0x1bc49500, 0x7, 0xa, 0x0, 0x0)
	/usr/local/src/gocode/src/github.com/brentp/vcfgo/reader.go:202 +0x6aa

brentp commented 9 years ago

can you send the full traceback?

brentp commented 9 years ago

By the line number, that should only happen if your vcf has too few fields for a given line.
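To locate such lines before running an annotation, a VCF can be scanned for records with fewer than the eight fixed tab-separated columns (CHROM through INFO). The following Python sketch is an independent check, not vcfgo's parsing logic; `find_short_lines` is a hypothetical helper name.

```python
def find_short_lines(lines, min_fields=8):
    """Yield (line_number, n_fields) for data lines with too few columns."""
    for number, line in enumerate(lines, start=1):
        # Skip header lines (starting with '#') and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        n_fields = len(line.rstrip("\r\n").split("\t"))
        if n_fields < min_fields:
            yield number, n_fields


if __name__ == "__main__":
    import gzip
    import sys

    # BGZF files are valid multi-member gzip, so gzip.open can read them.
    with gzip.open(sys.argv[1], "rt") as handle:
        for number, n_fields in find_short_lines(handle):
            print(f"line {number}: only {n_fields} fields")
```

Line numbers are 1-based and count header lines, so they can be matched directly against the file.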

brentp commented 9 years ago

I put a binary here: http://home.chpc.utah.edu/~u6000771/vcfanno_081 that will output a more informative error message for the line that's causing the error.

sigven commented 9 years ago

My bad. One of my annotation VCF files had inherent format errors.

brentp commented 9 years ago

no problem. I want it to have informative messages even when it borks... I'll close for now. Let me know of any other issues.