Open RoanKanninga opened 6 years ago
I see what you mean. It should use CADD,Number=A. I would simply change your whole.vcf.gz to use Number=A for raw. I would accept a PR to make vcfanno detect a case like this (though I'm not sure how, because the header is written before any variants are observed), but I do not intend to fix it myself since this is an edge case that is easily avoided and/or fixed.
The other "fix" would be to simply set Number=A whenever op=self, but I don't like that solution either. I think what's there now is a good trade-off between usability (getting a single number in most cases) and completeness. I am open to hearing other ideas.
What about printing a warning when multiple values are written while the (previously written) header says Number=1? I ask mostly because of the other point raised in this issue (the "scores have been flipped" part): when the VCF with the annotations has CADD,Number=1, the variants get annotated as CADD_SCALED=24,0.6;CADD=-0.3,3, but when you change it to CADD,Number=A, the CADD scores get flipped to the (correct) CADD_SCALED=24,0.6;CADD=3,-0.3 (-0.3 and 3 are in a different order). A warning might be nice, since having the scores not in the same order as the ALTs is probably not an expected outcome of having Number=1.
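Until such a warning exists, the mismatch can be caught after annotation. The following is a minimal sketch, not part of vcfanno: `count_mismatches` and its arguments are hypothetical names, and it assumes the INFO column and the declared Number values have already been pulled out of the VCF.

```python
def count_mismatches(info, declared, n_alt):
    """Return the INFO keys whose value count disagrees with the declared Number.

    info     -- raw INFO string, e.g. "CADD_SCALED=24,0.6;CADD=-0.3,3"
    declared -- mapping of INFO key to its header Number ("1", "A", "R", ".")
    n_alt    -- number of ALT alleles on the record
    """
    bad = []
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if not sep:
            continue  # flag-style entry without a value
        number = declared.get(key, ".")
        if number == "A":
            expected = n_alt
        elif number == "R":
            expected = n_alt + 1
        elif number.isdigit():
            expected = int(number)
        else:
            continue  # "." or "G": not checked in this sketch
        if len(value.split(",")) != expected:
            bad.append(key)
    return bad


# Header from this issue: CADD,Number=1 and CADD_SCALED,Number=A; ALT is GA,T.
declared = {"CADD": "1", "CADD_SCALED": "A"}
print(count_mismatches("CADD_SCALED=24,0.6;CADD=-0.3,3", declared, n_alt=2))
```

With the header from this issue it reports CADD, because two values were written under Number=1.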
Somehow I missed that the alleles were flipped. That is indeed a bug. I'm looking into this and the other issue raised by @RoanKanninga now.
After much messing about, this is going to have to be indicated as a WARNING. I thought I could magically adjust the order, but this changes the behavior in cases where Number=1 is actually what is desired. I'll push a fix shortly once I have the other issue resolved.
Hi Brent, thanks for all the work. My example is probably a bit confusing, since the CADD has 2 different scores for the 2 ALT alleles.
What I really want is just one value for my Number=1 field. My real case is this:
Header: AN,Number=1
1 123456 . A C,G AN=24,24
I have an INFO field called AN that should always be Number=1, since it is the total number of alleles. But what I now see in my data is e.g. AN=24,24 instead of AN=24. My annotation source file contains (like the CADD annotations file) an AN value for each ALT allele, like this:
1 123456 . A C AN=24
1 123456 . A G AN=24
So this is my real problem. A downstream in-house tool we are using complains that 24,24 is not an INT (and it is correct, since it expects 1 value and not multiple). I can work around this by making the AN field Number=A, but that is suboptimal.
Can you use "first" instead of "self" in the ops field of the config file? That should grab only a single value instead of multiple.
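To make the difference concrete, here is a small sketch of what the two ops do to the AN values from the example above. It only mirrors the behaviour described in this thread and is not vcfanno's implementation. `reduce_values` and `format_info` are hypothetical helpers.

```python
def reduce_values(values, op):
    """Mimic the two ops discussed here on the values collected for one variant."""
    if op == "first":
        return list(values[:1])  # keep only the first overlapping value
    if op == "self":
        return list(values)      # keep every overlapping value
    raise ValueError("op not covered by this sketch: " + op)


def format_info(key, values):
    """Render values the way they appear in the INFO column."""
    return key + "=" + ",".join(str(v) for v in values)


# One AN=24 per annotation line (ALT C and ALT G at 1:123456).
an_values = [24, 24]
print(format_info("AN", reduce_values(an_values, "self")))   # AN=24,24
print(format_info("AN", reduce_values(an_values, "first")))  # AN=24
```

Since AN is the same on every annotation line for a position, taking the first value loses nothing and keeps the field consistent with Number=1.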
Some of this is addressed in the latest release.
This one is quite complex to explain, so I will start with an example. This is in my header:
CADD,Number=1
CADD_SCALED,Number=A
When I have a multiallelic variant, let's say:
1 208063100 rs5780411 G GA,T
I would expect CADD_SCALED to have two values and CADD only one. This is correct when my file with the CADD/CADD_SCALED scores contains this position only once. But when there are multiple lines with the same position and different ALT alleles (as with CADD, where you get a score for each ALT allele), it all goes wrong: although the header says CADD,Number=1, the CADD INFO field now has 2 values (one for each ALT allele), and the scores have been flipped, so the CADD score for ALT allele 1 now has the value of ALT allele 2 and vice versa.
I included: input (input.vcf), output (annotated.vcf), conf (conf.toml) and the annotations file (whole.vcf.gz + index): vcfAnno.tar.gz
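For reference, the ordering expected for per-ALT values can be sketched as follows. This is an illustration of the expectation, not vcfanno's code. `values_in_alt_order` is a hypothetical helper, and the order of the annotation lines is an assumption. The scores per allele are taken from the output this thread reports as correct (CADD_SCALED=24,0.6;CADD=3,-0.3 for ALT GA,T).

```python
def values_in_alt_order(query_alts, annotations, field):
    """Return one value per query ALT, in the order the ALTs appear on the record.

    annotations -- list of (alt, info_dict) pairs, one per annotation line
    """
    by_alt = {alt: info[field] for alt, info in annotations if field in info}
    # "." is the VCF placeholder for a missing value.
    return [by_alt.get(alt, ".") for alt in query_alts]


# Annotation lines listed in the opposite order to the query ALTs (assumed).
annotations = [
    ("T", {"CADD": -0.3, "CADD_SCALED": 0.6}),
    ("GA", {"CADD": 3, "CADD_SCALED": 24}),
]
query_alts = ["GA", "T"]  # 1 208063100 rs5780411 G GA,T

print(values_in_alt_order(query_alts, annotations, "CADD"))         # [3, -0.3]
print(values_in_alt_order(query_alts, annotations, "CADD_SCALED"))  # [24, 0.6]
```

Keying the lookup on the ALT allele rather than on the order of the annotation lines is what keeps both fields aligned with GA,T.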