brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License

flag processing issue #86

Closed jgrady-omico closed 5 years ago

jgrady-omico commented 6 years ago

I've encountered an issue annotating a VCF using the COSMIC coding mutations VCF: I can't get the 'SNP' flag to annotate correctly. It's the only flag I've tried to annotate, but I can't seem to get it to work as I would expect.

Here is an extract of the COSMIC file, with the header and a specific variant in KRAS. There are two entries for it, and neither of them has the SNP flag set.

##fileformat=VCFv4.1
##source=COSMICv84
##reference=GRCh37
##fileDate=20180213
##comment="Missing nucleotide details indicate ambiguity during curation process"
##comment="URL stub for COSM ID field (use numeric portion of ID)='http://grch37-cancer.sanger.ac.uk/cosmic/mutation/overview?id='"
##comment="REF and ALT sequences are both forward strand"
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=

12 25398284 COSM1135366 C T . . GENE=KRAS_ENST00000256078;STRAND=-;CDS=c.35G>A;AA=p.G12D;CNT=1091
12 25398284 COSM521 C T . . GENE=KRAS;STRAND=-;CDS=c.35G>A;AA=p.G12D;CNT=14473

If I used the following conf file:

[[annotation]]
file="CosmicCodingMuts.vcf.gz"
fields = ["ID", "CNT", "SNP"]
ops=["concat", "max", "flag"]
names=["cosm", "cosm_cnt", "cosm_snp"]

The annotation for this variant is: cosm=COSM1135366,COSM521;cosm_cnt=14473;cosm_snp

So, although the flag is not set in the annotation file, it gets applied to the mutations. In fact, it gets applied to every line from the annotation vcf, ignoring whether the flag is actually set on the line or not.

I tried this as well (as I wasn't sure what would happen if there were two conflicting lines for the same variant, one with the flag set and one without):

[[annotation]]
file="CosmicCodingMuts.vcf.gz"
fields = ["ID", "CNT", "SNP"]
ops=["concat", "max", "count"]
names=["cosm", "cosm_cnt", "cosm_snp"]

This results in the following: cosm=COSM1135366,COSM521;cosm_cnt=14473;cosm_snp=2

Here, I would expect the answer to be cosm_snp=0, or blank. For every mutation, the cosm_snp is set to the number of lines in the annotation file for that variant, irrespective of whether the flag is set on those lines.
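To make the mismatch concrete, here is a toy Python model of what the output above suggests is happening, next to what I expected. It is only a sketch inferred from the results shown, not vcfanno's actual code; the function names (`parse_info`, `observed_flag`, `observed_count`, `field_aware_count`) are made up for illustration.

```python
# Toy model of the behaviour reported above. This is NOT vcfanno's code;
# every function name here is hypothetical and exists only for illustration.
from typing import Dict, List


def parse_info(info: str) -> Dict[str, object]:
    """Split a VCF INFO string; bare keys (flags) map to True."""
    parsed: Dict[str, object] = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


# INFO columns of the two COSMIC records that overlap the KRAS variant.
overlaps: List[str] = [
    "GENE=KRAS_ENST00000256078;STRAND=-;CDS=c.35G>A;AA=p.G12D;CNT=1091",
    "GENE=KRAS;STRAND=-;CDS=c.35G>A;AA=p.G12D;CNT=14473",
]


def observed_flag(records: List[str]) -> bool:
    """What the output suggests: set whenever any record overlaps."""
    return len(records) > 0


def observed_count(records: List[str]) -> int:
    """What the output suggests: the number of overlapping records."""
    return len(records)


def field_aware_count(records: List[str], field: str) -> int:
    """What I expected: count only records that actually carry the field."""
    return sum(1 for record in records if field in parse_info(record))


print(observed_flag(overlaps))             # matches cosm_snp being set
print(observed_count(overlaps))            # matches cosm_snp=2
print(field_aware_count(overlaps, "SNP"))  # the 0 I expected
```

In this model the observed results depend only on how many annotation lines overlap the variant, which matches what I see for every mutation.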

I've also tried uniq and self; neither returns different results for annotation lines with the flag set versus lines without it.

It feels like a bug... but I may be doing something wrong, hopefully you can help!

Also, if I apply 'count' to a flag (which feels like a meaningful thing to do, since I'm not sure how 'flag' would work for multiple lines), the VCF header type is Number=0,Type=Float. It feels like this should be changed to Number=1,Type=Float.

Thanks,

John

brentp commented 6 years ago

I'll have to delay looking at this for a couple of weeks, but it's on my radar. Thanks for reporting.

jgrady-omico commented 6 years ago

No worries, thanks Brent.

brentp commented 5 years ago

Hi, this is working as expected. flag lets you record the presence of the variant in another file; the field that you pull is just a placeholder.
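Given that, one possible workaround (a sketch, not an official vcfanno recipe) is to pre-filter the annotation VCF down to the records that really carry the SNP flag, and then use flag against the filtered file. The helper names and the output path below are hypothetical, and the sketch assumes a standard tab-separated VCF with INFO in the eighth column.

```python
# Sketch of a pre-filtering workaround. Helper names and the output path are
# hypothetical; assumes a tab-separated VCF with INFO in column 8.
import gzip
from typing import Iterable, Iterator


def has_info_flag(line: str, flag: str) -> bool:
    """True if the record's INFO column contains `flag` as a bare key."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 8:
        return False
    return flag in columns[7].split(";")


def keep_flagged(lines: Iterable[str], flag: str = "SNP") -> Iterator[str]:
    """Yield header lines plus only the records carrying `flag`."""
    for line in lines:
        if line.startswith("#") or has_info_flag(line, flag):
            yield line


def filter_vcf(src: str, dst: str, flag: str = "SNP") -> None:
    """Read a gzip/bgzip-compressed VCF and write a plain-text filtered copy."""
    with gzip.open(src, "rt") as fin, open(dst, "w") as fout:
        fout.writelines(keep_flagged(fin, flag))


# Example call (output file name is an assumption):
# filter_vcf("CosmicCodingMuts.vcf.gz", "CosmicCodingMuts.snp_only.vcf")
```

The filtered file would still need to be bgzipped and tabix-indexed before vcfanno can read it. With flag pointed at that file, cosm_snp should then appear only for variants whose COSMIC record has SNP set.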