brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License

Vcfanno does not use pipes to delimit multiple annotations for a single ALT allele #114

Open ptn24 opened 5 years ago

ptn24 commented 5 years ago

The by_alt operation should use pipes (perhaps this could be parameterized) to delimit multiple annotations for a single ALT allele. However, when adding BED annotations, vcfanno seems to use commas to delimit annotations:

root@job-FZzyJbj03gG7Y2bZGzK4GP39:/tmp# zcat chr1.vcf.gz 
##fileformat=VCFv4.2
##hailversion=0.2.9-8588a25687af
##contig=<ID=1,length=249250621,assembly=GRCh37>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10177   rs367896724     A       AC      .       .       .
root@job-FZzyJbj03gG7Y2bZGzK4GP39:/tmp# zcat ENCFF171LNJ.sorted.bed.gz
chr1    10135   10285   .       0       .       28      -1      -1      75
chr1    10175   10325   .       0       .       20.0    -1      -1      75
root@job-FZzyJbj03gG7Y2bZGzK4GP39:/tmp# cat by-alt.conf.toml 
[[annotation]]
names = [ "ENCFF171LNJ",]
file = "/tmp/ENCFF171LNJ.sorted.bed.gz"
columns = [ 7,]
ops = [ "by_alt",]
root@job-FZzyJbj03gG7Y2bZGzK4GP39:/tmp# vcfanno by-alt.conf.toml chr1.vcf.gz 

=============================================
vcfanno version 0.3.1 [built with go1.11]

see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:115: found 1 sources from 1 files
vcfanno.go:143: using 2 worker threads to decompress query file
##fileformat=VCFv4.2
##contig=<ID=1,length=249250621,assembly=GRCh37>
##INFO=<ID=ENCFF171LNJ,Number=A,Type=String,Description="calculated by by_alt of overlapping values in column 7 from /tmp/ENCFF171LNJ.sorted.bed.gz">
##hailversion=0.2.9-8588a25687af
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
1       10177   rs367896724     A       AC      .       .       ENCFF171LNJ=28,20.0
vcfanno.go:241: annotated 1 variants in 0.00 seconds (3213.9 / second)

Expected INFO to equal ENCFF171LNJ=28|20.0

brentp commented 5 years ago

This is an oversight, and therefore a deficiency in vcfanno, but it doesn't make sense to use by_alt on a BED file (which has no REF and ALT columns to indicate the exact allele).

ptn24 commented 5 years ago

That makes sense. If the INFO tag applies to the whole locus, though, would it be possible to make the metadata line for INFO/ENCFF171LNJ say Number=. (cf. https://samtools.github.io/hts-specs/VCFv4.2.pdf)? It could also be useful to add a line to the documentation and/or print a warning to stdout about BED annotations (just a thought).

Alternatively, what do you think about duplicating the annotations across ALT alleles when users pass in by_alt + BEDs? Not ideal, but users would have control.

brentp commented 5 years ago

I think it should probably be an error to use by_alt with a file that doesn't have REF and ALT columns. Why don't you use the concat op instead?

ptn24 commented 5 years ago

Good suggestion, will do
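
For reference, a minimal sketch of the config from the original report with the suggested concat op swapped in for by_alt. It reuses the same file path, name, and column shown above and has not been run against this data, so treat the comments as assumptions rather than confirmed behavior:

```toml
[[annotation]]
file = "/tmp/ENCFF171LNJ.sorted.bed.gz"
names = ["ENCFF171LNJ"]
columns = [7]
# concat collects the column-7 values from every overlapping BED interval
# into one delimited string for the locus, instead of treating them as
# per-ALT-allele values the way by_alt does.
ops = ["concat"]
```

With this op the two overlapping intervals should still yield a comma-delimited value such as 28,20.0, but the annotation is no longer presented as one value per ALT allele, which matches what a BED file can actually provide.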