brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License
357 stars 55 forks source link

Missing annotations #94

Closed Thatguy027 closed 5 years ago

Thatguy027 commented 6 years ago

Hi there,

I've added all the requested information and files below.

I am not getting an error, but one of the annotations I am trying to add to my VCF is not getting appended to the INFO filed. I see "AA" appear in the VCF header, which leads me to believe that vcfanno is recognizing the input:

##INFO=<ID=AA,Number=1,Type=String,Description="calculated by self of overlapping values in column 4 from ANC.bed.gz">

The actual configuration file contains more annotations, all of which are successfully being appended to the INFO filed of the VCF.

I'm not exactly sure why this particular bed file is not working. I am trying to append an ancestor allele to my VCF. The bed file I am using to annotate the VCF has a line for every position in the VCF file.

Thanks!

If you have encountered an error, please include:

see: https://github.com/brentp/vcfanno

vcfanno.go:115: found 11 sources from 9 files bix.go:200: chromosome MtDNA not found in /projects/b1059/projects/Stefan/CePopGen-nf/input_files/annotations/gquarted.bed.gz bix.go:200: chromosome MtDNA not found in /projects/b1059/projects/Stefan/CePopGen-nf/input_files/annotations/TF_bindingsite_paper.bed.gz bix.go:200: chromosome MtDNA not found in /projects/b1059/projects/Stefan/CePopGen-nf/input_files/annotations/promoter.bed.gz bix.go:200: chromosome MtDNA not found in /projects/b1059/projects/Stefan/CePopGen-nf/input_files/annotations/histone_binding.bed.gz vcfanno.go:241: annotated 29141 variants in 17.30 seconds (1684.0 / second)

] the full error message

brentp commented 6 years ago

Hi, can you show an example of a variant in your VCF that should have been annotated with an entry in ANC.bed.gz (along with that entry)?

It looks like your ANC.bed.gz file has always REF == ALT, so it probably does not match anything in your VCF for that reason.

Thatguy027 commented 6 years ago

Hi,

Thanks for the quick response.

I originally set up the ANC bed file to only have four columns, the fourth column as a diploid genotype (e.g. "T/T"). I thought this might be causing the issues so I split out the diploid genotype to two columns and only used the fourth for annotations. I was under the impression that vcfanno did not attempt to match alleles when annotating with a bed file, is that not the case?

Here are the VCF sites corresponding to the test set:

bcftools query --regions I:3731-28123 -f '%CHROM\t%POS\t%REF\t%ALT\n' input_files/330_TEST.vcf.gz
I   3731    T   A
I   5317    T   A
I   11075   G   A
I   14398   A   G
I   15196   C   T
I   15629   G   A
I   22510   G   A
I   24269   A   T
I   27584   A   T
I   28123   C   G

And here is ancestor at corresponding sites

bcftools query --regions I:3731-28123 --samples XZ1516 -f '%CHROM\t%POS\t%END\t[%TGT\t%GT]\n' input_files/330_TEST.vcf.gz
I   3731    3731    T/T 0/0
I   5317    5317    T/T 0/0
I   11075   11075   G/G 0/0
I   14398   14398   A/A 0/0
I   15196   15196   C/C 0/0
I   15629   15629   G/G 0/0
I   22510   22510   G/G 0/0
I   24269   24269   A/A 0/0
I   27584   27584   A/A 0/0
I   28123   28123   C/C 0/0

S

brentp commented 6 years ago

oh, I think you'll have to add 1 to the end (or subtract 0 from the start, not sure if it is 0 or 1 -based) as vcfanno does not handle 0 length intervals. I did not notice this the first time around

Thatguy027 commented 6 years ago

I thought that might be the case, but was worried about MNPs throwing off the annotations. The VCF is 1-based, does that mean I should add 1 to the end pos?

brentp commented 6 years ago

that depends what bcftools outputs in your case. I suspect you should subtract 1 from the start and then it will be proper bed format.

Thatguy027 commented 6 years ago

Great thanks. I'll look into it to make sure.

PedroBarbosa commented 5 years ago

Dear @brentp ,

It is common to have this issue. Usually I annotate from several files, and constantly one or another annotation fail, despite the fact that the field gets added to the header. Always struggle to understand this behavior, it fails both with general tabular files (properly tabixed) and bed files.

Please find a simple example, where I try to annotate a VCF with a bed file. Conf file:

[[annotation]]
file="/mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/kipoi/kipoisplice/1_kipoisplice4_to_annotate.bed.gz"
columns=[6]
names=["kipoisplice_4"]
ops=["self"]

VCF file bed file

I get the following log:

=============================================
vcfanno version 0.3.1 [built with go1.11]

see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:115: found 1 sources from 1 files
vcfanno.go:143: using 2 worker threads to decompress query file
api.go:804: WARNING: using op 'self' when with Number='1' for '' from '/mnt/nfs/lobo/MCFONSECA-NFS/pedro.barbosa/kipoi/kipoisplice/1_kipoisplice4_to_annotate.bed.gz' can result in out-of-order values when the query is multi-allelic
api.go:805:        : this is not an issue if the query has been decomposed.
vcfanno.go:241: annotated 8 variants in 0.00 seconds (3748.2 / second)

Can you reproduce the error? Best, Pedro

brentp commented 5 years ago

hi, I just added this in a branch and everything seems fine to me. this is just a warning. and the variants are annotated correctly. since you have no header on the bed file (1_kipoisplice4_to_annotate.bed.gz) vcfanno can't know that you have ref and alt fields in there. if you add a header with "ref" and "alt" for those columns, then it can do allele-specific annotation.

PedroBarbosa commented 5 years ago

hi, which version are you using ?

my output vcf lacks that annotation, only adds information in the VCF header. Even though i have no header in the bed file, it should do position-based annotation (all alleles split by ',' ), right ?

Indeed, when I add the header in the bed, it works, but I don't understand these exceptions. For instance, this following file gets properly annotated, and the bed file follows the same format (with no header). 1_deltaLogitPSI_to_annotateVCF.bed.gz

brentp commented 5 years ago

maybe you need to re-index your bed file? i am using version 0.3.2 but i doubt this is broken in any release.

PedroBarbosa commented 5 years ago

dear @brentp ,

It doesn't seem to be related with the index, or there is an issue with my own tabix version (1.2.1)

I transformed the same bed file into tab delimited with a header. Indexed it with tabix -s1 -b2 and tried to annotate the 5th column.. I get empty VCF both at 0.3.1 and 0.3.2 versions. Could you test with my tab delimited and index files (changed tbi to txt to allow the upload)? 1_kipoisplice4_to_annotate.tsv.gz

1_kipoisplice4_to_annotate.tsv.gz.txt

Thank you in advance, Pedro

brentp commented 5 years ago

so, before I look into this new set. you agree that your first test-case works in vcfanno as expected? I want to make sure we're looking at the same problem.

PedroBarbosa commented 5 years ago

I agree, with the header added it works great

brentp commented 5 years ago

ok. so now, the problem is that you've changed to .tsv, indexed with tabix and it doesn't annotate as expected? can you make sure that you use tabix --zero-based if your annotations have bed-like (0-based, half-open) intervals? I think that with anything but .bed.gz, tabix will assume you have 1-based start positions.

PedroBarbosa commented 5 years ago

This 'tsv' was the original file (1-based), i changed it to bed to find a way to get it working. Nevertheless, by indexing with --zero-based, even though i know the tab file is 1-based, it doesn't produce any annotation.

PedroBarbosa commented 5 years ago

I think i found the problem. I wasn't setting the -e in tabix. Apparently i need to tabix -s1 -b2 -e2 in my tab delimited files.

brentp commented 5 years ago

glad to hear it. i was out of ideas.