brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License
365 stars 56 forks source link

Annotate with 1000 genome data #78

Closed maggie-fu closed 7 years ago

maggie-fu commented 7 years ago

Hi @brentp,

I am having problem annotating my query with a vcf from 1000 genomes. No annotation have been added when there should be many overlapping variants. There was no error message pertain to this problem.

The command was: ./vcfanno_linux64 conf_new.toml NA12878.LUMPY_sorted_filtered.vcf.gz > annotated_test.vcf

The conf was:

[[annotation]]
file="annotation/1000Gphase3DUP_chr.vcf.gz"
fields=["AF", "AFR_AF", "AMR_AF", "EAS_AF", "EUR_AF", "SAS_AF"]
names=["1000G_AF", "1000G_AFR_AF", "1000G_AMR_AF", "1000G_EAS_AF", "1000G_EUR_AF", "1000G_SAS_AF"]
ops=["self", "self", "self", "self", "self", "self"]

The query and annotation files are attached here file.zip

The full error message was:

=============================================
vcfanno version 0.2.8 [built with go1.8]

see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:115: found 6 sources from 1 files
vcfanno.go:143: using 2 worker threads to decompress query file
bix.go:200: chromosome chrM not found in annotation/1000Gphase3DUP_chr.vcf.gz
bix.go:200: chromosome chrY not found in annotation/1000Gphase3DUP_chr.vcf.gz
bix.go:200: chromosome chrY not found in annotation/1000Gphase3DUP_chr.vcf.gz
bix.go:200: chromosome chrY not found in annotation/1000Gphase3DUP_chr.vcf.gz
bix.go:200: chromosome chrY not found in annotation/1000Gphase3DUP_chr.vcf.gz
bix.go:200: chromosome chrY not found in annotation/1000Gphase3DUP_chr.vcf.gz
vcfanno.go:241: annotated 6667 variants in 1.28 seconds (5219.3 / second)
brentp commented 7 years ago

your vcf file has no header.

maggie-fu commented 7 years ago

I also tried a vcf with header, but it had the same problem. ./vcfanno_linux64 conf_new.toml NA12878.LUMPY_sorted_filtered.vcf.gz > annotated_test.vcf

conf:

[[annotation]]
file="annotation/ALL.wgs.mergedSV.v8.20130502.svs.genotypes_sort.vcf.gz"
fields=["AF", "AFR_AF", "AMR_AF", "EAS_AF", "EUR_AF", "SAS_AF"]
names=["1000G_AF", "1000G_AFR_AF", "1000G_AMR_AF", "1000G_EAS_AF", "1000G_EUR_AF", "1000G_SAS_AF"]
ops=["self", "self", "self", "self", "self", "self"]

Annotation vcf: file.zip

brentp commented 7 years ago

vcfanno matches on REF and ALT. by default. your vcfs have different REF and ALT values.

if you want to match on position and not on REF and ALT exact match then you can use vcfanno -permissive-overlap

maggie-fu commented 7 years ago

@brentp, thank you for your help. I understand now. However, when I tried to annotate with permissive overlap with op = 'by_alt' and Number = A, all the annotations looks like this 1000GAF_float=.;AFR_AF_float=.;AMR_AF_float=.;EAS_AF_float=.;EUR_AF_float=.;SAS_AF_float=. I also tried to set op to self but yielded the same result. It worked when I set op to first or max. Is permissive overlap incompatible with multi-allelic annotations?

brentp commented 7 years ago

by_alt, as the name and docs should indicate matches on alt. that's the entire point. if I can make the docs clearer on this, please indicate how. permissive overlap and by_alt won't make sense when used together.

maggie-fu commented 7 years ago

What I understood from reading the docs was that permissive-overlap flag eliminate the requirement of shared REF and ALT, when the by_alt op present the annotation in either comma or pipe- delimited way depending on whether it is multiple annos or alt. I thought the prior is a method for matching variant when the latter is a method for data extraction and annotation. They did not seem contradictory to me. I have limited knowledge in coding and that is probably why I did not know that their functions are linked.

As I mentioned before, I also tried the op "self" and still nothing was annotated. Since the annotation file like this:

REF | ALT         | QUAL | FILTER | INFO
G   | <CN2>       | .    | PASS   | AC=64;AF=0.0127795;AFR_AF=0.0015;AMR_AF=0;AN=5008;CIEND=-150,150;CIPOS=-150,150;CS=DUP_delly;EAS_AF=0.0595;END=850204;EUR_AF=0.001;IMPRECISE;NS=2504;SAS_AF=0.001;SITEPOST=1;SVTYPE=DUP
C   | <CN0>,<CN2> | .    | PASS   | AC=3,206;AF=0.00059904,0.0411342;AFR_AF=0,0.0303;AMR_AF=0.0014,0.0259;AN=5008;CS=DUP_gs;EAS_AF=0.001,0.0615;END=755966;EUR_AF=0.001,0.0417;NS=2504;SAS_AF=0,0.045;SITEPOST=1;SVTYPE=CNV

For a query variant that overlap with both variants, I was hoping to get an annotation like: SVTYPE=DUP | CNV; AF=0.0127795 | 0.00059904, 0.0411342 I am trying to annotate structural variants, copy-number variants specifically in this case. REF is not important, so I guess I need to use permissive-overlap for sure. I just want to know how to annotate my query here? If vcfanno does not have this function yet, can I request this feature? a feature that allows annotations for multiple ALT and multiple annotations? I am sorry that I do not have an idea as to how though.

Additionally, I have another feature request. After running vcfanno, the report shows how many variants are annotated: vcfanno.go:241: annotated 6667 variants in 2.56 seconds (2607.5 / second) But it is not very helpful when the wrong parameters are set and nothing is annotated. Can you let the report include how many annotations are actually made?

brentp commented 7 years ago

with permissive overlap, just use ops=["concat"] or uniq.

brentp commented 7 years ago

it's not feasible given the current architecture to report how many variants were modified.

maggie-fu commented 7 years ago

Thank you very much. That worked perfectly!

Understood. Okay.