brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License
357 stars 55 forks source link

Annotating both ends of translocation #135

Open lconde-ucl opened 3 years ago

lconde-ucl commented 3 years ago

Hi,

I was wondering if it's possible to annotate both ends of a BND/TRA structural variant with vcfanno? From what I can see, it seems to annotate only one of the ends, but not the second.

For example, a BND variant in a VCF from Delly looks like this:

[...]
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  E30_CL  E43_B
chr22   38186567        BND00000378     T       T[chr6:35221592[        1980    PASS    PRECISE;SVTYPE=BND;SVMETHOD=EMBL.DELLYv0.8.7;END=38186568;CHR2=chr6;POS2=35221592;PE=22;MAPQ=60;CT=3to5;CIPOS=-2,2;CIEND=-2,2;SRMAPQ=60;INSLEN=0;HOMLEN=1;SR=11;SRQ=1;CONSENSUS=TAGAGGGTGAGGAATGCTGGGGAGTAAACACAGGAAGGAGAGACCGGGGAGGGTCACAGTTGCCAAATCCCCAGCACCTAGCACGGTGCTTCGCATGTTGGAGGTGCTCAATAA;CE=1.9414;RDRATIO=1;SOMATIC   GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV 0/1:-50.9548,0,-92.0494:10000:PASS:28322:44779:16457:2:40:24:33:20       0/0:0,-14.7411,-165.291:147:PASS:30566:68123:37557:2:63:0:49:0 

Using the config below (which simply annotates overlapping genes), vcfanno will correctly annotate the variant with the gene that overlaps the chr22:38186567 end, but it doesn't annotate the other breakpoint (chr6:35221592). However, for BNDs, you would want to know which genes are disrupted at both ends, not only one. Is there any option within vcfanno that allows both ends of the breakpoint to be annotated?

[[annotation]]
file="ens102_hg38_protCod_HGNC_sorted.bed.gz"
columns = [4]
names=["Gene"]
ops=["uniq"]

Please note that some SV callers like manta, output 2 entries per BND, for example:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  E20_B   E20_CL
chr11   15366906        MantaBND:321658:0:1:0:0:0:1     A       [chr20:49362013[ATCA    .       PASS    SVTYPE=BND;MATEID=MantaBND:321658:0:1:0:0:0:0;SVINSLEN=3;SVINSSEQ=ATC;SOMATIC;SOMATICSCORE=85;BND_DEPTH=87;MATE_BND_DEPTH=74    PR:SR    66,0:96,0       63,10:94,11
chr20   49362013        MantaBND:321658:0:1:0:0:0:0     G       [chr11:15366906[GATG    .       PASS    SVTYPE=BND;MATEID=MantaBND:321658:0:1:0:0:0:1;SVINSLEN=3;SVINSSEQ=GAT;SOMATIC;SOMATICSCORE=85;BND_DEPTH=74;MATE_BND_DEPTH=87    PR:SR    66,0:96,0       63,10:94,11

so in that case, this is not a big issue as both ends will be annotated, but other SV callers like Delly and SURVIVOR only output one line per BND and therefore only one of the breakpoints ends up being annotated

Thanks Lucia

brentp commented 3 years ago

Hi, unfortunately this would be too difficult to support in vcfanno as it expects the entries in sorted order and looking at the other end of a BND would be a random lookup. As you note, it will work when there is a BND entry in the VCF for each end (which I think is what the vcf spec would dictate). perhaps you could write a post-processor that splits single entries into 2 BNDs before sending to vcfanno?

lconde-ucl commented 3 years ago

Hi Brent thanks for your quick reply. Yes, I can see how the sorting issue can make it difficult to implement. Splitting the BDNs is an easy workaround, so I'll do that. Thanks, Lucia