arq5x / bedtools

A powerful toolset for genome arithmetic.
http://code.google.com/p/bedtools/
GNU General Public License v2.0
140 stars 85 forks source link

annotating vcf with intersect #111

Closed dwaggott closed 8 years ago

dwaggott commented 8 years ago

I'm trying to annotate a vcf with a number of bedfiles. Not hard but it would be good to confirm the strategy with bedtools.

Objective a) The annotations are sorted but NOT collapsed/merged to avoid annotating positions with regions that don't overlap the position b) The output should have the same number of lines and order as the original vcf c) Multiple hits for a position should have distinct values collapsed d) Only chr, start,stop,annocol1,annocolx are required in the output. i.e. no score,name in -a and no chr, start, stop in -b

Challenge 1) intersect results in multiple hits/rows for a single position in the vcf when the annotations aren't merged. Seems like the easiest thing to do is do a merge on the output, right? 2) Intersect on a bed3 (just the index) and bed4 results in a ragged output (8 columns where there is no match and 7 where there is). Is there a way to avoid defaulting to a bed5 internally or should I add dummy columns to a bed3? See example. 3) merge requires enumeration of all the columns but these are dynamically changing. Any way to default to all columns if -o is set?

Strategy tmp1.bed tmp2.bed

# bcftools view a.vcf |  awk '{OFS="\t";print $1, $2-1, $2}'  > tmp1.bed
bedtools intersect -loj  -a tmp1.bed -b tmp2.bed > out1.bed
bedtools merge -i out1.bed -o distinct -d delim -c 5 > out2.bed