Intersection of multiple vcfs

unique379r commented 5 years ago

Hi Fritz

Is there way to get intersections between multiple sample vcfs ? SURVIVOR merge can only give us common variants between tow or more vcfs comes from different tools which is really good but i also want to know what is unique in that tool.

Desired outputs: a.uniq.vcf b.uniq.vcf a.b.common.vcf

Main problem: I have obtained a merge set for each of the tools. Now i would like to have rare variants between the samples (unique variants which should NOT present in different sample). The problem, i have multiple samples ( around 60) to compare one with other. Any trick/advice would be appreciated.

fritzsedlazeck commented 5 years ago

Hi, thanks for reaching out. So just to get the full picture: you have ~60 samples and for each you have called SV with multiple caller?

What I typically do in that situation is to first merge per sample. You can require a minimum number of callers or not. Next merge across the samples based on the before generated file. You should see the individual support of the callers in the GT columns per sample.

For the question about rare vs. common variants you can use SUP= or SUP_VEC in the info field. The first gives you the number of input vcf file (e.g. sample) that carried that allele. The other (SUP_VEC) is a 0/1 vector for which samples have that variant. Note the order is the same as the input file list, which is the same order for the GT columns.

Let me know if you have further question. Thanks Fritz

unique379r commented 5 years ago

Hi fritz Thank you for the tips. Here are the full pictures of my steps.

Number of sample 58

Call SV from 3 different tools (BreakDancer, Delay and Lumpy)
Obtained a merge set using SURVIVOR merge using following commands.

sample_files:
SV from Breakdancer
SV from Delay
SV from Lumpy
./SURVIVOR merge sample_files 1000 2 1 1 0 50 sample_merged.vcf

Now I used these merge set to merge it again across the samples but using caller supports as 0.
```
samples:
MergeSample1
MergeSample2
MergeSample3
.....
.....
MergeSample58
```

./SURVIVOR merge samples 1000 0 1 1 0 30 Any_EventsAcrossSample.vcf

4. Finally I used **SUPP=1** tag to extract the calls which is only present in a sample. Simple grep "SUPP=1" did the job for me.
5. Now, I converted this final vcf as well as each sample.vcf to bedpe format using vcf2bed from SURVIVOR.

SURVIVOR vcftobed all.singleSample.sort.vcf -1 -1 all.singleSample.sort.vcf.bedpe for i in L.vcf do SURVIVOR vcftobed $i -1 -1 ${i}.bedpe done ...

5. Finally, I simply used awk command to **match the first-second-fourth-fifth columns from all.singleSample.sort.vcf.bedpe to each sample.bedpe**

for file in L.bedpe do echo -e "matching all.singleSample rare events in each sample:" $file awk '{k=$1"\t"$2"\t"$4"\t"$5} NR==FNR{a[k]; next} (k in a)' all.singleSample.sort.vcf.bedpe $file > $file.uniq wc -l $file.uniq echo "Done." done



Doubts: I am not 100% sure whether I am doing right or wrong. But it seems its work and I got the RARE variants from each strain/sample. is that correct?

Thanks for your help and advice. 
Please comment !!

Regards,
Rupesh

fritzsedlazeck commented 5 years ago

Dear Rupesh, that looks all correct to me. The only difference that I am using is that I dont require that the strands are matching. This is setting the 2nd 1 to a 0.

Also I usually operate on the vcf files since they include more information. E.g. You should also be able to see if these calls are supported by 2 or all 3 callers. Cheers Fritz

unique379r commented 5 years ago

Hi Fritz, Thanks for your reply. You mean, in step number 3 ? you are suggesting me not to use strand information ?

Note: Currently I am more focused to obtain RARE Variants. Since i already used step2 and got merge sets (calls should supported by 2 tools).

## from
./SURVIVOR merge samples 1000 0 1 1 0 30 Any_EventsAcrossSample.vcf

## to (should i have to use this ?)
./SURVIVOR merge samples 1000 0 1 0 0 30 Any_EventsAcrossSample.vcf

SURVIVOR merge -h File with VCF names and paths = samples max distance between breakpoints = 1000 Minimum number of supporting caller = 0 [any callers, in this case would be my sample] Take the type into account (1==yes, else no) = 1 Take the strands of SVs into account (1==yes, else no) = 0 [as per your advice ??] Estimate distance based on the size of SV (1==yes, else no) = 0 Minimum size of SVs to be taken into account. = 30 Output VCF filename = Any_EventsAcrossSample.vcf

Thanks Rupesh

fritzsedlazeck commented 5 years ago

It's fine if you use it. I am just saying I typically don't in either of the steps.

Cheers Fritz

unique379r commented 5 years ago

Thanks Fritz for your time and tips. I will try that too to see the differences.

Regards, Rupesh

fritzsedlazeck / SURVIVOR

Intersection of multiple vcfs #69