fritzsedlazeck / SURVIVOR

Toolset for SV simulation, comparison and filtering
MIT License
347 stars 46 forks source link

How to interpret figures and potential error in output from VennDiagram #125

Closed moldach closed 4 years ago

moldach commented 4 years ago

I'm trying to understand how to interpret the graphical outputs of SURVIVOR using three variant call sets: Delly, Pindel, and Lumpy.

Sorting the VCF files

# system specific
module load nixpkgs/16.09;
module load intel/2016.4;
module load vcftools/0.1.14

vcf-sort delly.vcf > delly_sorted.vcf
vcf-sort pindel.vcf > pindel_sorted.vcf
vcf-sort lumpy.vcf > lumpy_sorted.vcf

obtain merged call set

~/bin/SURVIVOR-master/Debug/SURVIVOR merge sample_files 1000 2 1 1 0 30 sample_merged.vcf

# plot the comparison of multiple input VCF files after merging
## obtain pairwise comparison matrix
~/bin/SURVIVOR-master/Debug/SURVIVOR genComp sample_merged.vcf 1 sample_merged.mat.txt

perl -ne 'print "$1\n" if /SUPP_VEC=([^,;]+)/' sample_merged.vcf | sed -e 's/\(.\)/\1 /g' > sample_merged_overlapp.txt

read into R and plot with heatmap and VennDiagram

module load nixpkgs/16.09
module load gcc/8.3.0
module load r/4.0.0
R --no-save --vanilla << EOF

t=read.table("sample_merged.mat.txt",header=F)
dst <- data.matrix(t(t))

pdf('rplot.pdf')
image(1:nrow(dst), 1:ncol(dst),log(dst), col=rev(heat.colors(10)),axes=F, xlab="", ylab="")
grid(col='black',nx=nrow(dst),ny=ncol(dst),lty=1)
dev.off()

library(VennDiagram)
venn.diagram(list(Delly=which(t[,1]==1), GRIDSS=which(t[,2]==1), Lumpy=which(t[,3]==1)) , fill = c("gray", "orange" ,"blue") , alpha = c(0.5, 0.5, 0.5), cex = 2, lty =2, filename = "my_sample_overlapp.t$
quit()
EOF

This outputs the heatmap:

CLICK IMAGE rplot.pdf

The order of the VCF files in the sample_files was 1) Delly, Pindel, Lumpy) so how would one interpret this for example?

Next is the VennDiagram:

my_sample_overlapp

This is the most confusing (and possibly erroneous) as there is no overlap? I counted 1242 variants in Delly, 86,393 in Pindel and 769 in Lumpy so I'm positive there is overlap. There looks to be nearly 600 consensus calls in the sample_merged.vcf so why would the Venn be only showing 1?

I would appreciate if you could help figuring this out. The sample_merged.mat.txt file looks like this (if that helps):

1.000000e+00    3.689655e-01    8.689655e-01
9.264069e-01    1.000000e+00    6.709957e-01
9.673704e-01    2.975048e-01    1.000000e+00
fritzsedlazeck commented 4 years ago

For the venn diagram you probably need the file: sample_merged_overlapp.txt. That includes the 0 or 1 in vectors that describe in which samples the SV is shared.

In contrast the sample_merged.mat.txt file shows the pairwise matrix. I use this for larger number of samples. Here I use a heatmap to visualize the overlaps between the samples.

Hope that helps. Fritz

dKlee99 commented 2 years ago

@moldach Hi, I've encountered similar problems. May I ask how you managed to solve it? #151 https://github.com/fritzsedlazeck/SURVIVOR/issues/151#issue-994481781