fritzsedlazeck / Sniffles

Structural variation caller using third generation sequencing

snfs2vcf is different from merged vcf by bcftools #321

Open maxineliu opened 2 years ago

maxineliu commented 2 years ago

Hello, I am working with 12 very large aligned BAM files, so I had to split each one into 11 chromosomes, unplaced scaffolds, and unmapped reads. For each sample, Sniffles therefore generated 13 .snf files (11 chromosomes + 1 unplaced scaffolds + 1 unmapped reads). I then picked 2 samples to test Sniffles' population calling function; in other words, I used 26 .snf files to generate one .vcf file. For comparison, I also generated one .vcf per sample with Sniffles from that sample's 13 .snf files, and then merged the two per-sample VCF files into one using bcftools. Finally, I used bcftools view to check whether the two results differ. They do: the number of SVs is 1793617 vs 2689114, with the higher number coming from bcftools. Why is there such a big difference?

Many thanks, Maxine

fritzsedlazeck commented 2 years ago

I am honestly not sure I follow all the split BAM logic. One reason is for sure that bcftools doesn't allow for differences in the start/stop coordinates, whereas allowing for them is good practice and we implemented that. Thus you will probably see that the bcftools results are all rare variants across the samples, while in the Sniffles2 merge output more SVs are supported across samples. Hope that helps, Fritz

maxineliu commented 2 years ago

@fritzsedlazeck Hi, Fritz. Thank you so much for the response! Can I understand your explanation this way: bcftools merge would give me more variants, including some rare variants that only exist in one or a few individuals, while Sniffles2 merge would tend to give me variants that are common across samples? If so, more variants mean more information, I guess? Maxine

wdecoster commented 2 years ago

bcftools merge expects exact matches for the coordinates, while sniffles/SURVIVOR/jasmine (tools specific for SVs) allow some wobble around those breakpoints.

Say that sample 1 has a 500 bp deletion at chr1:23456-23956 and sample 2 has a 499 bp deletion at chr1:23457-23956. bcftools will then treat these as two variants, while it is highly likely that the two deletions are actually the same variant.
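
To make the difference concrete, here is a minimal Python sketch of the two merging strategies applied to the deletions from the example above. It is only an illustration of the idea, not the actual logic of bcftools, Sniffles2, SURVIVOR, or jasmine: the `SV` record, both function names, and the `max_dist=50` tolerance are assumptions made up for this example.

```python
from typing import NamedTuple


class SV(NamedTuple):
    """Hypothetical minimal SV record, used only for this illustration."""
    sample: str
    chrom: str
    start: int
    end: int
    svtype: str


def merge_exact(calls):
    """Group calls only when chromosome, start, end and type are identical."""
    groups = {}
    for c in calls:
        groups.setdefault((c.chrom, c.start, c.end, c.svtype), []).append(c)
    return list(groups.values())


def merge_with_wobble(calls, max_dist=50):
    """Greedy grouping that tolerates small breakpoint differences.

    A call joins the first existing group whose first member has the same
    chromosome and type and whose start and end are both within max_dist bp.
    max_dist is an arbitrary value chosen for this sketch.
    """
    groups = []
    for c in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        for g in groups:
            rep = g[0]
            if (rep.chrom == c.chrom and rep.svtype == c.svtype
                    and abs(rep.start - c.start) <= max_dist
                    and abs(rep.end - c.end) <= max_dist):
                g.append(c)
                break
        else:
            # No compatible group found, so this call starts a new one.
            groups.append([c])
    return groups


calls = [
    SV("sample1", "chr1", 23456, 23956, "DEL"),  # 500 bp deletion
    SV("sample2", "chr1", 23457, 23956, "DEL"),  # 499 bp deletion
]

print(len(merge_exact(calls)))        # 2: coordinates differ by 1 bp
print(len(merge_with_wobble(calls)))  # 1: treated as the same variant
```

With exact matching, every 1 bp disagreement between samples creates an extra record, which is one way a coordinate-exact merge can end up with a higher total SV count than a breakpoint-tolerant merge of the same calls.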