genome / pindel

Pindel can detect breakpoints of large deletions, medium-sized insertions, inversions, tandem duplications and other structural variants at single-base resolution from next-gen sequence data. It uses a pattern growth approach to identify the breakpoints of these variants from paired-end short reads.
GNU General Public License v3.0

Fix multi-threading bug that causes incorrect results #106

Open jmarshall opened 5 years ago

jmarshall commented 5 years ago

We have observed that, for a fraction of the events detected in our data files, Pindel's output varies with the number of threads it uses. Specifically, the per-sample per-strand counts are incorrect: 350+51 = 401 = 277+124, i.e., the total number of supporting reads is unchanged, but some of the reads have been assigned to the wrong strand:

T=1    D 1  …  sampleA 477 477 0 0 0 0  sampleB 443 443 277 277 124 124
T=2    D 1  …  sampleA 477 477 0 0 0 0  sampleB 443 443 277 277 124 124
T=3    D 1  …  sampleA 477 477 0 0 0 0  sampleB 443 443 277 277 124 124
T=4    D 1  …  sampleA 477 477 0 0 0 0  sampleB 443 443 277 277 124 124
…
T=16   D 1  …  sampleA 477 477 0 0 0 0  sampleB 443 443 350 350 51 51

Output is unchanged at low thread counts, but it starts to differ at T=7, and by T=16 it is dramatically incorrect.

This is fixed (and the output no longer varies with the number of threads used) by correcting ReadBuffer::flush() so that m_rawreads[] entries keep their original order when they are copied into m_filteredReads[], regardless of threading indeterminacy.
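
As a rough illustration of the idea (this is not the actual patch, and `Read`, `passesFilter` and `filterInOrder` are hypothetical names, not Pindel's), an order-preserving parallel filter could look like the sketch below. It assumes an OpenMP-style parallel loop: each iteration writes only to its own slot of a flag array, and the copy into the output happens sequentially, so the result does not depend on how iterations are scheduled across threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a read; not Pindel's actual read type.
struct Read {
    int id;       // position of the read in the input
    char strand;  // '+' or '-'
    bool usable;  // outcome of some per-read check
};

// Hypothetical per-read test; the real filtering logic is more involved.
static bool passesFilter(const Read& r) { return r.usable; }

// Filters `raw` in parallel while keeping the surviving reads in input order.
std::vector<Read> filterInOrder(const std::vector<Read>& raw) {
    // One flag per input read. std::vector<char> is used rather than
    // std::vector<bool> so that each thread writes to a distinct byte.
    std::vector<char> keep(raw.size(), 0);

    // Parallel phase: iteration i touches only keep[i], so no lock is needed.
    // If OpenMP is not enabled, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(raw.size()); ++i) {
        const std::size_t idx = static_cast<std::size_t>(i);
        keep[idx] = passesFilter(raw[idx]) ? 1 : 0;
    }

    // Sequential phase: copy survivors in input order, independent of
    // which thread evaluated which read.
    std::vector<Read> filtered;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (keep[i]) {
            filtered.push_back(raw[i]);
        }
    }
    assert(filtered.size() <= raw.size());
    return filtered;
}
```

By contrast, appending to the output vector from inside the parallel loop (even under a critical section) would leave the order of the filtered reads dependent on thread timing.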

I suspect this patch may also fix or affect #26, which appears to be a similar problem.