
Usage of parallelism to improve performance #1297

Closed caio-davi closed 5 months ago

caio-davi commented 3 years ago

Is there any implementation taking advantage of GPU parallel processing (CUDA, for example)? At first glance, the algorithm seems amenable to a parallel implementation, but I don't have deep knowledge of the algorithm, so I'm not sure. Also, could you give me some references on the math behind dada2? I read your Nature paper, but it isn't focused on the algorithm or the implementation. Do you have any other publications more related to those aspects? Thank you.

benjjneb commented 3 years ago

The rate-limiting step is the pairwise alignment of sequences. In DADA2 this is based on Needleman-Wunsch optimal alignment, but with a couple of heuristics that help speed things up. The first is a kmer screen, i.e. a fast kmer distance is computed between sequences, and if that distance is too large, no alignment is performed. The second is banding of the alignment, i.e. only the "band" down the diagonal of the dynamic programming matrix is computed.
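For concreteness, here is a minimal sketch of what a banded Needleman-Wunsch score computation looks like. The function name, scoring constants, and band width are illustrative choices for this sketch, not dada2's actual internals; the point is just that only cells within `band` of the main diagonal are ever filled in.

```cpp
// Sketch of banded Needleman-Wunsch (global alignment score only).
// Names and the scoring scheme are illustrative, not dada2's code.
#include <algorithm>
#include <climits>
#include <string>
#include <vector>

int banded_nw_score(const std::string &a, const std::string &b,
                    int match = 5, int mismatch = -4, int gap = -8,
                    int band = 16) {
    const int n = a.size(), m = b.size();
    const int NEG = INT_MIN / 2; // sentinel for cells outside the band
    std::vector<std::vector<int>> dp(n + 1, std::vector<int>(m + 1, NEG));
    dp[0][0] = 0;
    for (int i = 0; i <= n; ++i) {
        // Only touch columns within `band` of the diagonal.
        int jlo = std::max(0, i - band), jhi = std::min(m, i + band);
        for (int j = jlo; j <= jhi; ++j) {
            if (i == 0 && j == 0) continue;
            int best = NEG;
            if (i > 0 && j > 0)
                best = dp[i-1][j-1] + (a[i-1] == b[j-1] ? match : mismatch);
            if (i > 0) best = std::max(best, dp[i-1][j] + gap);
            if (j > 0) best = std::max(best, dp[i][j-1] + gap);
            dp[i][j] = best;
        }
    }
    return dp[n][m]; // stays at NEG if |n - m| exceeds the band
}
```

The payoff is that the work drops from O(n*m) cells to O(n*band), at the cost of missing alignments whose optimal path wanders outside the band.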

Both of those parts, the kmer screen and the computation of the banded DP matrix, are vectorized (a form of parallelism) using SSE instructions to speed them up on x86 processors. That said, these are our own implementations, and more efficient implementations could almost certainly be achieved. It may also be that the way we've parallelized the NW algorithm, vectorizing along the subdiagonal of the DP matrix, is less efficient than other approaches, such as the one developed by T. Rognes that vectorizes across multiple comparisons to the same reference sequence (described for SW alignment here: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-221).
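To illustrate why the subdiagonal (anti-diagonal) is the natural axis for intra-alignment vectorization, here is a scalar sketch of the same NW recurrence reordered into an anti-diagonal "wavefront" sweep. This is a simplified stand-in, not the SSE code in dada2: the key property is that cells on one anti-diagonal depend only on the two previous anti-diagonals, so the inner loop has no loop-carried dependency and its iterations can map onto SIMD lanes.

```cpp
// Sketch of anti-diagonal (wavefront) iteration over the NW DP matrix.
// Scoring constants are illustrative.
#include <algorithm>
#include <string>
#include <vector>

int nw_wavefront(const std::string &a, const std::string &b,
                 int match = 5, int mismatch = -4, int gap = -8) {
    const int n = a.size(), m = b.size();
    std::vector<std::vector<int>> dp(n + 1, std::vector<int>(m + 1, 0));
    for (int i = 0; i <= n; ++i) dp[i][0] = i * gap;
    for (int j = 0; j <= m; ++j) dp[0][j] = j * gap;
    // Sweep anti-diagonals d = i + j. Within one d, each cell reads
    // only from diagonals d-1 and d-2, never from its own diagonal.
    for (int d = 2; d <= n + m; ++d) {
        int ilo = std::max(1, d - m), ihi = std::min(n, d - 1);
        for (int i = ilo; i <= ihi; ++i) { // independent iterations
            int j = d - i;
            int diag = dp[i-1][j-1] + (a[i-1] == b[j-1] ? match : mismatch);
            int up   = dp[i-1][j] + gap;
            int left = dp[i][j-1] + gap;
            dp[i][j] = std::max(diag, std::max(up, left));
        }
    }
    return dp[n][m];
}
```

With banding, the wavefront is clipped to the band, which is exactly why its width (and hence the available intra-sequence parallelism) is bounded, as discussed below. Rognes-style vectorization sidesteps that bound by packing lanes with different query/reference pairs instead.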

It isn't clear to me that the forms of parallelization we are currently using would ultimately benefit from moving to the massive parallelism provided by a GPU. Because we already band our alignments, the subdiagonal is only O(16) elements wide, and that sets the limit of intra-sequence parallelism. Also, the kmer screen we use is based on 5-mers, which given 4 bases yields 4^5 = 2^10 = 1024 possible kmers, so the vectors being compared are 1024 long. That is a bit wider, but still not massively wide, and the total kmer screen time is already significantly less than the total NW alignment time in normal datasets.
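For reference, a 5-mer screen of this shape can be sketched as follows: each sequence is reduced to a vector of 1024 kmer counts, and a cheap distance between two count vectors decides whether the expensive alignment is attempted at all. The distance formula and cutoff handling here are plausible stand-ins, not necessarily dada2's exact choices; the elementwise min over the two 1024-long vectors is the part that vectorizes well.

```cpp
// Sketch of a 5-mer screen over DNA sequences (A/C/G/T).
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>

constexpr int K = 5;
constexpr int NKMERS = 1 << (2 * K); // 4^5 = 1024

int base_code(char c) {
    switch (c) {
        case 'A': return 0; case 'C': return 1;
        case 'G': return 2; case 'T': return 3;
        default:  return 0; // simplistic handling of ambiguous bases
    }
}

// Count every K-mer with a rolling 2-bit-per-base code.
std::array<int, NKMERS> kmer_counts(const std::string &seq) {
    std::array<int, NKMERS> counts{};
    int code = 0;
    const int mask = NKMERS - 1;
    for (std::size_t i = 0; i < seq.size(); ++i) {
        code = ((code << 2) | base_code(seq[i])) & mask;
        if (i + 1 >= K) ++counts[code];
    }
    return counts;
}

// Fraction of kmers NOT shared between the two sequences.
double kmer_dist(const std::string &a, const std::string &b) {
    auto ca = kmer_counts(a), cb = kmer_counts(b);
    int shared = 0;
    for (int i = 0; i < NKMERS; ++i) shared += std::min(ca[i], cb[i]);
    int nkmer = static_cast<int>(std::min(a.size(), b.size())) - K + 1;
    return 1.0 - static_cast<double>(shared) / nkmer;
}
```

In the screen, the alignment would only be attempted when `kmer_dist(a, b)` falls below some cutoff; the cutoff is a tuning parameter, not a value taken from dada2.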

Happy to answer more questions, especially if that isn't clear (it is definitely a short-form response). We do not have a publication on these algorithmic specifics. As we look toward a future of longer reads, though, revisiting and improving computation time on the critical alignment step is something we are considering. It can be faster.