
Usage of parallelism to improve performance #1297

Closed caio-davi closed 5 months ago

caio-davi commented 3 years ago

Is there any implementation taking advantage of GPU parallel processing (CUDA, for example)? At first glance, the algorithm seems amenable to a parallel implementation, but I don't have deep knowledge of the algorithm, so I'm not sure. Also, could you give me some references on the math behind dada2? I read your Nature paper, but it isn't focused on the algorithm or the implementation. Do you have any other publications more related to those aspects? Thank you.

benjjneb commented 3 years ago

The rate-limiting step is the pairwise alignment of sequences. In DADA2 this is based on Needleman-Wunsch optimal alignment, but with a couple of heuristics that help speed things up. The first is a kmer screen, i.e. a fast kmer distance is computed between sequences, and if that distance is too large, no alignment is performed. The second is banding of the alignment, i.e. only the "band" down the diagonal of the dynamic programming matrix is computed.
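For concreteness, here is a minimal sketch of what a banded Needleman-Wunsch score computation looks like. The function name, scoring constants, and band width are illustrative choices for this sketch, not dada2's actual internals; the point is just that only cells within `band` of the main diagonal are ever filled in.

```cpp
// Sketch of banded Needleman-Wunsch (global alignment score only).
// Names and the scoring scheme are illustrative, not dada2's code.
#include <algorithm>
#include <climits>
#include <string>
#include <vector>

int banded_nw_score(const std::string &a, const std::string &b,
                    int match = 5, int mismatch = -4, int gap = -8,
                    int band = 16) {
    const int n = a.size(), m = b.size();
    const int NEG = INT_MIN / 2; // sentinel for cells outside the band
    std::vector<std::vector<int>> dp(n + 1, std::vector<int>(m + 1, NEG));
    dp[0][0] = 0;
    for (int i = 0; i <= n; ++i) {
        // Only touch columns within `band` of the diagonal.
        int jlo = std::max(0, i - band), jhi = std::min(m, i + band);
        for (int j = jlo; j <= jhi; ++j) {
            if (i == 0 && j == 0) continue;
            int best = NEG;
            if (i > 0 && j > 0)
                best = dp[i-1][j-1] + (a[i-1] == b[j-1] ? match : mismatch);
            if (i > 0) best = std::max(best, dp[i-1][j] + gap);
            if (j > 0) best = std::max(best, dp[i][j-1] + gap);
            dp[i][j] = best;
        }
    }
    return dp[n][m]; // stays at NEG if |n - m| exceeds the band
}
```

The payoff is that the work drops from O(n*m) cells to O(n*band), at the cost of missing alignments whose optimal path wanders outside the band.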

Both of those parts, the kmer screen and the computation of the banded DP matrix, are vectorized (a form of parallelism) using SSE instructions to speed them up on x86 processors. That said, these are our own implementations, and more efficient implementations could almost certainly be achieved. It may also be that the way we've parallelized the NW algorithm, vectorizing along the subdiagonal of the DP matrix, is less efficient than other approaches, such as the one developed by T. Rognes that vectorizes across multiple comparisons to the same reference sequence (described for SW alignment here: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-221).
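To illustrate why the subdiagonal (anti-diagonal) is the natural axis for intra-alignment vectorization, here is a scalar sketch of the same NW recurrence reordered into an anti-diagonal "wavefront" sweep. This is a simplified stand-in, not the SSE code in dada2: the key property is that cells on one anti-diagonal depend only on the two previous anti-diagonals, so the inner loop has no loop-carried dependency and its iterations can map onto SIMD lanes.

```cpp
// Sketch of anti-diagonal (wavefront) iteration over the NW DP matrix.
// Scoring constants are illustrative.
#include <algorithm>
#include <string>
#include <vector>

int nw_wavefront(const std::string &a, const std::string &b,
                 int match = 5, int mismatch = -4, int gap = -8) {
    const int n = a.size(), m = b.size();
    std::vector<std::vector<int>> dp(n + 1, std::vector<int>(m + 1, 0));
    for (int i = 0; i <= n; ++i) dp[i][0] = i * gap;
    for (int j = 0; j <= m; ++j) dp[0][j] = j * gap;
    // Sweep anti-diagonals d = i + j. Within one d, each cell reads
    // only from diagonals d-1 and d-2, never from its own diagonal.
    for (int d = 2; d <= n + m; ++d) {
        int ilo = std::max(1, d - m), ihi = std::min(n, d - 1);
        for (int i = ilo; i <= ihi; ++i) { // independent iterations
            int j = d - i;
            int diag = dp[i-1][j-1] + (a[i-1] == b[j-1] ? match : mismatch);
            int up   = dp[i-1][j] + gap;
            int left = dp[i][j-1] + gap;
            dp[i][j] = std::max(diag, std::max(up, left));
        }
    }
    return dp[n][m];
}
```

With banding, the wavefront is clipped to the band, which is exactly why its width (and hence the available intra-sequence parallelism) is bounded, as discussed below. Rognes-style vectorization sidesteps that bound by packing lanes with different query/reference pairs instead.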

It isn't clear to me that the forms of parallelization we are currently using would ultimately benefit from moving to the massive parallelism provided by a GPU. Because we already band our alignments, the subdiagonal is only O(16) elements wide, and that sets the limit of intra-sequence parallelism. Also, the kmer screen we use is based on 5-mers, which given 4 bases yields 4^5 = 2^10 = 1024 possible kmers, so the vectors being compared are 1024 long. That is a bit wider, but still not massively wide, and the total kmer screen time is already significantly less than the total NW alignment time in normal datasets.
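For reference, a 5-mer screen of this shape can be sketched as follows: each sequence is reduced to a vector of 1024 kmer counts, and a cheap distance between two count vectors decides whether the expensive alignment is attempted at all. The distance formula and cutoff handling here are plausible stand-ins, not necessarily dada2's exact choices; the elementwise min over the two 1024-long vectors is the part that vectorizes well.

```cpp
// Sketch of a 5-mer screen over DNA sequences (A/C/G/T).
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>

constexpr int K = 5;
constexpr int NKMERS = 1 << (2 * K); // 4^5 = 1024

int base_code(char c) {
    switch (c) {
        case 'A': return 0; case 'C': return 1;
        case 'G': return 2; case 'T': return 3;
        default:  return 0; // simplistic handling of ambiguous bases
    }
}

// Count every K-mer with a rolling 2-bit-per-base code.
std::array<int, NKMERS> kmer_counts(const std::string &seq) {
    std::array<int, NKMERS> counts{};
    int code = 0;
    const int mask = NKMERS - 1;
    for (std::size_t i = 0; i < seq.size(); ++i) {
        code = ((code << 2) | base_code(seq[i])) & mask;
        if (i + 1 >= K) ++counts[code];
    }
    return counts;
}

// Fraction of kmers NOT shared between the two sequences.
double kmer_dist(const std::string &a, const std::string &b) {
    auto ca = kmer_counts(a), cb = kmer_counts(b);
    int shared = 0;
    for (int i = 0; i < NKMERS; ++i) shared += std::min(ca[i], cb[i]);
    int nkmer = static_cast<int>(std::min(a.size(), b.size())) - K + 1;
    return 1.0 - static_cast<double>(shared) / nkmer;
}
```

In the screen, the alignment would only be attempted when `kmer_dist(a, b)` falls below some cutoff; the cutoff is a tuning parameter, not a value taken from dada2.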

Happy to answer more questions, especially if that isn't clear (it is definitely a short-form response). We do not have a publication on these algorithmic specifics. As we look toward a future of longer reads, though, revisiting and improving computation time on the critical alignment step is something we are considering. It can be faster.