Closed pedrovalerolara closed 7 years ago
Given the results of comparing the two approaches for solving multiple tridiagonal systems, one thread per system and one CUDA block per system (document attached), we consider it appropriate to focus on the one-thread-per-system approach to compute multiple morphologies in parallel on NVIDIA GPUs. 03-02-2017_1.pdf
This seems to be finished.
NEST-MC currently implements a Hines algorithm, similar to the Thomas algorithm, to solve tridiagonal systems. Although this method is optimal in terms of the number of operations, it is sequential. Other methods, such as Cyclic Reduction (CR) and Parallel Cyclic Reduction (PCR), can solve a system in parallel. In particular, cuSPARSE provides a routine called gtsvStridedBatch() (http://docs.nvidia.com/cuda/cusparse/#axzz4PL9ndAze) which can solve multiple tridiagonal systems in parallel on GPUs.
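For reference, the sequential dependency mentioned above can be seen in a minimal sketch of the textbook Thomas algorithm. This is not the NEST-MC Hines solver; the function name `thomas_solve` and the array layout are assumptions made for illustration, and the sketch assumes a system that needs no pivoting (for example, a diagonally dominant one).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Textbook Thomas algorithm for one tridiagonal system (illustrative only).
// Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
// lower[0] and upper[n-1] are unused. No pivoting is performed, so the
// system is assumed to be well conditioned (e.g. diagonally dominant).
std::vector<double> thomas_solve(const std::vector<double>& lower,
                                 std::vector<double> diag,
                                 const std::vector<double>& upper,
                                 std::vector<double> rhs) {
    const std::size_t n = diag.size();
    if (n == 0) return rhs;

    // Forward elimination: each step needs the result of the previous row,
    // which is why the method is inherently sequential within one system.
    for (std::size_t i = 1; i < n; ++i) {
        const double w = lower[i] / diag[i - 1];
        diag[i] -= w * upper[i - 1];
        rhs[i]  -= w * rhs[i - 1];
    }

    // Back substitution, from the last row up to the first.
    rhs[n - 1] /= diag[n - 1];
    for (std::size_t i = n - 1; i-- > 0;) {
        rhs[i] = (rhs[i] - upper[i] * rhs[i + 1]) / diag[i];
    }
    return rhs;  // rhs now holds the solution x
}
```

Both loops carry a dependency from one row to the next, so the parallelism has to come either from a different algorithm (CR, PCR) or from solving many independent systems at once.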
Moreover, we have experience implementing some of these methods in CUDA; see:
Pedro Valero-Lara, Alfredo Pinelli, Julien Favier, Manuel Prieto-Matías: Block Tridiagonal Solvers on Heterogeneous Architectures. ISPA 2012: 609-616
Pedro Valero-Lara, Alfredo Pinelli, Manuel Prieto-Matías: Fast finite difference Poisson solvers on heterogeneous architectures. Computer Physics Communications 185(4): 1265-1272 (2014)
There is a problem arising from the specific features of NEST-MC: each tridiagonal system can have a different size, and the solver uses an additional vector "p". This makes it difficult to use the cuSPARSE routine. We are therefore going to implement a CUDA kernel that can handle multiple systems of different sizes in parallel.
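The idea of handling systems of different sizes can be sketched on the host in C++. This is only an illustration under stated assumptions, not the planned kernel: the `PackedSystems` layout, the `offsets` array, and the function names are hypothetical, the additional vector "p" is not modeled, and plain Thomas elimination stands in for the Hines solver. The body of `solve_system` corresponds to the work one thread would do for its system.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical packed layout: all systems are stored back to back, and
// system s occupies the index range [offsets[s], offsets[s + 1]).
// This lets every system have its own size.
struct PackedSystems {
    std::vector<std::size_t> offsets;            // size = num_systems + 1
    std::vector<double> lower, diag, upper, rhs; // concatenated coefficients
};

// Work assigned to one "thread": solve system `sys` in place with the
// Thomas algorithm (no pivoting). The solution overwrites rhs.
void solve_system(PackedSystems& s, std::size_t sys) {
    const std::size_t first = s.offsets[sys];
    const std::size_t last  = s.offsets[sys + 1];
    if (first == last) return;  // empty system, nothing to do

    // Forward elimination within this system's index range.
    for (std::size_t i = first + 1; i < last; ++i) {
        const double w = s.lower[i] / s.diag[i - 1];
        s.diag[i] -= w * s.upper[i - 1];
        s.rhs[i]  -= w * s.rhs[i - 1];
    }

    // Back substitution within the same range.
    s.rhs[last - 1] /= s.diag[last - 1];
    for (std::size_t i = last - 1; i-- > first;) {
        s.rhs[i] = (s.rhs[i] - s.upper[i] * s.rhs[i + 1]) / s.diag[i];
    }
}

// Host-side stand-in for the kernel launch. In a CUDA version this loop
// would be replaced by the thread grid, with each thread computing its own
// `sys` from its block and thread indices and returning if it is out of range.
void solve_batch(PackedSystems& s) {
    const std::size_t num_systems = s.offsets.size() - 1;
    for (std::size_t sys = 0; sys < num_systems; ++sys) {
        solve_system(s, sys);
    }
}
```

Because each system reads its own bounds from `offsets`, no padding to a common size is required. Systems of very different sizes would leave some threads idle while others finish, so grouping systems of similar size may be worth considering.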