The BiCG solver in OpenCL was introduced in r1349. It uses the clAmdBlas library
for all vector-related calculations. Using the USE_CLBLAS compiler option together
with the BiCG solver reduces the gap between the matrix-vector multiplications to a
small fraction of what it was before.
So far it has been tested only on an AMD GPU.
Other solvers seem more complicated in their direct translation to OpenCL, and
they will probably perform slower. However, to give more flexibility (in case
of poor convergence of BiCG), translating a few more solvers seems desirable.
r1349 - 3466ed78faa8d4b116eea025906a4e9accf742e4
Original comment by Marcus.H...@gmail.com
on 31 May 2014 at 4:49
Indeed, that is a nice proof-of-principle that can be used to estimate
potential acceleration. However, I think that a more convenient (and scalable)
approach is to leave iterative.c almost intact, but instead concentrate on
linalg.c.
So all functions in the latter should be rewritten (ifdef OCL_BLAS) through
calls to clBLAS. Actually, it may be possible to use the same symbols (xvec,
pvec, etc.) and function calls in iterative.c. The only difference is that they
will be defined either as standard C vectors or as OpenCL vectors, depending on
the compilation mode. Awareness of the actual type of these vectors will
only be required at the start and end of the iterative solvers (to move the
vectors into or out of GPU memory).
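For illustration, such a compile-time switch could look roughly as follows. This is only a sketch: the names vec_t, vec_axpy and ocl_queue are made up here (not the actual linalg.c symbols), and the clblasZaxpy call is written according to the clBLAS public header.

```c
#include <stddef.h>
#include <complex.h>
#ifdef OCL_BLAS
#include <clBLAS.h>
typedef cl_mem vec_t;                 /* vector lives in GPU memory */
extern cl_command_queue ocl_queue;    /* set up elsewhere during initialization */
#else
typedef double complex *vec_t;        /* vector lives in host memory */
#endif

/* y += a*x for vectors of n complex elements; iterative.c would call this
 * identically in both compilation modes. */
static void vec_axpy(vec_t y, const vec_t x, double complex a, size_t n)
{
#ifdef OCL_BLAS
	cl_double2 alpha = {{creal(a), cimag(a)}};
	clblasZaxpy(n, alpha, x, 0, 1, y, 0, 1, 1, &ocl_queue, 0, NULL, NULL);
#else
	for (size_t i = 0; i < n; i++) y[i] += a*x[i];
#endif
}
```

With such wrappers, only the code that allocates the vectors and copies them to/from the GPU at the start and end of the solver needs to know which mode is active.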
Original comment by yurkin
on 3 Aug 2014 at 5:55
This has already been using clBLAS for some time - #204
clBLAS is no longer developed, which has recently caused an unusual bug - #331. This adds motivation to switch to another library. For instance, CLBlast is actively (and better) developed and has an API similar to that of clBLAS.
Another application of clBLAS is to compute the inner product inside matvec. It is used only for a few iterative solvers (in particular, not for BiCG - the only one currently using clBLAS), but it involves a large-buffer transfer from GPU memory. The latter can become a bottleneck if other optimizations are implemented.
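For concreteness, a rough sketch (not ADDA code) of what moving that inner product onto the GPU could look like with clBLAS. The wrapper name gpu_dot and the preallocated result/scratch buffers are hypothetical; the clblasZdotu call follows the clBLAS public header (clBLAS requires a scratch buffer for dot, sized according to its documentation).

```c
#include <complex.h>
#include <clBLAS.h>

/* Unconjugated dot product of two n-element complex vectors that already
 * reside in GPU memory. Only one 16-byte scalar is read back to the host;
 * the vectors themselves never cross the bus. */
double complex gpu_dot(cl_command_queue queue, cl_mem x, cl_mem y, size_t n,
                       cl_mem result, cl_mem scratch) /* both preallocated */
{
	cl_double2 res;
	clblasZdotu(n, result, 0, x, 0, 1, y, 0, 1, scratch, 1, &queue, 0, NULL, NULL);
	/* read back only the scalar result */
	clEnqueueReadBuffer(queue, result, CL_TRUE, 0, sizeof(cl_double2), &res, 0, NULL, NULL);
	return res.s[0] + res.s[1]*I;
}
```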
Another issue arises with clBLAS or, more generally, whenever the whole iteration is executed on the GPU. The only natural synchronization point is when the residual is updated (or some other scalar coefficients are computed). Therefore, the timing for the matrix-vector product becomes completely inadequate. The only ways to fix this are either to measure timing inside the kernels (but I am not sure whether that is possible) or to add some ad hoc synchronization points. The latter may affect the performance, but not significantly (still, this can be tested). There have been similar considerations for the MPI timing, but I could not find any discussion in the issues (maybe there are some notes in the source code).
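To make the options concrete, here is a plain-OpenCL sketch (no ADDA specifics) of both: an ad hoc synchronization point via clFinish, and per-kernel timing via event profiling, which is essentially the "measure timing inside kernels" variant (it requires the queue to be created with CL_QUEUE_PROFILING_ENABLE).

```c
#include <CL/cl.h>
#include <time.h>

/* Option 1: ad hoc synchronization points around the matvec part. */
double time_matvec_with_sync(cl_command_queue queue)
{
	struct timespec t0, t1;
	clFinish(queue);                       /* make sure previously queued work is done */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* ... enqueue the FFT/matvec kernels here ... */
	clFinish(queue);                       /* wait for the matvec to complete */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) + 1e-9*(t1.tv_nsec - t0.tv_nsec);
}

/* Option 2: per-kernel timing via event profiling. The counters can be read
 * later (e.g. at the natural synchronization point when the residual is
 * computed), so the iteration itself is not interrupted. */
double kernel_seconds(cl_event ev)
{
	cl_ulong start, end;
	clWaitForEvents(1, &ev);               /* no-op if the event has already completed */
	clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
	clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
	return 1e-9*(double)(end - start);
}
```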
This actually applies to many OpenCL issues, but here are tests of the current ocl mode in ADDA (including with OCL_BLAS) on various GPUs (vs. the seq mode on different CPUs). They have been performed together with Michel Gross. Look inside the file comparison.pdf for details, but the conclusions so far are:
1) the main bottleneck is the 3D FFT rather than moving memory to/from the GPU
2) OCL_BLAS helps a lot for fast GPUs, because it accelerates the BLAS operations (not because it removes the memory transfers)
3) for fast GPUs, the bottleneck is related to memory bandwidth (for the 3D FFT calculation) rather than pure computational power (TFLOPS). Thus, switching to single precision (#119) is not expected to provide huge gains (factors of up to 64 based on TFLOPS values for some GPUs) but rather close to a two-fold acceleration (based on memory bandwidth).
4) there exist other issues (#226, #248) that may cause a major drop in performance for some problems, so ADDA is far from being mature in this respect.
As a side note, we have never seriously considered CUDA, so as not to be limited to Nvidia GPUs. However, in a limited number of tests, the CUDA FFT routines showed themselves to be about 1.5 times faster than clFFT. I guess, though, that a systematic comparison of the two has probably already been performed by others.
Original issue reported on code.google.com by
Marcus.H...@gmail.com
on 31 May 2014 at 3:36