Open blackball opened 7 years ago
I am thinking about CUDA. Perhaps simply replacing sgemm with cuBLAS and convolution with cuDNN alone could already give a significant boost to performance. However, I can't promise when I will get to this issue.
Are there still plans for this? Would pull requests be considered? I assume something simple to start with would be to add another kad_sgemm_simple that wraps cublasSgemm, just like there is currently the option between an sgemm using BLAS and an sgemm implemented within kautodiff.*.
Something like:
```c
#ifdef HAVE_CUDA
#include <cuda_runtime.h>
#include "cublas_v2.h"
// other preprocessor statements needed for CUDA-C interaction
void kad_sgemm_simple(...)
{
    // GPU memory allocation
    // memcpy from host to GPU
    cublasSgemm(...);
    // memcpy from GPU back to host
}
#endif
```
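Fleshing that out a bit, a self-contained version of such a wrapper could look like the sketch below. This is only an illustration under assumptions (row-major single-precision inputs, a hypothetical function name, no error checking); a real patch would check every cudaError_t and cublasStatus_t and would reuse the handle and device buffers across calls:

```c
#ifdef HAVE_CUDA
#include <cuda_runtime.h>
#include "cublas_v2.h"

/* Hypothetical wrapper (error checking omitted): row-major C(m x n) = A(m x k) * B(k x n).
 * cuBLAS works in column-major order, so we compute C^T = B^T * A^T, which lets the
 * row-major host buffers pass through unchanged. */
static void sgemm_cublas(int m, int n, int k, const float *A, const float *B, float *C)
{
	float *d_A, *d_B, *d_C;
	const float alpha = 1.0f, beta = 0.0f;
	cublasHandle_t handle;

	/* GPU memory allocation */
	cudaMalloc((void**)&d_A, (size_t)m * k * sizeof(float));
	cudaMalloc((void**)&d_B, (size_t)k * n * sizeof(float));
	cudaMalloc((void**)&d_C, (size_t)m * n * sizeof(float));
	/* memcpy from host to GPU */
	cudaMemcpy(d_A, A, (size_t)m * k * sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(d_B, B, (size_t)k * n * sizeof(float), cudaMemcpyHostToDevice);

	cublasCreate(&handle);
	/* column-major view: (B^T)(n x k) * (A^T)(k x m) = (C^T)(n x m) */
	cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k,
	            &alpha, d_B, n, d_A, k, &beta, d_C, n);

	/* memcpy from GPU back to host, then release device resources */
	cudaMemcpy(C, d_C, (size_t)m * n * sizeof(float), cudaMemcpyDeviceToHost);
	cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
	cublasDestroy(handle);
}
#endif
```

Keeping the cublasHandle_t and the device buffers alive between calls is the obvious first optimization, but it does not remove the per-call host-to-device copies of A and B.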
Adding OpenCL support would also be useful, but I don't know of a gemm implementation that is as straightforward as cublasSgemm. I know one of the objectives of kann is to be lightweight and portable, so adding this functionality might be too much? Especially when, for more complex models, the big frameworks would be a more natural choice.
@7PintsOfCherryGarcia
Training and tuning a network is very convenient with a heavy library. But when deploying models on a CPU-only platform, it would be very nice if we could take advantage of the integrated graphics card. If the main branch is kept simple and lightweight, it would still be nice to have CUDA and OpenCL features live in another fork, IMHO.
It would be interesting to hear what @attractivechaos thinks about this. I tested cublasSgemm vs. cblas_sgemm and, as expected, you only start getting performance gains on very large matrices; there is too much overhead from data transfer between host and GPU. Of course, my implementation might not be the best. By the time your models are that large, maybe going with the big frameworks is better.
Dear contributor @blackball, simply use the MKL routine if MKL is detected in the compiler environment; its matrix multiplication performance is a beast and basically overhead-free.
Well, besides, Intel MKL doesn't run on ARM hardware. Calling a matrix multiplication routine from a linear algebra library is in general better than a naive sgemm implementation.
Thank you all. I am not familiar with CUDA. I have heard that with CUDA, moving data between the CPU and GPU can be costly. I wonder how much speedup CUDA will deliver.
The sgemm implementation in kann optionally uses SSE and is reasonably efficient. It is several times faster than most naive implementations (see my blog post). OpenBLAS et al. are about twice as fast, but their multiplication function alone may have more lines of code than the whole of kann. Also, kann can optionally call the BLAS sgemm API, so you can link kann to MKL.
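For reference, the BLAS entry point such a build ends up calling is the standard cblas_sgemm; a generic row-major call has this shape (an illustration only, not a quote of the kann source):

```c
#include <cblas.h>  /* from OpenBLAS; MKL provides the same CBLAS interface through its own headers */

/* Row-major C(M x N) = A(M x K) * B(K x N) through the standard CBLAS call.
 * Whatever conforming BLAS is linked in (OpenBLAS, MKL, ...) does the work. */
static void sgemm_cblas(int M, int N, int K, const float *A, const float *B, float *C)
{
	cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
	            M, N, K,
	            1.0f, A, K,   /* lda = K: columns of A as stored */
	            B, N,         /* ldb = N */
	            0.0f, C, N);  /* ldc = N */
}
```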
Actually, the 2D CNN is probably the slowest part of kann. It would be good to call external libraries for that part. Unfortunately, kann follows Theano's shape convention, which is probably not used often these days. I am not sure which libraries support Theano's shape now.
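For reference, my understanding of "Theano's shape" here is the (batch, channel, row, column) layout for inputs and (out-channel, in-channel, kernel-row, kernel-column) layout for filters; under that assumption, indexing into a flat buffer looks like this sketch (helper name is a placeholder):

```c
/* Assumed "Theano-style" 4-D layouts for 2D convolution:
 *   input  x: (batch, in_channel, height, width)
 *   filter w: (out_channel, in_channel, kernel_h, kernel_w)
 * Both stored as one contiguous float array; d1..d3 are the sizes of the
 * last three dimensions. */
static inline float tensor4d_get(const float *t, int d1, int d2, int d3,
                                 int i0, int i1, int i2, int i3)
{
	return t[((i0 * d1 + i1) * d2 + i2) * d3 + i3];
}
```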
Indeed, the transfer between host and device is massive (99% of the time on my system). The actual sgemm computation is insanely fast. I have attached a file that performs sgemm between A (5000x2000) and B (3000x2000) with: 1. cuBLAS; 2a. OpenBLAS; 2b. MKL; 3a. naive kad_sgemm; 3b. SSE kad_sgemm.
On my system (GNU/Linux with gcc 9.3.0, a Ryzen 3700X, a GTX 1050 Ti, CUDA toolkit 11 with driver 460.39, OpenBLAS 0.3.8), and assuming OpenBLAS is installed in a system directory and can be located during linking, I get the following: test.txt
For openblas + naive kad_sgemm:
```
gcc -Wall -Wextra -DOPENBLAS -I"PATH/TO/YOUR/CUDA"/include/ -L"PATH/TO/YOUR/CUDA"/lib/ -O2 -o test test.c -lcudart -lcublas -lopenblas
./test $RANDOM

Starting routines
CUBLAS:
time for GPU allocation and transfer 222329394 ns
time cublas matmut 35090 ns
time GPU deallocation and transfer 54446963 ns
cublas total time 276838378 ns
OPENBLAS:
openblas 73790051 ns
kad_sgemm:
kad_sgemm_simple 15404507226 ns
```
For openblas + kad_sgemm SSE:
```
gcc -Wall -Wextra -DOPENBLAS -DSSE -I"PATH/TO/YOUR/CUDA"/include/ -L"PATH/TO/YOUR/CUDA"/lib/ -O2 -o test test.c -lcudart -lcublas -lopenblas
./test $RANDOM

Starting routines
CUBLAS:
time for GPU allocation and transfer 216707647 ns
time cublas matmut 33340 ns
time GPU deallocation and transfer 57922556 ns
cublas total time 274691993 ns
OPENBLAS:
openblas 76218451 ns
kad_sgemm:
with SSE
kad_sgemm_simple 5419872690 ns
```
For mkl sgemm + kad_sgemm SSE:
```
gcc -Wall -Wextra -DMKL -DMKLBLAS -m64 -I"/PATH/TO/YOUR/CUDA"/include/ -I"PATH/TO/YOUR/MKL"/include -L"PATH/TO/YOUR/CUDA"/lib/ -L"/PATH/TO/YOUR/MKL"/lib/intel64 -O2 -o test test.c -lcudart -lcublas -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core
./test $RANDOM

Starting routines
CUBLAS:
time for GPU allocation and transfer 213338323 ns
time cublas matmut 34549 ns
time GPU deallocation and transfer 42650421 ns
cublas total time 256048673 ns
MKLBLAS:
mklblas 761791581 ns
kad_sgemm:
with SSE
kad_sgemm_simple 5486498838 ns
```
MKL performed quite poorly; I assume it's because I am using an AMD CPU, so screw Intel!!!!
Once the data lives on the GPU, computation is extraordinarily fast, but the data-transfer overhead makes using CUDA only for matrix multiplication infeasible. On the other hand, if all data (at least the data needed for the forward pass and backpropagation) already resides on the GPU, significantly larger networks could be trained. This could be achieved with something like:
```c
kad_node_t *kann_new_leaf_array(int *offset, kad_node_p *par, uint8_t flag, float x0_01, int n_d, int32_t d[KAD_MAX_DIM])
{
	int i, len, off = offset && par? *offset : -1;
	kad_node_t *p;
	if (off >= 0 && par[off]) return par[(*offset)++];
	p = (kad_node_t*)calloc(1, sizeof(kad_node_t));
	p->n_d = n_d, p->flag = flag;
	memcpy(p->d, d, n_d * sizeof(int32_t));
	len = kad_len(p);
	//******************************************
	// Use CUDA runtime routines if the model will be run on a GPU
#ifdef USE_CUDA
	cudaError_t cudaStat;
	cudaStat = cudaMalloc((void**)&p->x, len * sizeof(*p->x));
	// check cudaStat for errors
	cudaStat = cudaMemset((void*)p->x, 0, len * sizeof(*p->x)); // size is in bytes
	// check cudaStat for errors
#else
	// default to system memory
	p->x = (float*)calloc(len, sizeof(float));
#endif
	//******************************************
	if (p->n_d <= 1) {
		for (i = 0; i < len; ++i)
			p->x[i] = x0_01;
	} else {
		double sdev_inv;
		sdev_inv = 1.0 / sqrt((double)len / p->d[0]);
		for (i = 0; i < len; ++i)
			//******************************************
			// Deal with this: under USE_CUDA, p->x is device memory and
			// cannot be written directly from the host
			p->x[i] = (float)(kad_drand_normal(0) * sdev_inv);
			//******************************************
	}
	if (off >= 0) par[off] = p, ++(*offset);
	return p;
}
```
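One way to handle the part flagged "Deal with this" above is to do the random initialization in a temporary host buffer and copy it to the device in one transfer. A minimal sketch, assuming a hypothetical helper name and that kad_drand_normal() is visible through kautodiff.h:

```c
#ifdef USE_CUDA
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <cuda_runtime.h>
#include "kautodiff.h" // kad_drand_normal() is assumed to be declared here

// Hypothetical helper: initialize a device-resident weight array by filling a
// temporary host buffer and copying it to the GPU in a single transfer.
static int init_weights_on_device(float *dev_x, int len, int32_t d0)
{
	int i;
	double sdev_inv = 1.0 / sqrt((double)len / d0);
	float *tmp = (float*)malloc(len * sizeof(float));
	if (tmp == 0) return -1;
	for (i = 0; i < len; ++i)
		tmp[i] = (float)(kad_drand_normal(0) * sdev_inv);
	cudaMemcpy(dev_x, tmp, len * sizeof(float), cudaMemcpyHostToDevice);
	free(tmp);
	return 0;
}
#endif
```

The gradient array p->g and the scratch buffers used during the forward and backward passes would need the same treatment, which is where the extra code starts to pile up.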
Of course, this means a non-trivial amount of code would need to be added, defeating a core principle of KANN, which is to be small, simple, lean, and mean.
Wow, OpenBLAS is 71 times faster than kann's implementation. Several years ago, it was only about twice as fast on my even older machine. I need to revisit matrix multiplication on more recent CPUs at some point. Anyway, users have the option to link against OpenBLAS or other BLAS implementations with:
```
make CBLAS=/path/to/cblas/root
```
I have just added that to the Getting Started section of README. Thanks!
Hi,
Is there any plan to add CUDA support in the near future? It would be very useful if we want to train a medium-sized network. It would also be very attractive for platforms like the Tegra TK1, etc. Libraries like Caffe and MXNet rely on too many dependencies, and sometimes it takes too much time to resolve these library conflicts during installation.