dpblnt opened this issue 9 months ago
Indeed, CLBlast provides a Netlib BLAS API (see here), but it is not recommended if speed matters: it performs a lot of extra memory copies between host and device, and for level-1 and level-2 routines it will typically give a slowdown rather than a speedup. It is also not widely used.
However, it does exist and should in theory work. Unfortunately it remains a mystery to me why there is so much memory allocation going on, even when you run a very small example. I guess there might be a bug somewhere in the CLBlast Netlib BLAS API, but it is difficult to debug this way. I'm not familiar with numo-linalg unfortunately. One path forward could be to compile your own CLBlast from source and give CMake -DVERBOSE=1. That will enable a lot of extra print statements, which will give us a log of what CLBlast is doing when you call your numo-linalg code.
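For example, a from-source build could look roughly like this (a sketch: -DNETLIB=ON is the option that builds the Netlib-compatible layer, and install paths will vary per system):

```
git clone https://github.com/CNugteren/CLBlast.git
cd CLBlast && mkdir build && cd build
cmake -DNETLIB=ON -DVERBOSE=1 ..
make -j4
sudo make install
```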
I'm using ruby-dnn on top of numo-linalg, into which I load libclblast.so, hoping it is an OpenBLAS drop-in replacement.
sci-libs/clblast-1.5.2-r1 built with -cuda +opencl
Running a basic XOR example in ruby-dnn (sketched below) on my first card, which has 2 GB of memory, 1 GB of which is already in use.
Running the same script on the second device, which is unused, with all 2 GB of memory free.
While doing this I observe the GPU memory usage going to 100% in nvtop; then, with the abort, the GPU memory is freed.
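For reference, the script is essentially ruby-dnn's standard XOR example, something along these lines (a sketch from memory; exact class names may differ between ruby-dnn versions):

```ruby
require "dnn"
require "numo/narray"

include DNN::Models
include DNN::Layers
include DNN::Optimizers
include DNN::Losses

# Tiny XOR dataset: four samples, two features each.
x = Numo::SFloat[[0, 0], [1, 0], [0, 1], [1, 1]]
y = Numo::SFloat[[0], [1], [1], [0]]

model = Sequential.new
model << InputLayer.new(2)
model << Dense.new(16)
model << ReLU.new
model << Dense.new(1)

model.setup(SGD.new, SigmoidCrossEntropy.new)
model.train(x, y, 20000, batch_size: 4, verbose: false)
p model.predict(x)
```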
As far as I know, Numo::Linalg does not explicitly support CLBlast: https://github.com/ruby-numo/numo-linalg/blob/master/doc/select-backend.md
To perform these tests, I simply loaded libclblast.so instead of OpenBLAS.
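Following the backend-selection doc linked above, the loading looks roughly like this (a sketch; the paths are placeholders, and libclblast.so must have been built with its Netlib API enabled):

```ruby
require "numo/linalg/linalg"

# Point Numo::Linalg's BLAS backend at CLBlast's Netlib-compatible
# library instead of OpenBLAS (paths are system-specific).
Numo::Linalg::Blas.dlopen("/usr/lib64/libclblast.so")

# LAPACK routines still come from a regular CPU library.
Numo::Linalg::Lapack.dlopen("/usr/lib64/liblapack.so")
```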
The data fed into the NN is very small; I don't yet understand what is eating up gigabytes of GPU memory, or why.
Am I right to suspect that everything works as expected until the GPU runs out of memory, and that all I need is to free GPU memory after each training epoch? Is there a method I could use to do that, perhaps from outside of Numo::Linalg?