dpblnt opened this issue 9 months ago
Indeed, CLBlast provides a Netlib BLAS API (see here), but it is not recommended if speed matters: it performs a lot of extra memory copies between host and device, and for level-1 and level-2 routines it will typically give a slowdown rather than a speedup. It is also not widely used.
However, it does exist and should in theory work. Unfortunately it remains a mystery to me why there is so much memory allocation going on, even when you run a very small example. I guess there might be a bug somewhere in the CLBlast Netlib BLAS API, but it is difficult to debug this way. I'm not familiar with numo-linalg unfortunately. One path forward could be to compile your own CLBlast from source and give CMake -DVERBOSE=1. That will enable a lot of extra print statements, which will give us a log of what CLBlast is doing when you call your numo-linalg code.
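For example, a from-source build could look roughly like this (a sketch: -DNETLIB=ON is the option that builds the Netlib-compatible layer, and install paths will vary per system):

```
git clone https://github.com/CNugteren/CLBlast.git
cd CLBlast && mkdir build && cd build
cmake -DNETLIB=ON -DVERBOSE=1 ..
make -j4
sudo make install
```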
I'm using ruby-dnn on top of numo-linalg, into which I load libclblast.so, hoping it is an OpenBLAS drop-in replacement.
sci-libs/clblast-1.5.2-r1 built with -cuda +opencl
Running a basic XOR example in ruby-dnn (sketched below) on my first card, which has 2 GB of memory, 1 GB of which is already in use.
Running the same script on the second device, which is unused, with all 2 GB of memory free.
While doing this I observe the GPU memory usage going to 100% in nvtop; then, with the abort, the GPU memory is freed.
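For reference, the script is essentially ruby-dnn's standard XOR example, something along these lines (a sketch from memory; exact class names may differ between ruby-dnn versions):

```ruby
require "dnn"
require "numo/narray"

include DNN::Models
include DNN::Layers
include DNN::Optimizers
include DNN::Losses

# Tiny XOR dataset: four samples, two features each.
x = Numo::SFloat[[0, 0], [1, 0], [0, 1], [1, 1]]
y = Numo::SFloat[[0], [1], [1], [0]]

model = Sequential.new
model << InputLayer.new(2)
model << Dense.new(16)
model << ReLU.new
model << Dense.new(1)

model.setup(SGD.new, SigmoidCrossEntropy.new)
model.train(x, y, 20000, batch_size: 4, verbose: false)
p model.predict(x)
```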
As far as I know, Numo::Linalg does not explicitly support CLBlast: https://github.com/ruby-numo/numo-linalg/blob/master/doc/select-backend.md
To perform these tests, I simply loaded libclblast.so instead of OpenBLAS.
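Following the backend-selection doc linked above, the loading looks roughly like this (a sketch; the paths are placeholders, and libclblast.so must have been built with its Netlib API enabled):

```ruby
require "numo/linalg/linalg"

# Point Numo::Linalg's BLAS backend at CLBlast's Netlib-compatible
# library instead of OpenBLAS (paths are system-specific).
Numo::Linalg::Blas.dlopen("/usr/lib64/libclblast.so")

# LAPACK routines still come from a regular CPU library.
Numo::Linalg::Lapack.dlopen("/usr/lib64/liblapack.so")
```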
The data fed into the NN is very small; I don't yet understand what is eating up gigabytes of GPU memory, or why.
Am I right to suspect that everything works as expected until the GPU runs out of memory, and that all I need is to free GPU memory after each training epoch? Is there a method I could use to do that, perhaps from outside of Numo::Linalg?