allegroai / clearml-fractional-gpu

ClearML Fractional GPU - Run multiple containers on the same GPU with driver level memory limitation ✨ and compute time-slicing
https://clear.ml

failed to create cublas handle: the resource allocation failed #4

Open blinor opened 3 months ago

blinor commented 3 months ago

Hey there, I am trying to run a simple TensorFlow training in a Docker container with fractional-gpu. No matter which one I use, I always get:

`>>> model.fit(x_train, y_train, epochs=50, batch_size=1000)
Epoch 1/50
2024-06-06 10:53:20.251154: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:185] failed to create cublas handle: the resource allocation failed
2024-06-06 10:53:20.251203: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:188] Failure to initialize cublas may be due to OOM (cublas needs some free memory when you initialize it, and your deep-learning framework may have preallocated more than its fair share), or may be because this binary was not built with support for the GPU in your machine.
2024-06-06 10:53:20.251227: W external/local_xla/xla/stream_executor/stream.cc:1020] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node sequential/dense/MatMul defined at (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
  File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1807, in fit
  File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1401, in train_function
  File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1384, in step_function
  File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1373, in run_step
  File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1150, in train_step
  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
  File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 590, in call
  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
  File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/base_layer.py", line 1149, in call
  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler
  File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/sequential.py", line 398, in call
  File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/functional.py", line 515, in call
  File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/functional.py", line 672, in _run_internal_graph
  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
  File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/base_layer.py", line 1149, in call
  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler
  File "/usr/local/lib/python3.10/dist-packages/keras/src/layers/core/dense.py", line 241, in call
Blas xGEMV launch failed : a.shape=[1,1000,784], b.shape=[1,784,1], m=1000, n=1, k=784 [[{{node sequential/dense/MatMul}}]] [Op:__inference_train_function_932]`

With the official tensorflow/tensorflow:latest-gpu image, everything works as expected.
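Since the log itself points at OOM during cuBLAS initialization ("your deep-learning framework may have preallocated more than its fair share"), one thing that may be worth checking is whether TensorFlow's default preallocation is exhausting the memory limit inside the container. The sketch below asks TensorFlow to allocate GPU memory on demand instead. It is an assumption that this is relevant here, not a confirmed fix for fractional-gpu; the helper names `enable_gpu_memory_growth` and `apply_memory_growth` are hypothetical, while `TF_FORCE_GPU_ALLOW_GROWTH` and `tf.config.experimental.set_memory_growth` are standard TensorFlow options.

```python
import os


def enable_gpu_memory_growth(env=None):
    """Request on-demand GPU allocation via TF_FORCE_GPU_ALLOW_GROWTH.

    Must run before TensorFlow initializes the GPU. An existing value is kept.
    """
    env = os.environ if env is None else env
    env.setdefault("TF_FORCE_GPU_ALLOW_GROWTH", "true")
    return env["TF_FORCE_GPU_ALLOW_GROWTH"]


def apply_memory_growth():
    """Make the same request through the TensorFlow API.

    Returns the number of GPUs configured (0 if TensorFlow is not installed).
    """
    try:
        import tensorflow as tf
    except ImportError:
        return 0
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Raises RuntimeError if the GPU has already been initialized.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


if __name__ == "__main__":
    enable_gpu_memory_growth()
    print("GPUs configured for memory growth:", apply_memory_growth())
```

Both calls need to happen at the very start of the session, before the model is built and `model.fit` is called. If the error persists with memory growth enabled, preallocation is probably not the cause.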