Open lexming opened 2 weeks ago
I can confirm I'm seeing this same issue:
$ module load TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1
$ python -c 'import tensorflow'
2024-11-06 13:57:10.160977: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-06 13:57:14.228266: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-06 13:57:14.228320: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-06 13:57:14.923172: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
These warnings sure are a bit alarming, though probably harmless. Is it worth trying to figure out how to avoid them?
> Is it worth trying to figure out how to avoid them?
I don't think so, given that it seems harmless. Since you also see those "errors", then the issue is probably caused by using CUDA 12.1 instead of 12.2. And we cannot change that at this point.
We are seeing the following errors after a simple
import tensorflow
with TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1:
These are non-fatal: the import completes successfully and TF seems to work normally after these error messages. That's why the sanity checks after installation with EB pass.
Apparently it's a rather common issue (https://github.com/tensorflow/tensorflow/issues/62075) caused by some version mismatch between TF and CUDA. Upstream only tests TF v2.15 with CUDA 12.2, while we use CUDA 12.1. So that might be the reason for these errors.
Does anybody else see it on their systems? If it is caused by a version mismatch, we should see it across the board in EB. Otherwise it might be something else.
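For anyone who wants to check their own installation, here is a minimal sketch that runs the import in a fresh interpreter and reports which of the cuDNN/cuFFT/cuBLAS registration messages show up. The helper names `find_factory_errors` and `check_import` are made up for this example, and the pattern only matches the exact wording quoted in this issue, so other TF versions may need a different regex.

```python
import re
import subprocess
import sys

# Matches the three messages quoted in this issue (cuDNN, cuFFT, cuBLAS).
# Assumption: the wording is the same as in the TF 2.15.1 log shown here.
FACTORY_MSG = re.compile(
    r"Unable to register (cuDNN|cuFFT|cuBLAS) factory: "
    r"Attempting to register factory for plugin \1 "
    r"when one has already been registered"
)


def find_factory_errors(stderr_text):
    """Return the sorted, de-duplicated plugin names found in the log text."""
    return sorted({m.group(1) for m in FACTORY_MSG.finditer(stderr_text)})


def check_import(python=sys.executable):
    """Run `import tensorflow` in a subprocess.

    Returns (exit_code, plugins), where plugins lists the libraries that
    logged a duplicate factory registration on stderr.
    """
    proc = subprocess.run(
        [python, "-c", "import tensorflow"],
        capture_output=True,
        text=True,
    )
    return proc.returncode, find_factory_errors(proc.stderr)
```

After loading the module, calling `check_import()` should return an exit code of 0 together with the affected plugin names if the behaviour matches what is described above, i.e. the messages are printed but the import itself succeeds. A non-zero exit code would point to a different problem.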