NVIDIA / tensorflow

An Open Source Machine Learning Framework for Everyone
https://developer.nvidia.com/deep-learning-frameworks
Apache License 2.0

Nvidia Tensorflow 1.15 does not use RTX3070 GPU due to failing to load CUDA library #55

Closed visheshmistry closed 2 years ago

visheshmistry commented 2 years ago

System information

Device: Intel i7 11th Gen with RTX 3070 GPU and 32GB RAM
OS: Ubuntu 20.04
CUDA: 11.2
cuDNN: 8.1.0
NVIDIA driver version: 470.103.01
TensorFlow: 1.15.5

Describe the current behavior

Hi all. I have an RTX 3070 GPU on Ubuntu and I want to run TF 1.15 code. I installed TF 1.15 by following this article: https://www.pugetsystems.com/labs/hpc/How-To-Install-TensorFlow-1-15-for-NVIDIA-RTX30-GPUs-without-docker-or-CUDA-install-2005/. It describes using NVIDIA's build of TensorFlow 1.15 for RTX 30xx GPUs.

However, whenever I create a session in TensorFlow, one CUDA library (libcublas.so.11) always fails to load, and no GPU device is registered.

Describe the expected behavior

The TensorFlow session should be created on the GPU.

Code to reproduce the issue

import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

Other info / logs

When creating a TensorFlow session, I get the following output:

2022-03-14 11:36:49.675521: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2496000000 Hz
2022-03-14 11:36:49.675927: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558665b3b9c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-03-14 11:36:49.675940: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2022-03-14 11:36:49.676506: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-03-14 11:36:49.704894: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-14 11:36:49.705222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1666] Found device 0 with properties:
name: NVIDIA GeForce RTX 3070 major: 8 minor: 6 memoryClockRate(GHz): 1.725
pciBusID: 0000:01:00.0
2022-03-14 11:36:49.705237: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-03-14 11:36:49.706237: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: /home/vishesh/anaconda3/envs/tf1.15/lib/python3.8/site-packages/tensorflow_core/python/../../nvidia/cublas/lib/libcublas.so.11: undefined symbol: cublasLtGetStatusString, version libcublasLt.so.11; LD_LIBRARY_PATH: :/usr/local/cuda-11.2/lib64:/usr/local/cuda-11.2/lib64:/home/vishesh/anaconda3/envs/altered/lib/
2022-03-14 11:36:49.721900: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-03-14 11:36:49.722054: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-03-14 11:36:49.723891: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2022-03-14 11:36:49.725394: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-03-14 11:36:49.725463: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-03-14 11:36:49.725471: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1689] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2022-03-14 11:36:49.785284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-03-14 11:36:49.785307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] 0
2022-03-14 11:36:49.785310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1225] 0: N
2022-03-14 11:36:49.786443: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-14 11:36:49.786787: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558664a3a3b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-03-14 11:36:49.786796: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 3070, Compute Capability 8.6
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
2022-03-14 11:36:49.787259: I tensorflow/core/common_runtime/direct_session.cc:359] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device

nluehr commented 2 years ago

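The error indicates that libcublas.so.11 is being loaded from /home/vishesh/anaconda3/envs/tf1.15/lib/python3.8/site-packages/nvidia/cublas/lib/libcublas.so.11, as expected. However, libcublasLt.so.11 is possibly being loaded from /usr/local/cuda-11.2/lib64, which might explain the conflict: the undefined symbol cublasLtGetStatusString is one that this libcublas.so.11 expects to find in libcublasLt.so.11, and a libcublasLt.so.11 from a different CUDA install may not provide it.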

To check which libcublasLt.so.11 is getting picked up, you can run

ldd /home/vishesh/anaconda3/envs/tf1.15/lib/python3.8/site-packages/nvidia/cublas/lib/libcublas.so.11
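If the ldd output is long, a small script can pick out the relevant line. The following is only a minimal sketch and not part of the NVIDIA build: `resolved_path` and `ldd` are hypothetical helper names, and the parsing assumes the usual `name => /path (0x...)` layout of ldd output.

```python
import subprocess


def resolved_path(ldd_output, lib_prefix="libcublasLt.so"):
    """Return the path ldd resolved for the first library named lib_prefix*.

    Returns None if the library is not listed or is reported as "not found".
    """
    for line in ldd_output.splitlines():
        name, sep, target = line.strip().partition(" => ")
        if not sep or not name.startswith(lib_prefix):
            continue
        if target.startswith("not found"):
            return None
        # Drop the trailing load address, e.g. "(0x00007f...)".
        return target.split(" (")[0]
    return None


def ldd(path):
    """Run ldd on a shared library and return its text output."""
    return subprocess.run(
        ["ldd", path], check=True, capture_output=True, text=True
    ).stdout
```

Calling `resolved_path(ldd(...))` on the libcublas.so.11 path above should return a path inside the conda environment. A path under /usr/local/cuda-11.2/lib64 would point to the conflict.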

Also, does the failure still happen if /usr/local/cuda-11.2 is removed from LD_LIBRARY_PATH?
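One way to try this without editing any shell configuration is to run the reproduction in a child process with a filtered environment. This is a rough sketch under stated assumptions: `strip_entries` and `run_without` are hypothetical helpers, not part of TensorFlow or CUDA. A child process is used because the dynamic linker reads LD_LIBRARY_PATH when a process starts, so changing `os.environ` inside an already running Python interpreter does not affect later library loads.

```python
import os
import subprocess
import sys


def strip_entries(path_value, prefix):
    """Remove every entry of a colon-separated path list that starts with prefix.

    Empty entries are dropped as well; the dynamic linker treats an empty
    entry as the current directory, which is rarely intended.
    """
    kept = [p for p in path_value.split(":") if p and not p.startswith(prefix)]
    return ":".join(kept)


def run_without(prefix, argv):
    """Run argv in a child process whose LD_LIBRARY_PATH omits prefix."""
    env = dict(os.environ)
    env["LD_LIBRARY_PATH"] = strip_entries(env.get("LD_LIBRARY_PATH", ""), prefix)
    return subprocess.run(argv, env=env).returncode
```

For example, `run_without("/usr/local/cuda-11.2", [sys.executable, "repro.py"])`, where `repro.py` is a hypothetical file holding the two-line reproduction from the report.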

visheshmistry commented 2 years ago

Yes, running the above command showed that /usr/local/cuda-11.2/lib64 was causing the conflict. Removing /usr/local/cuda-11.2 from LD_LIBRARY_PATH solved the issue. Thank you!

I'll change LD_LIBRARY_PATH for this particular conda environment. That should make it work every time I activate it without affecting the system CUDA.
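For anyone wanting to do the same, here is a sketch of one possible way to set this up. It relies on conda sourcing `*.sh` scripts from an environment's `etc/conda/activate.d` and `etc/conda/deactivate.d` directories. Everything else is an assumption made for illustration: the helper `write_hooks`, the hook file name `cuda_path.sh`, and the variable `_OLD_LD_LIBRARY_PATH` are hypothetical names, and the hooks assume a POSIX shell with `tr`, `grep` and `paste` available.

```python
from pathlib import Path

# Activation hook: save the current value, then drop entries under @PREFIX@.
ACTIVATE_TEMPLATE = r"""# Hide the system CUDA libraries while this environment is active.
export _OLD_LD_LIBRARY_PATH="${LD_LIBRARY_PATH:-}"
export LD_LIBRARY_PATH="$(printf '%s' "$_OLD_LD_LIBRARY_PATH" | tr ':' '\n' | grep -v '^@PREFIX@' | paste -sd ':' -)"
"""

# Deactivation hook: put the saved value back.
DEACTIVATE_TEMPLATE = r"""# Restore the value saved by the activation hook.
export LD_LIBRARY_PATH="$_OLD_LD_LIBRARY_PATH"
unset _OLD_LD_LIBRARY_PATH
"""


def write_hooks(env_prefix, cuda_prefix="/usr/local/cuda-11.2", name="cuda_path.sh"):
    """Write activate.d/deactivate.d hooks into a conda environment directory."""
    written = []
    for subdir, template in (
        ("activate.d", ACTIVATE_TEMPLATE),
        ("deactivate.d", DEACTIVATE_TEMPLATE),
    ):
        hook_dir = Path(env_prefix) / "etc" / "conda" / subdir
        hook_dir.mkdir(parents=True, exist_ok=True)
        hook = hook_dir / name
        hook.write_text(template.replace("@PREFIX@", cuda_prefix))
        written.append(hook)
    return written
```

With the tf1.15 environment active, `write_hooks(os.environ["CONDA_PREFIX"])` (after `import os`) would install both hooks; they take effect the next time the environment is activated. One known limitation of this sketch: if LD_LIBRARY_PATH was unset before activation, deactivation leaves it set to an empty string rather than unset.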