FloopCZ / tensorflow_cc

Build and install TensorFlow C++ API library.
MIT License

Inconsistent CUDA toolkit path: /usr vs /usr/lib #295

Open jimlloyd opened 2 years ago

jimlloyd commented 2 years ago

I believe this problem is probably the fault of the TensorFlow configure scripts rather than anything specific to tensorflow_cc, but I am hoping someone might have information on how to work around it.

The problem is that after running `cd tensorflow_cc && mkdir build && cd build && cmake .. && make`, the make step fails with this error:

Inconsistent CUDA toolkit path: /usr vs /usr/lib
Asking for detailed CUDA configuration...

I have been trying to install onto a fresh Ubuntu 20.04 or 22.04 system. I have tried several methods of installing CUDA and cuDNN, and every method has resulted in this error.

By the way, the first method I tried was the Lambda Stack on 22.04. It would be awesome if tensorflow_cc were compatible with Lambda Stack, but when I discovered this "Inconsistent CUDA toolkit path" problem I concluded that Lambda Stack had probably altered the paths at which CUDA and cuDNN were installed, so I switched to more standard ways of installing. I have since learned that I run into the same problem without Lambda Stack, so I am hopeful that once I figure out how to solve it I will be able to use Lambda Stack.

My most recent attempt was with 20.04. I installed:

  1. NVIDIA drivers, using the GUI "Additional Drivers" utility.
  2. CUDA 10.1, using `sudo apt install nvidia-cuda-toolkit`.
  3. cuDNN, by downloading cudnn-10.1-linux-x64-v8.0.5.39 from NVIDIA's website and following the instructions to untar it and copy the components into /usr/local/cuda/... (roughly the steps sketched below).
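
For concreteness, the cuDNN step was roughly the standard tar-file procedure from NVIDIA's install guide (the filename is the one I used; yours may differ depending on the cuDNN/CUDA versions):

```sh
# Unpack the cuDNN tarball downloaded from NVIDIA's website.
tar -xzvf cudnn-10.1-linux-x64-v8.0.5.39.tgz

# Copy the headers and libraries into the CUDA tree and make them readable.
sudo cp cuda/include/cudnn*.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
```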

FYI, I have of course spent time searching for information about this exact error, "Inconsistent CUDA toolkit path:". I know it is an exception raised from tensorflow/third_party/gpus/find_cuda_config.py. The problematic code is commented:

# XLA requires the toolkit path to find ptxas and libdevice.
# TODO(csigg): pass in both directories instead.

I have tried various hacks with the code, including simply commenting out the code that raises the exception. That allows the build to proceed, but a similar exception is eventually raised later, presumably when building XLA.
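
As far as I can tell, the script derives one candidate toolkit root from where it finds nvcc and another from where it finds the CUDA libraries/libdevice, and refuses to continue if they differ. Ubuntu's nvidia-cuda-toolkit package scatters these pieces across the filesystem, which presumably is why it reports /usr vs /usr/lib. A quick way to see the two locations on your own machine (the paths in the comments are just what the Ubuntu package typically produces; yours may differ):

```sh
# Where is nvcc? With the Ubuntu package it is typically /usr/bin/nvcc,
# so the toolkit root derived from it is /usr.
which nvcc

# Where is libdevice (needed by XLA, per the comment quoted above)?
# With the Ubuntu package it usually lives somewhere under /usr/lib,
# hence the second, conflicting root.
find /usr -name 'libdevice*.bc' 2>/dev/null
```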

Does anyone know how to work around this problem?

jimlloyd commented 2 years ago

Within a few minutes of writing this I did more searching and found this issue:

https://github.com/tensorflow/tensorflow/issues/40202

There is a comment: "A more reliable workaround is to install the cuda toolkit using Nvidia's .run file installer."

I'm going to try that.
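
For anyone else following along, the idea is roughly the following (the .run filename is only illustrative; download whichever runfile matches your target CUDA version, and keep using the distro-installed driver):

```sh
# Illustrative only: install just the toolkit from NVIDIA's runfile installer,
# skipping the driver (already installed via "Additional Drivers"), so that
# everything lands under a single root at /usr/local/cuda.
sudo sh cuda_10.1.243_418.87.00_linux.run --silent --toolkit

# Make sure the build sees that one consistent toolkit root.
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```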

FloopCZ commented 2 years ago

Yes, that may be the way, or you could take a look at the official NVIDIA CUDA Docker image on which we run the CI: https://gitlab.com/nvidia/container-images/cuda/blob/master/dist/11.7.0/ubuntu2204/runtime/Dockerfile
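
If you go the container route, something along these lines gives you an environment where CUDA and cuDNN are already in the expected places (the tag is just an example and changes over time, check Docker Hub; for building you want a -devel image rather than the runtime one linked above):

```sh
# Example only: start a CUDA + cuDNN development container
# (requires the NVIDIA Container Toolkit for --gpus) and build inside it.
docker run --rm -it --gpus all nvidia/cuda:11.7.0-cudnn8-devel-ubuntu22.04 bash
# Inside the container:
#   git clone https://github.com/FloopCZ/tensorflow_cc.git
#   cd tensorflow_cc && mkdir build && cd build && cmake .. && make
```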