isuruf opened this issue 4 years ago
I'm not sure that is what that means.
Inside the cudatoolkit package, there are things like libcudart.so.10.2. If a library links against that (as cupy does; sorry, Azure changed its UI so you will need to scroll), then it will be broken.
cc @kkraus14 @mike-wendt
The relevant text from the release is:
Also in this release the soname of the libraries has been modified to not include the minor toolkit version number. For example, the cuFFT library soname has changed from libcufft.so.10.1 to libcufft.so.10. This is done to facilitate any future library updates that do not include API breaking changes without the need to relink.
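To make the SONAME situation concrete, here is a minimal sketch (assuming readelf from binutils is on PATH; the library and extension paths below are illustrative and would need adjusting) that prints the SONAME a toolkit library advertises and the NEEDED entries a dependent extension actually asks the loader for:

```python
import re
import subprocess

def dynamic_entries(path, tag):
    """Return the values of a dynamic-section tag (e.g. SONAME, NEEDED) via readelf -d."""
    out = subprocess.run(["readelf", "-d", path], capture_output=True,
                         text=True, check=True).stdout
    return re.findall(rf"\({tag}\)\s+.*\[(.+)\]", out)

# What the toolkit libraries advertise (paths are illustrative):
print(dynamic_entries("/usr/local/cuda-10.2/lib64/libcufft.so.10", "SONAME"))     # e.g. ['libcufft.so.10']
print(dynamic_entries("/usr/local/cuda-10.2/lib64/libcudart.so.10.2", "SONAME"))  # e.g. ['libcudart.so.10.2']

# What a dependent extension (e.g. a cupy .so; path hypothetical) asks the loader for:
print(dynamic_entries("site-packages/cupy/core/core.cpython-37m-x86_64-linux-gnu.so", "NEEDED"))
```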
My experience is that although the soname only includes the major version, relinking is still needed when switching between minor versions.
My experience is that although the soname only includes the major version, relinking is still needed when switching between minor versions.
This is my experience too. Please don’t do this before we can confirm NVIDIA stabilizes its versioning scheme. Think about the nuisance of 10.1 Update 0/1/2 not long ago...
My experience is that although the soname only includes the major version, relinking is still needed when switching between minor versions.
I don't understand. Can you explain?
If I install PyTorch and TensorFlow built with cudatoolkit 10.0, then remove cudatoolkit 10.0 and install 10.1, both fail to run test scripts:
# python gpu_test.py
2019-12-20 18:13:14.543411: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-12-20 18:13:14.571836: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-20 18:13:14.572530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1660 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.845
pciBusID: 0000:01:00.0
2019-12-20 18:13:14.572616: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.572666: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.572699: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.572730: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.572761: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.572797: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.574955: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-12-20 18:13:14.574967: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2019-12-20 18:13:14.575226: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-12-20 18:13:14.597185: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2019-12-20 18:13:14.599534: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5598694d25c0 executing computations on platform Host. Devices:
2019-12-20 18:13:14.599599: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
2019-12-20 18:13:14.599790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-20 18:13:14.599836: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]
2019-12-20 18:13:14.662175: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-20 18:13:14.662776: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x559869535230 executing computations on platform CUDA. Devices:
2019-12-20 18:13:14.662791: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce GTX 1660 Ti, Compute Capability 7.5
Traceback (most recent call last):
File "gpu_test.py", line 5, in <module>
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
File "/opt/conda/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 227, in constant
allow_broadcast=True)
File "/opt/conda/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
t = convert_to_eager_tensor(value, ctx, dtype)
File "/opt/conda/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 96, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
RuntimeError: /job:localhost/replica:0/task:0/device:GPU:0 unknown device.
# conda activate pytorch
(pytorch) [root@chi9 io]# python pytorch_test.py
Traceback (most recent call last):
File "pytorch_test.py", line 3, in <module>
import torch
File "/opt/conda/envs/pytorch/lib/python3.7/site-packages/torch/__init__.py", line 81, in <module>
from torch._C import *
ImportError: libcudart.so.10.0: cannot open shared object file: No such file or directory
Both of these packages are trying to dlopen the major.minor libraries. Perhaps it is possible for these projects to switch to using the major-only libraries, but this is not how they are currently set up.
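For illustration only (a sketch; which soname actually resolves depends on the cudatoolkit build installed in the environment), the dlopen behavior boils down to:

```python
import ctypes

# A package built against CUDA 10.0 effectively asks the loader for the major.minor name:
try:
    ctypes.CDLL("libcudart.so.10.0")   # fails once cudatoolkit 10.0 has been removed
except OSError as exc:
    print("major.minor lookup failed:", exc)

# A package that only requested the major-version soname could be satisfied by any
# 10.x toolkit that ships a major-only SONAME (10.1 onwards for most libraries):
ctypes.CDLL("libcufft.so.10")
```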
That doesn't work. This only works from 10.1 onwards, as the link mentions. Try doing the same with 10.1 and 10.2.
Unfortunately, I do not have packages nor a machine configured to test 10.1 vs 10.2 at the moment.
That doesn't work. This only works from 10.1 onwards, as the link mentions. Try doing the same with 10.1 and 10.2.
@isuruf, as noted above it doesn't. libcudart includes the major and minor version in the SONAME.
Ah, then we should split cudatoolkit into two packages so that CUDA packages built with 10.1 will get the benefits of 10.2 where applicable.
Examining the runtime Docker images from Docker Hub, it appears that most of the libraries use a major-only SONAME, but three (libcudart.so, libnvrtc-builtins.so and libnvrtc.so.10.2) use major.minor.
These two groups could be made into two different conda packages so that the compatible libraries can be installed into a 10.1 environment. The existing cudatoolkit packages will likely need to have a run_constrained entry added to avoid clobbering.
Ah, then we should split cudatoolkit into two packages so that CUDA packages built with 10.1 will get the benefits of 10.2 where applicable.
That's an interesting idea. Could be reasonable. Have not personally explored this.
@kkraus14 @mike-wendt, do you have any thoughts on this idea?
I'm not opposed to the idea of turning cudatoolkit into a metapackage and breaking it up. What would be the proposed split of packages?
IIUC it would be split along the lines of which libraries include the CUDA minor version (like .1 or .2) in their SONAME or not. Though I suppose it could be more granular than that. Does this sound correct to you @isuruf, or did you have something else in mind?
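A rough sketch of how that grouping could be derived mechanically (assuming readelf is available and that the directory below points at the unpacked cudatoolkit libraries; both are assumptions, not part of any existing tooling):

```python
import re
import subprocess
from pathlib import Path

def soname(path):
    """Extract the SONAME of a shared library, or None if it has none."""
    out = subprocess.run(["readelf", "-d", str(path)], capture_output=True, text=True).stdout
    m = re.search(r"\(SONAME\)\s+.*\[(.+)\]", out)
    return m.group(1) if m else None

libdir = Path("/usr/local/cuda-10.2/lib64")   # hypothetical location of the toolkit libraries

major_only, major_minor = set(), set()
for lib in sorted(libdir.glob("lib*.so.*")):
    name = soname(lib)
    if name is None:
        continue
    version = name.split(".so.")[-1]          # "10" for libcufft.so.10, "10.2" for libcudart.so.10.2
    (major_only if "." not in version else major_minor).add(name)

print("major-only SONAME (could float within 10.x):", sorted(major_only))
print("major.minor SONAME (needs an exact pin):    ", sorted(major_minor))
```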
Sorry for a stupid question: if we split cudatoolkit, what would happen when we check the runtime versions via cudaRuntimeGetVersion and the individual libraries' APIs? Detecting versions at runtime correctly is important, at least for CuPy afaik.
I think cudaRuntimeGetVersion comes from the CUDA Runtime API (libcudart). So that would still be tracking the patch version.
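For reference, this is roughly what that query looks like from Python (a sketch; the exact soname to load depends on the installed toolkit):

```python
import ctypes

cudart = ctypes.CDLL("libcudart.so.10.2")     # soname is illustrative; match the installed toolkit
version = ctypes.c_int()
err = cudart.cudaRuntimeGetVersion(ctypes.byref(version))
assert err == 0, f"cudaRuntimeGetVersion returned error {err}"

# The value is encoded as 1000 * major + 10 * minor, e.g. 10020 for CUDA 10.2.
major, minor = version.value // 1000, (version.value % 1000) // 10
print(f"CUDA runtime {major}.{minor} (raw value {version.value})")
```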
Thanks @jakirkham. So it sounds like with the split we could have a 10.2 runtime coexist with, say, a 10.1 cuFFT or cuRAND.
Sorry I wasn't paying attention to @jjhelmus's original comment:
it appears that most of the libraries use a major-only SONAME, but three (libcudart.so, libnvrtc-builtins.so and libnvrtc.so.10.2) use major.minor. These two groups could be made into two different conda packages so that the compatible libraries can be installed into a 10.1 environment. The existing cudatoolkit packages will likely need to have a run_constrained entry added to avoid clobbering.
So would this work for applications depending on NVRTC, built with 10.1, and running with 10.2? I don't see any guarantee of API/ABI compatibility mentioned in NVRTC's documentation, so if its SONAMEs include major.minor, this is a bit worrying...
So would this work for applications depending on NVRTC, built with 10.1, and running with 10.2? I don't see any guarantee of API/ABI compatibility mentioned in NVRTC's documentation, so if its SONAMEs include major.minor, this is a bit worrying
Please read @jjhelmus's comment carefully. NVRTC (and CUDART) would be in the group of libraries pinned to major.minor, and the others would be in the group pinned to major only.
See https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-general-new-features
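As a footnote to the runtime-detection concern above: NVRTC exposes its own version query, nvrtcVersion, so a consumer such as CuPy can at least confirm which NVRTC it actually loaded. A minimal sketch (the soname below is illustrative):

```python
import ctypes

nvrtc = ctypes.CDLL("libnvrtc.so.10.2")       # NVRTC keeps major.minor in its SONAME
major, minor = ctypes.c_int(), ctypes.c_int()
err = nvrtc.nvrtcVersion(ctypes.byref(major), ctypes.byref(minor))
assert err == 0, f"nvrtcVersion returned error {err}"
print(f"NVRTC {major.value}.{minor.value}")
```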