Open JaheimLee opened 1 year ago
The build fails with the following error messages:
bazel-out/k8-dbg/bin/_solib_local/_U@local_Uorg_Utorch_S_S_Clibtorch___Ulib/libtorch_cuda.so: undefined reference to `cudaGraphDebugDotPrint@libcudart.so.11.0'
bazel-out/k8-dbg/bin/_solib_local/_U@local_Uorg_Utorch_S_S_Clibtorch___Ulib/libtorch_cuda.so: undefined reference to `cudaGraphRetainUserObject@libcudart.so.11.0'
bazel-out/k8-dbg/bin/_solib_local/_U@local_Uorg_Utorch_S_S_Clibtorch___Ulib/libtorch_cuda.so: undefined reference to `cudaUserObjectCreate@libcudart.so.11.0'
bazel-out/k8-dbg/bin/_solib_local/_U@local_Uorg_Utorch_S_S_Clibtorch___Ulib/libtorch_cuda.so: undefined reference to `cudaStreamUpdateCaptureDependencies@libcudart.so.11.0'
bazel-out/k8-dbg/bin/_solib_local/_U@local_Uorg_Utorch_S_S_Clibtorch___Ulib/libtorch_cuda.so: undefined reference to `cudaStreamGetCaptureInfo_v2@libcudart.so.11.0'
bazel-out/k8-dbg/bin/_solib_local/_U@local_Uorg_Utorch_S_S_Clibtorch___Ulib/libtorch_cuda.so: undefined reference to `cudaGraphInstantiateWithFlags@libcudart.so.11.0'
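To compare these missing symbols against a candidate libcudart, it may help to pull the symbol names out of the linker log first. The following is a minimal sketch, assuming GNU ld's "undefined reference to" message format as shown above; `parse_undefined_refs` and `SAMPLE_LOG` are hypothetical names used only for illustration.

```python
import re

# Matches GNU ld lines of the form:
#   <object>: undefined reference to `<symbol>@<version>'
# The "@<version>" part is optional.
UNDEFINED_REF = re.compile(
    r"^(?P<obj>.+?): undefined reference to `(?P<sym>[^@']+)(?:@(?P<ver>[^']+))?'"
)


def parse_undefined_refs(log: str):
    """Return a list of (symbol, version_tag) pairs found in a linker log."""
    refs = []
    for line in log.splitlines():
        match = UNDEFINED_REF.match(line.strip())
        if match:
            refs.append((match.group("sym"), match.group("ver")))
    return refs


# Two lines copied from the error log in this thread.
SAMPLE_LOG = """\
bazel-out/k8-dbg/bin/_solib_local/_U@local_Uorg_Utorch_S_S_Clibtorch___Ulib/libtorch_cuda.so: undefined reference to `cudaGraphDebugDotPrint@libcudart.so.11.0'
bazel-out/k8-dbg/bin/_solib_local/_U@local_Uorg_Utorch_S_S_Clibtorch___Ulib/libtorch_cuda.so: undefined reference to `cudaUserObjectCreate@libcudart.so.11.0'
"""

for symbol, version in parse_undefined_refs(SAMPLE_LOG):
    print(symbol, version)
```

The resulting symbol list can then be checked against whichever libcudart the link step actually picks up.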
It seems that you have CUDA 11.0, while PyTorch pre requires cu117.
Yeah. I have multiple CUDA packages installed, and I manually set CUDA_HOME to cuda-11.7, as shown above, both in my .bashrc and in your build_pytorch_blade.sh. Why does it still use CUDA 11.0?
Sorry, I may have missed something.
PyTorch usually ships the CUDA libraries inside its wheel. You can run ldd on your libtorch_cuda.so and you'll see something like libcudart-e409450e.so.11.0. This shared library usually sits alongside libtorch_cuda.so in the same directory, and the missing symbols should be in it.
You can double-check your PyTorch installation directory, and also run bazel clean --expunge to rule out Bazel-related issues.
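One way to verify that the bundled libcudart really exports the missing symbols is to list its dynamic symbols with `nm -D --defined-only` (GNU binutils) and compare. This is only a sketch, assuming `nm` is on the PATH; the helper names `parse_nm_output`, `exported_symbols`, and `missing_symbols` are hypothetical, and the library path is passed on the command line rather than hard-coded.

```python
import subprocess
import sys


def parse_nm_output(text: str) -> set:
    """Collect symbol names from `nm -D --defined-only` output.

    Each symbol line looks like "<address> <type> <name>[@version]";
    any "@version" or "@@version" suffix is dropped.
    """
    names = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank lines or file-name headers
        names.add(fields[-1].split("@", 1)[0])
    return names


def exported_symbols(lib_path: str) -> set:
    """Run nm on a shared library and return the names it defines."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", lib_path],
        check=True, capture_output=True, text=True,
    )
    return parse_nm_output(result.stdout)


def missing_symbols(lib_path: str, wanted) -> list:
    """Return the wanted symbols that the library does not export."""
    return sorted(set(wanted) - exported_symbols(lib_path))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Symbols taken from the linker errors in this thread.
    wanted = [
        "cudaGraphDebugDotPrint",
        "cudaGraphRetainUserObject",
        "cudaUserObjectCreate",
        "cudaStreamUpdateCaptureDependencies",
        "cudaStreamGetCaptureInfo_v2",
        "cudaGraphInstantiateWithFlags",
    ]
    print(missing_symbols(sys.argv[1], wanted))
```

Running it against the libcudart in torch/lib and against any other libcudart visible to the build should show which copy lacks the symbols.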
Here is the output
(base) lijinghui@idc-op-dev-gpu-001:/data/miniconda3/envs/ljh_BladeDISC/lib/python3.10/site-packages/torch/lib$ ldd libtorch_cuda.so
linux-vdso.so.1 (0x00007fff38354000)
libc10_cuda.so => /data/miniconda3/envs/ljh_BladeDISC/lib/python3.10/site-packages/torch/lib/./libc10_cuda.so (0x00007f0247b7e000)
libcudart-e409450e.so.11.0 => /data/miniconda3/envs/ljh_BladeDISC/lib/python3.10/site-packages/torch/lib/./libcudart-e409450e.so.11.0 (0x00007f020c7dc000)
libnvToolsExt-847d78f2.so.1 => /data/miniconda3/envs/ljh_BladeDISC/lib/python3.10/site-packages/torch/lib/./libnvToolsExt-847d78f2.so.1 (0x00007f020c5d1000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f020c3b2000)
libc10.so => /data/miniconda3/envs/ljh_BladeDISC/lib/python3.10/site-packages/torch/lib/./libc10.so (0x00007f0247aad000)
libtorch_cpu.so => /data/miniconda3/envs/ljh_BladeDISC/lib/python3.10/site-packages/torch/lib/./libtorch_cpu.so (0x00007f01f325f000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f01f2ec1000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f01f2cbd000)
libcublas.so.11 => /home/lijinghui/cuda/cuda-11.7/lib64/libcublas.so.11 (0x00007f01e9a5f000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f01e9857000)
libcudnn.so.8 => /home/lijinghui/cuda/cuda-11.7/lib64/libcudnn.so.8 (0x00007f01e9631000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f01e92a8000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f01e9090000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f01e8c9f000)
/lib64/ld-linux-x86-64.so.2 (0x00007f0247a67000)
libgomp-a34b3233.so.1 => /data/miniconda3/envs/ljh_BladeDISC/lib/python3.10/site-packages/torch/lib/./libgomp-a34b3233.so.1 (0x00007f01e8a75000)
libcublasLt.so.11 => /home/lijinghui/cuda/cuda-11.7/lib64/libcublasLt.so.11 (0x00007f01d4ad4000)
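In the output above, libcudart-e409450e.so.11.0 resolves inside torch/lib, while libcublas, libcublasLt, and libcudnn resolve from the cuda-11.7 directory. A small script can make that split easier to see on longer listings. This is a rough sketch that assumes the usual `name => path (address)` ldd line format; `parse_ldd`, `libs_under`, and `SAMPLE` are hypothetical names.

```python
import os


def parse_ldd(output: str) -> dict:
    """Map each library name to its resolved path (None if not found)."""
    resolved = {}
    for line in output.splitlines():
        line = line.strip()
        if "=>" not in line:
            continue  # linux-vdso and the dynamic loader have no "=>"
        name, _, rest = line.partition("=>")
        rest = rest.strip()
        if rest.startswith("not found"):
            resolved[name.strip()] = None
            continue
        # Drop the trailing load address "(0x...)" and normalise "/./".
        path = rest.rsplit(" (", 1)[0].strip()
        resolved[name.strip()] = os.path.normpath(path)
    return resolved


def libs_under(resolved: dict, prefix: str) -> list:
    """Return the names of libraries whose resolved path lies under prefix."""
    root = prefix.rstrip("/") + "/"
    return sorted(n for n, p in resolved.items() if p and p.startswith(root))


# Three lines copied from the ldd output in this thread.
SAMPLE = """\
linux-vdso.so.1 (0x00007fff38354000)
libcudart-e409450e.so.11.0 => /data/miniconda3/envs/ljh_BladeDISC/lib/python3.10/site-packages/torch/lib/./libcudart-e409450e.so.11.0 (0x00007f020c7dc000)
libcublas.so.11 => /home/lijinghui/cuda/cuda-11.7/lib64/libcublas.so.11 (0x00007f01e9a5f000)
"""

resolved = parse_ldd(SAMPLE)
print(libs_under(resolved, "/home/lijinghui/cuda/cuda-11.7"))
```

Note that ldd reports what the dynamic loader would pick at run time; the libraries Bazel uses at link time may differ.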
So what should I do next? For example, should I copy libcudart.so.11.7.99 from the cuda-11.7 directory to the PyTorch installation directory and rename it to libcudart-e409450e.so.11.0?
It didn't work. Maybe I need to build PyTorch from source.
Hi! I have a problem when compiling pytorch_blade using pre+cu117. Here is the log: