Closed njzjz closed 3 years ago
Hey @njzjz, thank you for reporting the problem! This is quite strange, because what conda-forge does with that package is unpack the cuDNN archive from NVIDIA and move the `.so` files to `$PREFIX/lib/`.
I don't have access to a ppc64le machine. Could you please verify whether the problem is present if you use the same archive that conda-forge uses:
https://developer.download.nvidia.com/compute/redist/cudnn/v8.0.5/cudnn-10.2-linux-ppc64le-v8.0.5.39.tgz
Note that by downloading from above link and using the cuDNN, you accept the terms and conditions of the NVIDIA cuDNN EULA.
> verify whether the problem is present if you use the same archive as conda-forge uses:
There's no problem using this archive. Also, using `readelf`, I found the two libraries are different. I think the library was modified during conda-build.
Library from NVIDIA:

```
$ readelf -l libcudnn.so.8.0.5

Elf file type is DYN (Shared object file)
Entry point 0x0
There are 5 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x000000000002e6d3 0x000000000002e6d3  R E    10000
  LOAD           0x000000000002e6d8 0x000000000003e6d8 0x000000000003e6d8
                 0x00000000000002f0 0x0000000000001638  RW     10000
  DYNAMIC        0x000000000002e700 0x000000000003e700 0x000000000003e700
                 0x0000000000000250 0x0000000000000250  RW     8
  GNU_EH_FRAME   0x00000000000272d0 0x00000000000272d0 0x00000000000272d0
                 0x0000000000000874 0x0000000000000874  R      4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     10

 Section to Segment mapping:
  Segment Sections...
   00     .hash .dynsym .dynstr .gnu.version .gnu.version_d .gnu.version_r .rela.dyn .rela.plt .init .text .fini .rodata .eh_frame_hdr .eh_frame .gcc_except_table
   01     .ctors .dtors .jcr .dynamic .data .got .plt .bss
   02     .dynamic
   03     .eh_frame_hdr
   04
```
Library from conda-forge:

```
$ readelf -l libcudnn.so.8.0.5

Elf file type is DYN (Shared object file)
Entry point 0x0
There are 6 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x000000000002e6d3 0x000000000002e6d3  R E    10000
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     10
  GNU_EH_FRAME   0x00000000000272d0 0x00000000000272d0 0x00000000000272d0
                 0x0000000000000874 0x0000000000000874  R      4
  LOAD           0x000000000002e6d8 0x000000000003e6d8 0x000000000003e6d8
                 0x00000000000002f0 0x0000000000001638  RW     10000
  DYNAMIC        0x000000000002e700 0x000000000003e700 0x000000000003e700
                 0x0000000000000250 0x0000000000000250  RW     8
  LOAD           0x0000000000030000 0x0000000000040000 0x0000000000040000
                 0x0000000000002c40 0x0000000000002c40  RW     1000

 Section to Segment mapping:
  Segment Sections...
   00     .dynsym .gnu.version .gnu.version_d .gnu.version_r .rela.dyn .rela.plt .init .text .fini .rodata .eh_frame_hdr .eh_frame .gcc_except_table
   01
   02     .eh_frame_hdr
   03     .ctors .dtors .jcr .dynamic .data .got .plt .bss
   04     .dynamic
   05     .dynstr .hash
```
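For what it's worth, the diff above already explains the error message: the conda-forge build has an extra PT_LOAD segment with `Align` 0x1000, while glibc requires each PT_LOAD segment's `p_align` to be a multiple of the runtime page size — 64 KiB on ppc64le, but typically only 4 KiB on x86_64, which is why the same file loads fine there. A minimal sketch of that rule (mirroring, not copying, the check in glibc's `elf/dl-load.c`):

```python
# Page sizes differ across architectures; this is why the same library
# loads fine on x86_64 but fails on ppc64le.
PAGE_PPC64LE = 0x10000  # 64 KiB pages
PAGE_X86_64 = 0x1000    # 4 KiB pages

def pt_load_align_ok(p_align, page_size):
    """glibc-style check: a PT_LOAD segment's p_align must be a
    multiple of the runtime page size (assumed a power of two)."""
    return (p_align & (page_size - 1)) == 0

# Align values of the PT_LOAD segments from the readelf dumps above:
nvidia = [0x10000, 0x10000]
conda_forge = [0x10000, 0x10000, 0x1000]  # extra segment added during packaging

print(all(pt_load_align_ok(a, PAGE_PPC64LE) for a in nvidia))       # True
print(all(pt_load_align_ok(a, PAGE_PPC64LE) for a in conda_forge))  # False
print(all(pt_load_align_ok(a, PAGE_X86_64) for a in conda_forge))   # True
```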
@njzjz How was your TensorFlow built and installed? Generally, conda-build requires the entire dependency chain to be handled through Conda, so if TF was built against NVIDIA's cuDNN but is linked against conda-forge's cuDNN at runtime (via `dlopen` or other means), it would likely not work. All binary files are expected to be patched with `patchelf` by conda-build to make them relocatable. One workaround is to build TF in your conda env against conda-forge's cuDNN. This should work.
I think TensorFlow was built against conda-forge's cuDNN, since I didn't download NVIDIA's when I built TensorFlow...
Hi @njzjz, you mentioned you built on Longhorn. Could you check `module list`? Maybe your TF was built against the TACC-provided CUDA software stack, including cuDNN?
I can confirm that TACC-provided CUDA doesn't have any version of cuDNN.
```
(base) c002-010.longhorn(1014)$ ls /usr/local/cuda-10.2/lib64
libaccinj64.so libcuinj64.so.10.2.89 libcusparse_static.a libnppidei.so libnppist_static.a libnvgraph_static.a
libaccinj64.so.10.2 libculibos.a liblapack_static.a libnppidei.so.10 libnppisu.so libnvjpeg.so
libaccinj64.so.10.2.89 libcupti.so libmetis_static.a libnppidei.so.10.2.1.89 libnppisu.so.10 libnvjpeg.so.10
libcudadevrt.a libcupti.so.10.2 libnppc.so libnppidei_static.a libnppisu.so.10.2.1.89 libnvjpeg.so.10.3.1.89
libcudart.so libcupti.so.10.2.75 libnppc.so.10 libnppif.so libnppisu_static.a libnvjpeg_static.a
libcudart.so.10.2 libcurand.so libnppc.so.10.2.1.89 libnppif.so.10 libnppitc.so libnvperf_host.so
libcudart.so.10.2.89 libcurand.so.10 libnppc_static.a libnppif.so.10.2.1.89 libnppitc.so.10 libnvperf_target.so
libcudart_static.a libcurand.so.10.1.2.89 libnppial.so libnppif_static.a libnppitc.so.10.2.1.89 libnvrtc-builtins.so
libcufft.so libcurand_static.a libnppial.so.10 libnppig.so libnppitc_static.a libnvrtc-builtins.so.10.2
libcufft.so.10 libcusolver.so libnppial.so.10.2.1.89 libnppig.so.10 libnpps.so libnvrtc-builtins.so.10.2.89
libcufft.so.10.1.2.89 libcusolver.so.10 libnppial_static.a libnppig.so.10.2.1.89 libnpps.so.10 libnvrtc.so
libcufft_static.a libcusolver.so.10.3.0.89 libnppicc.so libnppig_static.a libnpps.so.10.2.1.89 libnvrtc.so.10.2
libcufft_static_nocallback.a libcusolverMg.so libnppicc.so.10 libnppim.so libnpps_static.a libnvrtc.so.10.2.89
libcufftw.so libcusolverMg.so.10 libnppicc.so.10.2.1.89 libnppim.so.10 libnvToolsExt.so stubs
libcufftw.so.10 libcusolverMg.so.10.3.0.89 libnppicc_static.a libnppim.so.10.2.1.89 libnvToolsExt.so.1
libcufftw.so.10.1.2.89 libcusolver_static.a libnppicom.so libnppim_static.a libnvToolsExt.so.1.0.0
libcufftw_static.a libcusparse.so libnppicom.so.10 libnppist.so libnvgraph.so
libcuinj64.so libcusparse.so.10 libnppicom.so.10.2.1.89 libnppist.so.10 libnvgraph.so.10
libcuinj64.so.10.2 libcusparse.so.10.3.1.89 libnppicom_static.a libnppist.so.10.2.1.89 libnvgraph.so.10.2.89
```
Here are my build logs: conda-build.e85657 conda-build.o85657 build.sh conda_build_config.yaml meta.yaml
Hi @njzjz, could you inspect the output of `module list`? I believe TACC has a module system that loads the compiler and software for you. Looking at `/usr/local/cuda-10.2/lib64` likely does not help, because cuDNN ships separately from the CUDA Toolkit, so it is usually not placed under `/usr/local/cuda-XXX/lib64`. I am not a TACC user; I learned their software management here: https://portal.tacc.utexas.edu/software/tensorflow.
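One quick way to probe for a system-wide cuDNN, as a sketch: `ctypes.util.find_library` consults the same lookup machinery (the ldconfig cache, the compiler) that the system itself uses, so the results depend on the machine and a `None` result is consistent with no system-wide copy being visible:

```python
from ctypes import util

# Sanity check that the lookup machinery works at all on this machine:
# on Linux this usually resolves to something like "libc.so.6".
print(util.find_library("c"))

# Resolves to e.g. "libcudnn.so.8" if a system-wide cuDNN is visible to
# the loader, or None if not (likely None outside a CUDA environment).
print(util.find_library("cudnn"))
```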
```
(base) login2.longhorn(1001)$ module list

Currently Loaded Modules:
  1) xl/16.1.1             3) git/2.24.1      5) cmake/3.16.1   7) TACC
  2) spectrum_mpi/10.3.0   4) autotools/1.2   6) xalt/2.10.21
```
and here are all available modules:
```
(base) login2.longhorn(1002)$ module av

----------------- /opt/apps/xl16/spectrum_mpi10_3/modulefiles ------------------
   petsc/3.13-complex           petsc/3.13-i64debug
   petsc/3.13-complexdebug      petsc/3.13-i64
   petsc/3.13-complexi64debug   petsc/3.13-nohdf5
   petsc/3.13-complexi64        petsc/3.13-single
   petsc/3.13-cuda              petsc/3.13-singledebug
   petsc/3.13-cudadebug         petsc/3.13-uni
   petsc/3.13-debug             petsc/3.13-unidebug
   petsc/3.13-hyprefei          petsc/3.13 (D)

-------------------------- /opt/apps/xl16/modulefiles --------------------------
   hdf5/1.10.4           netcdf/4.7.4
   mvapich2-gdr/2.3.4    spectrum_mpi/10.3.0 (L)

---------------------------- /opt/apps/modulefiles -----------------------------
   TACC            (L)      python3/powerai_1.6.1
   autotools/1.2   (L)      python3/powerai_1.6.2
   cmake/3.16.1    (L)      python3/powerai_1.7.0 (D)
   conda/4.8.3              pytorch-py2/1.0.1
   cuda/10.0       (g)      pytorch-py2/1.1.0 (D)
   cuda/10.1       (g)      pytorch-py3/1.0.1
   cuda/10.2       (g,D)    pytorch-py3/1.1.0
   gcc/4.9.3                pytorch-py3/1.2.0
   gcc/6.3.0                pytorch-py3/1.3.1 (D)
   gcc/7.3.0       (D)      sanitytool/1.5
   gcc/9.1.0                settarg
   git/2.24.1      (L)      tacc-singularity/3.5.3
   idev/1.5.7               tacc_tips/0.5
   launcher_gpu/1.1         tensorflow-py2/1.13.1
   lmod                     tensorflow-py2/1.14.0 (D)
   pgi/19.10.0              tensorflow-py3/1.13.1
   pgi/20.7.0      (D)      tensorflow-py3/1.14.0
   pylauncher/3.1           tensorflow-py3/1.15.2
   python2/powerai_1.6.0    tensorflow-py3/2.1.0 (D)
   python2/powerai_1.6.1 (D)   xalt/2.10.21 (L)
   python3/powerai_1.6.0    xl/16.1.1 (L)

  Where:
   D:  Default Module
   L:  Module is loaded
   g:  built for GPU

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching
any of the "keys".
```
I don't find cudnn, though.
This is the module file:
```
(base) login2.longhorn(1041)$ cat 10.2.lua
local help_message = [[
The NVIDIA CUDA Toolkit provides a comprehensive development environment for C
and C++ developers building GPU-accelerated applications. The CUDA Toolkit
includes a compiler for NVIDIA GPUs, math libraries, and tools for debugging
and optimizing the performance of your applications. You will also find
programming guides, user manuals, API reference, and other documentation to
help you get started quickly accelerating your application with GPUs.

This module defines the environmental variables TACC_CUDA_BIN,
TACC_CUDA_LIB, TACC_CUDA_INC, TACC_CUDA_DOC, and TACC_CUDA_DIR
for the location of the cuda binaries, libaries, includes,
documentation, and main root directory respectively.

The location of the:
1.) binary files is added to PATH
2.) libraries is added to LD_LIBRARY_PATH
3.) header files is added to INCLUDE
4.) man pages is added to MANPATH

Version 10.2
]]

help(help_message,"\n")

whatis("Name: cuda")
whatis("Version: 10.2")
whatis("Category: Compiler, Runtime Support")
whatis("Description: NVIDIA CUDA Toolkit for Linux")
whatis("URL: http://www.nvidia.com/cuda")

-- Export environmental variables
local cuda_dir="/usr/local/cuda-10.2"
local cuda_bin=pathJoin(cuda_dir,"bin")
local cuda_lib=pathJoin(cuda_dir,"lib64")
local cuda_inc=pathJoin(cuda_dir,"include")
local cuda_doc=pathJoin(cuda_dir,"doc")

setenv("TACC_CUDA_DIR",cuda_dir)
setenv("TACC_CUDA_BIN",cuda_bin)
setenv("TACC_CUDA_LIB",cuda_lib)
setenv("TACC_CUDA_INC",cuda_inc)
setenv("TACC_CUDA_DOC",cuda_doc)

prepend_path("PATH"           ,cuda_bin)
prepend_path("LD_LIBRARY_PATH",cuda_lib)
prepend_path("INCLUDE"        ,cuda_inc)
prepend_path("MANPATH"        ,pathJoin(cuda_doc,"man"))

-- Adding to MODULEPATH for CUDA-dependent packages
prepend_path("MODULEPATH"     ,pathJoin("/opt/apps","cuda10_2","modulefiles"))

add_property("arch","gpu")
```
Thank you, @njzjz. As part of the debugging effort, would you mind testing cuDNN for us in another way? On your ppc64le cluster (longhorn) install CuPy from conda-forge (EDIT: added a missing command):
```
conda create -n my_test_env -c conda-forge python cupy cudnn
conda activate my_test_env
python -c "import cupy; from cupy import cudnn"
```
and see if you encounter any error.
Also, share with us `conda info` and `conda list` for this env.
Same error.
```
(/scratch/07349/njzjz/my_test_env) login2.longhorn(1005)$ python -c "import cupy; from cupy import cudnn"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cupy/cudnn.pyx", line 1, in init cupy.cudnn
ImportError: libcudnn.so.8: ELF load command alignment not page-aligned

(/scratch/07349/njzjz/my_test_env) login2.longhorn(1006)$ conda info

     active environment : /scratch/07349/njzjz/my_test_env
    active env location : /scratch/07349/njzjz/my_test_env
            shell level : 2
       user config file : /home/07349/njzjz/.condarc
 populated config files : /home/07349/njzjz/.condarc
          conda version : 4.10.3
    conda-build version : 3.21.4
         python version : 3.8.8.final.0
       virtual packages : __cuda=10.2=0
                          __linux=4.14.0=0
                          __glibc=2.17=0
                          __unix=0=0
                          __archspec=1=ppc64le
       base environment : /home/07349/njzjz/anaconda3  (writable)
      conda av data dir : /home/07349/njzjz/anaconda3/etc/conda
  conda av metadata url : None
           channel URLs : https://repo.anaconda.com/pkgs/main/linux-ppc64le
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-ppc64le
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /scratch/07349/njzjz/conda-pkgs
       envs directories : /home/07349/njzjz/anaconda3/envs
                          /home/07349/njzjz/.conda/envs
               platform : linux-ppc64le
             user-agent : conda/4.10.3 requests/2.25.1 CPython/3.8.8 Linux/4.14.0-115.10.1.el7a.ppc64le rhel/7.6 glibc/2.17
                UID:GID : 866484:822414
             netrc file : None
           offline mode : False

(/scratch/07349/njzjz/my_test_env) login2.longhorn(1007)$ conda list
# packages in environment at /scratch/07349/njzjz/my_test_env:
#
# Name                    Version       Build                 Channel
_libgcc_mutex             0.1           conda_forge           conda-forge
_openmp_mutex             4.5           1_gnu                 conda-forge
ca-certificates           2021.5.30     h1084571_0            conda-forge
certifi                   2021.5.30     py39hc1b9086_0        conda-forge
cudatoolkit               10.2.89       h455192d_8            conda-forge
cudnn                     8.0.5.39      h69e801d_2            conda-forge
cupy                      9.3.0         py39h194685b_0        conda-forge
fastrlock                 0.6           py39had50986_1        conda-forge
ld_impl_linux-ppc64le     2.36.1        ha35d02b_2            conda-forge
libblas                   3.9.0         10_openblas           conda-forge
libcblas                  3.9.0         10_openblas           conda-forge
libffi                    3.3           hea85c5d_2            conda-forge
libgcc-ng                 11.1.0        h16e2c27_8            conda-forge
libgfortran-ng            11.1.0        hfdc3801_8            conda-forge
libgfortran5              11.1.0        h24cf76c_8            conda-forge
libgomp                   11.1.0        h16e2c27_8            conda-forge
liblapack                 3.9.0         10_openblas           conda-forge
libopenblas               0.3.17        pthreads_h486567c_1   conda-forge
libstdcxx-ng              11.1.0        h8186cfa_8            conda-forge
ncurses                   6.2           hea85c5d_4            conda-forge
numpy                     1.21.1        py39he089932_0        conda-forge
openssl                   1.1.1k        h4e0d66e_0            conda-forge
pip                       21.2.3        pyhd8ed1ab_0          conda-forge
python                    3.9.6         h82ac395_1_cpython    conda-forge
python_abi                3.9           2_cp39                conda-forge
readline                  8.1           h5c45dff_0            conda-forge
setuptools                49.6.0        py39hc1b9086_3        conda-forge
sqlite                    3.36.0        h4e2196e_0            conda-forge
tk                        8.6.10        h38e1d09_1            conda-forge
tzdata                    2021a         he74cb21_1            conda-forge
wheel                     0.36.2        pyhd3deb0d_0          conda-forge
xz                        5.2.5         h6eb9509_1            conda-forge
zlib                      1.2.11        h6eb9509_1010         conda-forge
```
Thanks, @njzjz. Let's check what shared libraries are loaded:
```
LD_DEBUG=libs python -c "from cupy import cudnn" > debug.out 2>&1
```

Could you please share the content of `debug.out`?
Thanks, @njzjz. I have a theory that I'd like to test. Will ping you when it's ready!
Hi @njzjz, I have looked into it, and my theory is that `patchelf` was buggy. To verify this, could you kindly help me do two more tests (which I hope are enough)? The first one is to execute the script below in the conda env in which the broken cudnn is installed and show me the output:
```python
from ctypes import cdll
import os
import sys
from subprocess import check_output

def check_binary(binary):
    cmd = [sys.executable, '-c', f'from ctypes import cdll; cdll.LoadLibrary("{binary}")']
    return check_output(cmd, env=os.environ)

print(check_binary(f"{os.environ['CONDA_PREFIX']}/lib/libcudnn.so"))
```
I expect this to fail. If this is confirmed I'll prepare the 2nd test, thanks!
By the way, @njzjz, as part of the ELF debugging, do you mind testing another package for us?
```
conda create -n my_test_env2 -c conda-forge python cupy cutensor  # or just install cutensor to the previous env you set up
conda activate my_test_env2
python -c "import cupy; from cupy import cutensor"
```
cuTENSOR is also available on ppc64le, and I am now worried that it suffers from the same issue... Thanks!
```
(/scratch/07349/njzjz/my_test_env) login2.longhorn(1008)$ python test.py
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/scratch/07349/njzjz/my_test_env/lib/python3.9/ctypes/__init__.py", line 460, in LoadLibrary
    return self._dlltype(name)
  File "/scratch/07349/njzjz/my_test_env/lib/python3.9/ctypes/__init__.py", line 382, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /scratch/07349/njzjz/my_test_env/lib/libcudnn.so: ELF load command alignment not page-aligned

Traceback (most recent call last):
  File "/scratch/07349/njzjz/test.py", line 11, in <module>
    print(check_binary(f"{os.environ['CONDA_PREFIX']}/lib/libcudnn.so"))
  File "/scratch/07349/njzjz/test.py", line 9, in check_binary
    return check_output(cmd, env=os.environ)
  File "/scratch/07349/njzjz/my_test_env/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/scratch/07349/njzjz/my_test_env/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/scratch/07349/njzjz/my_test_env/bin/python', '-c', 'from ctypes import cdll; cdll.LoadLibrary("/scratch/07349/njzjz/my_test_env/lib/libcudnn.so")']' returned non-zero exit status 1.
```

And for cuTENSOR:

```
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cupy/cutensor.pyx", line 1, in init cupy.cutensor
ImportError: libcutensor.so.1: ELF load command alignment not page-aligned
```
Thanks a lot, @njzjz! Either https://github.com/conda-forge/patchelf-feedstock/pull/20 or #32 will fix this issue, but the former is out of my hand so I can't give you an ETA (though I hope soon).
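Once the rebuilt packages land, the `Align` values can also be checked programmatically rather than by eyeballing `readelf` output. Below is a minimal illustration-only parser for little-endian ELF64 files (no error handling; prefer `readelf` or pyelftools for real use) that lists `p_align` for every PT_LOAD segment:

```python
import struct
import sys

PT_LOAD = 1

def load_aligns(path):
    """Return the p_align of every PT_LOAD segment in a little-endian
    ELF64 file. Minimal parser for illustration only."""
    with open(path, "rb") as f:
        ident = f.read(16)
        assert ident[:4] == b"\x7fELF" and ident[4] == 2, "not a 64-bit ELF"
        # Fields after e_ident: type, machine, version, entry, phoff, shoff,
        # flags, ehsize, phentsize, phnum, shentsize, shnum, shstrndx
        hdr = struct.unpack("<HHIQQQIHHHHHH", f.read(48))
        e_phoff, e_phentsize, e_phnum = hdr[4], hdr[8], hdr[9]
        aligns = []
        for i in range(e_phnum):
            f.seek(e_phoff + i * e_phentsize)
            p_type, _, _, _, _, _, _, p_align = struct.unpack("<IIQQQQQQ", f.read(56))
            if p_type == PT_LOAD:
                aligns.append(p_align)
        return aligns

if __name__ == "__main__":
    # Example: inspect the running Python interpreter; on the cluster,
    # substitute e.g. $CONDA_PREFIX/lib/libcudnn.so.
    print([hex(a) for a in load_aligns(sys.executable)])
```

On a fixed package, every value printed for `libcudnn.so` should be a multiple of the 64 KiB ppc64le page size.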
Hi @njzjz Thank you for your help and patience. The fix is finally done. In about an hour or two, both cuDNN and cuTENSOR will be available for installation. Could you kindly redo the test for me later today please?
```
conda create -n my_env3 -c conda-forge python cupy cutensor cudnn
conda activate my_env3
python -c "from cupy import cudnn; from cupy import cutensor;"
```
@leofang
```
(/scratch/07349/njzjz/my_test_env3) login2.longhorn(1006)$ python -c "from cupy import cudnn; from cupy import cutensor;"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cupy/cutensor.pyx", line 1, in init cupy.cutensor
ImportError: /scratch/07349/njzjz/my_test_env3/lib/python3.9/site-packages/cupy_backends/cuda/libs/../../../../../libcutensor.so.1: undefined symbol: cudaMemsetAsync
```
cuTENSOR has an undefined symbol. However, I think cuDNN has no problem now. So I'll close this issue.
Thanks @njzjz. Yes, let's move the discussion to https://github.com/conda-forge/cutensor-feedstock/issues/16.
Issue: When I built, installed, and ran TensorFlow along with conda-forge's cuDNN 8.0.5 on a linux-ppc64le supercomputer (Longhorn), I got the following error:

After I downloaded the same version from NVIDIA's official website and overrode conda-forge's `libcudnn.so.8.0.5`, it worked. So I believe this is an issue on conda-forge's side.

Environment (`conda list`):

Details about `conda` and system (`conda info`):