Closed njzjz closed 3 years ago
Hey @njzjz, thank you for reporting the problem! This is quite strange, because what conda-forge does with that package is unpack the cuDNN archive from NVIDIA and move the `.so` files to `$PREFIX/lib/`.
I don't have access to a ppc64le machine. Could you please verify whether the problem is present if you use the same archive that conda-forge uses:
https://developer.download.nvidia.com/compute/redist/cudnn/v8.0.5/cudnn-10.2-linux-ppc64le-v8.0.5.39.tgz
Note that by downloading from above link and using the cuDNN, you accept the terms and conditions of the NVIDIA cuDNN EULA.
> verify whether the problem is present if you use the same archive as conda-forge uses:
There's no problem using this archive. Also, using `readelf`, I found the two libraries are different. I think the library was modified during conda-build.
Library from NVIDIA:

```
$ readelf -l libcudnn.so.8.0.5

Elf file type is DYN (Shared object file)
Entry point 0x0
There are 5 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x000000000002e6d3 0x000000000002e6d3  R E    10000
  LOAD           0x000000000002e6d8 0x000000000003e6d8 0x000000000003e6d8
                 0x00000000000002f0 0x0000000000001638  RW     10000
  DYNAMIC        0x000000000002e700 0x000000000003e700 0x000000000003e700
                 0x0000000000000250 0x0000000000000250  RW     8
  GNU_EH_FRAME   0x00000000000272d0 0x00000000000272d0 0x00000000000272d0
                 0x0000000000000874 0x0000000000000874  R      4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     10

 Section to Segment mapping:
  Segment Sections...
   00     .hash .dynsym .dynstr .gnu.version .gnu.version_d .gnu.version_r .rela.dyn .rela.plt .init .text .fini .rodata .eh_frame_hdr .eh_frame .gcc_except_table
   01     .ctors .dtors .jcr .dynamic .data .got .plt .bss
   02     .dynamic
   03     .eh_frame_hdr
   04
```
Library from conda-forge:

```
$ readelf -l libcudnn.so.8.0.5

Elf file type is DYN (Shared object file)
Entry point 0x0
There are 6 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x000000000002e6d3 0x000000000002e6d3  R E    10000
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     10
  GNU_EH_FRAME   0x00000000000272d0 0x00000000000272d0 0x00000000000272d0
                 0x0000000000000874 0x0000000000000874  R      4
  LOAD           0x000000000002e6d8 0x000000000003e6d8 0x000000000003e6d8
                 0x00000000000002f0 0x0000000000001638  RW     10000
  DYNAMIC        0x000000000002e700 0x000000000003e700 0x000000000003e700
                 0x0000000000000250 0x0000000000000250  RW     8
  LOAD           0x0000000000030000 0x0000000000040000 0x0000000000040000
                 0x0000000000002c40 0x0000000000002c40  RW     1000

 Section to Segment mapping:
  Segment Sections...
   00     .dynsym .gnu.version .gnu.version_d .gnu.version_r .rela.dyn .rela.plt .init .text .fini .rodata .eh_frame_hdr .eh_frame .gcc_except_table
   01
   02     .eh_frame_hdr
   03     .ctors .dtors .jcr .dynamic .data .got .plt .bss
   04     .dynamic
   05     .dynstr .hash
```
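For what it's worth, the diff above already explains the error message: the conda-forge build has an extra PT_LOAD segment with `Align` 0x1000, while glibc requires each PT_LOAD segment's `p_align` to be a multiple of the runtime page size — 64 KiB on ppc64le, but typically only 4 KiB on x86_64, which is why the same file loads fine there. A minimal sketch of that rule (mirroring, not copying, the check in glibc's `elf/dl-load.c`):

```python
# Page sizes differ across architectures; this is why the same library
# loads fine on x86_64 but fails on ppc64le.
PAGE_PPC64LE = 0x10000  # 64 KiB pages
PAGE_X86_64 = 0x1000    # 4 KiB pages

def pt_load_align_ok(p_align, page_size):
    """glibc-style check: a PT_LOAD segment's p_align must be a
    multiple of the runtime page size (assumed a power of two)."""
    return (p_align & (page_size - 1)) == 0

# Align values of the PT_LOAD segments from the readelf dumps above:
nvidia = [0x10000, 0x10000]
conda_forge = [0x10000, 0x10000, 0x1000]  # extra segment added during packaging

print(all(pt_load_align_ok(a, PAGE_PPC64LE) for a in nvidia))       # True
print(all(pt_load_align_ok(a, PAGE_PPC64LE) for a in conda_forge))  # False
print(all(pt_load_align_ok(a, PAGE_X86_64) for a in conda_forge))   # True
```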
@njzjz How was your TensorFlow built and installed? Generally, conda-build requires the entire dependency chain to be handled through Conda, so if TF was built against NVIDIA's cuDNN but is linked against conda-forge's cuDNN at runtime (via `dlopen` or other means), it would likely not work. All binary files are expected to be patched with `patchelf` by conda-build to make them relocatable. One workaround is to build TF in your conda env against conda-forge's cuDNN. This should work.
I think TensorFlow was built against conda-forge's cuDNN, since I didn't download NVIDIA's when I built TensorFlow...
Hi @njzjz, you mentioned you built on Longhorn. Could you check `module list`? Maybe your TF was built against the TACC-provided CUDA software stack, including cuDNN?
I can confirm that TACC-provided CUDA doesn't have any version of cuDNN.
```
(base) c002-010.longhorn(1014)$ ls /usr/local/cuda-10.2/lib64
libaccinj64.so libcuinj64.so.10.2.89 libcusparse_static.a libnppidei.so libnppist_static.a libnvgraph_static.a
libaccinj64.so.10.2 libculibos.a liblapack_static.a libnppidei.so.10 libnppisu.so libnvjpeg.so
libaccinj64.so.10.2.89 libcupti.so libmetis_static.a libnppidei.so.10.2.1.89 libnppisu.so.10 libnvjpeg.so.10
libcudadevrt.a libcupti.so.10.2 libnppc.so libnppidei_static.a libnppisu.so.10.2.1.89 libnvjpeg.so.10.3.1.89
libcudart.so libcupti.so.10.2.75 libnppc.so.10 libnppif.so libnppisu_static.a libnvjpeg_static.a
libcudart.so.10.2 libcurand.so libnppc.so.10.2.1.89 libnppif.so.10 libnppitc.so libnvperf_host.so
libcudart.so.10.2.89 libcurand.so.10 libnppc_static.a libnppif.so.10.2.1.89 libnppitc.so.10 libnvperf_target.so
libcudart_static.a libcurand.so.10.1.2.89 libnppial.so libnppif_static.a libnppitc.so.10.2.1.89 libnvrtc-builtins.so
libcufft.so libcurand_static.a libnppial.so.10 libnppig.so libnppitc_static.a libnvrtc-builtins.so.10.2
libcufft.so.10 libcusolver.so libnppial.so.10.2.1.89 libnppig.so.10 libnpps.so libnvrtc-builtins.so.10.2.89
libcufft.so.10.1.2.89 libcusolver.so.10 libnppial_static.a libnppig.so.10.2.1.89 libnpps.so.10 libnvrtc.so
libcufft_static.a libcusolver.so.10.3.0.89 libnppicc.so libnppig_static.a libnpps.so.10.2.1.89 libnvrtc.so.10.2
libcufft_static_nocallback.a libcusolverMg.so libnppicc.so.10 libnppim.so libnpps_static.a libnvrtc.so.10.2.89
libcufftw.so libcusolverMg.so.10 libnppicc.so.10.2.1.89 libnppim.so.10 libnvToolsExt.so stubs
libcufftw.so.10 libcusolverMg.so.10.3.0.89 libnppicc_static.a libnppim.so.10.2.1.89 libnvToolsExt.so.1
libcufftw.so.10.1.2.89 libcusolver_static.a libnppicom.so libnppim_static.a libnvToolsExt.so.1.0.0
libcufftw_static.a libcusparse.so libnppicom.so.10 libnppist.so libnvgraph.so
libcuinj64.so libcusparse.so.10 libnppicom.so.10.2.1.89 libnppist.so.10 libnvgraph.so.10
libcuinj64.so.10.2 libcusparse.so.10.3.1.89 libnppicom_static.a libnppist.so.10.2.1.89 libnvgraph.so.10.2.89
```
Here are my build logs: conda-build.e85657 conda-build.o85657 build.sh conda_build_config.yaml meta.yaml
Hi @njzjz, could you inspect the output of `module list`? I believe TACC has a module system that loads the compiler and software for you. Looking at `/usr/local/cuda-10.2/lib64` likely does not help, because cuDNN ships separately from the CUDA Toolkit, so it is usually not placed under `/usr/local/cuda-XXX/lib64`. I am not a TACC user; I learned their software management here: https://portal.tacc.utexas.edu/software/tensorflow.
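One quick way to probe for a system-wide cuDNN, as a sketch: `ctypes.util.find_library` consults the same lookup machinery (the ldconfig cache, the compiler) that the system itself uses, so the results depend on the machine and a `None` result is consistent with no system-wide copy being visible:

```python
from ctypes import util

# Sanity check that the lookup machinery works at all on this machine:
# on Linux this usually resolves to something like "libc.so.6".
print(util.find_library("c"))

# Resolves to e.g. "libcudnn.so.8" if a system-wide cuDNN is visible to
# the loader, or None if not (likely None outside a CUDA environment).
print(util.find_library("cudnn"))
```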
```
(base) login2.longhorn(1001)$ module list

Currently Loaded Modules:
  1) xl/16.1.1             3) git/2.24.1      5) cmake/3.16.1   7) TACC
  2) spectrum_mpi/10.3.0   4) autotools/1.2   6) xalt/2.10.21
```
and here are all available modules:
```
(base) login2.longhorn(1002)$ module av

----------------- /opt/apps/xl16/spectrum_mpi10_3/modulefiles ------------------
   petsc/3.13-complex           petsc/3.13-i64debug
   petsc/3.13-complexdebug      petsc/3.13-i64
   petsc/3.13-complexi64debug   petsc/3.13-nohdf5
   petsc/3.13-complexi64        petsc/3.13-single
   petsc/3.13-cuda              petsc/3.13-singledebug
   petsc/3.13-cudadebug         petsc/3.13-uni
   petsc/3.13-debug             petsc/3.13-unidebug
   petsc/3.13-hyprefei          petsc/3.13 (D)

-------------------------- /opt/apps/xl16/modulefiles --------------------------
   hdf5/1.10.4           netcdf/4.7.4
   mvapich2-gdr/2.3.4    spectrum_mpi/10.3.0 (L)

---------------------------- /opt/apps/modulefiles -----------------------------
   TACC            (L)      python3/powerai_1.6.1
   autotools/1.2   (L)      python3/powerai_1.6.2
   cmake/3.16.1    (L)      python3/powerai_1.7.0 (D)
   conda/4.8.3              pytorch-py2/1.0.1
   cuda/10.0       (g)      pytorch-py2/1.1.0 (D)
   cuda/10.1       (g)      pytorch-py3/1.0.1
   cuda/10.2       (g,D)    pytorch-py3/1.1.0
   gcc/4.9.3                pytorch-py3/1.2.0
   gcc/6.3.0                pytorch-py3/1.3.1 (D)
   gcc/7.3.0       (D)      sanitytool/1.5
   gcc/9.1.0                settarg
   git/2.24.1      (L)      tacc-singularity/3.5.3
   idev/1.5.7               tacc_tips/0.5
   launcher_gpu/1.1         tensorflow-py2/1.13.1
   lmod                     tensorflow-py2/1.14.0 (D)
   pgi/19.10.0              tensorflow-py3/1.13.1
   pgi/20.7.0      (D)      tensorflow-py3/1.14.0
   pylauncher/3.1           tensorflow-py3/1.15.2
   python2/powerai_1.6.0    tensorflow-py3/2.1.0 (D)
   python2/powerai_1.6.1 (D)   xalt/2.10.21 (L)
   python3/powerai_1.6.0    xl/16.1.1 (L)

  Where:
   D:  Default Module
   L:  Module is loaded
   g:  built for GPU

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching
any of the "keys".
```
I don't find cudnn, though.
This is the module file:
```
(base) login2.longhorn(1041)$ cat 10.2.lua
local help_message = [[
The NVIDIA CUDA Toolkit provides a comprehensive development environment for C
and C++ developers building GPU-accelerated applications. The CUDA Toolkit
includes a compiler for NVIDIA GPUs, math libraries, and tools for debugging
and optimizing the performance of your applications. You will also find
programming guides, user manuals, API reference, and other documentation to
help you get started quickly accelerating your application with GPUs.

This module defines the environmental variables TACC_CUDA_BIN,
TACC_CUDA_LIB, TACC_CUDA_INC, TACC_CUDA_DOC, and TACC_CUDA_DIR
for the location of the cuda binaries, libaries, includes,
documentation, and main root directory respectively.

The location of the:
1.) binary files is added to PATH
2.) libraries is added to LD_LIBRARY_PATH
3.) header files is added to INCLUDE
4.) man pages is added to MANPATH

Version 10.2
]]

help(help_message,"\n")

whatis("Name: cuda")
whatis("Version: 10.2")
whatis("Category: Compiler, Runtime Support")
whatis("Description: NVIDIA CUDA Toolkit for Linux")
whatis("URL: http://www.nvidia.com/cuda")

-- Export environmental variables
local cuda_dir="/usr/local/cuda-10.2"
local cuda_bin=pathJoin(cuda_dir,"bin")
local cuda_lib=pathJoin(cuda_dir,"lib64")
local cuda_inc=pathJoin(cuda_dir,"include")
local cuda_doc=pathJoin(cuda_dir,"doc")

setenv("TACC_CUDA_DIR",cuda_dir)
setenv("TACC_CUDA_BIN",cuda_bin)
setenv("TACC_CUDA_LIB",cuda_lib)
setenv("TACC_CUDA_INC",cuda_inc)
setenv("TACC_CUDA_DOC",cuda_doc)

prepend_path("PATH"           ,cuda_bin)
prepend_path("LD_LIBRARY_PATH",cuda_lib)
prepend_path("INCLUDE"        ,cuda_inc)
prepend_path("MANPATH"        ,pathJoin(cuda_doc,"man"))

-- Adding to MODULEPATH for CUDA-dependent packages
prepend_path("MODULEPATH"     ,pathJoin("/opt/apps","cuda10_2","modulefiles"))

add_property("arch","gpu")
```
Thank you, @njzjz. As part of the debugging effort, would you mind testing cuDNN for us in another way? On your ppc64le cluster (longhorn) install CuPy from conda-forge (EDIT: added a missing command):
```
conda create -n my_test_env -c conda-forge python cupy cudnn
conda activate my_test_env
python -c "import cupy; from cupy import cudnn"
```
and see if you encounter any error.
Also, share with us `conda info` and `conda list` for this env.
Same error.
```
(/scratch/07349/njzjz/my_test_env) login2.longhorn(1005)$ python -c "import cupy; from cupy import cudnn"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cupy/cudnn.pyx", line 1, in init cupy.cudnn
ImportError: libcudnn.so.8: ELF load command alignment not page-aligned

(/scratch/07349/njzjz/my_test_env) login2.longhorn(1006)$ conda info

     active environment : /scratch/07349/njzjz/my_test_env
    active env location : /scratch/07349/njzjz/my_test_env
            shell level : 2
       user config file : /home/07349/njzjz/.condarc
 populated config files : /home/07349/njzjz/.condarc
          conda version : 4.10.3
    conda-build version : 3.21.4
         python version : 3.8.8.final.0
       virtual packages : __cuda=10.2=0
                          __linux=4.14.0=0
                          __glibc=2.17=0
                          __unix=0=0
                          __archspec=1=ppc64le
       base environment : /home/07349/njzjz/anaconda3  (writable)
      conda av data dir : /home/07349/njzjz/anaconda3/etc/conda
  conda av metadata url : None
           channel URLs : https://repo.anaconda.com/pkgs/main/linux-ppc64le
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-ppc64le
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /scratch/07349/njzjz/conda-pkgs
       envs directories : /home/07349/njzjz/anaconda3/envs
                          /home/07349/njzjz/.conda/envs
               platform : linux-ppc64le
             user-agent : conda/4.10.3 requests/2.25.1 CPython/3.8.8 Linux/4.14.0-115.10.1.el7a.ppc64le rhel/7.6 glibc/2.17
                UID:GID : 866484:822414
             netrc file : None
           offline mode : False

(/scratch/07349/njzjz/my_test_env) login2.longhorn(1007)$ conda list
# packages in environment at /scratch/07349/njzjz/my_test_env:
#
# Name                    Version       Build                 Channel
_libgcc_mutex             0.1           conda_forge           conda-forge
_openmp_mutex             4.5           1_gnu                 conda-forge
ca-certificates           2021.5.30     h1084571_0            conda-forge
certifi                   2021.5.30     py39hc1b9086_0        conda-forge
cudatoolkit               10.2.89       h455192d_8            conda-forge
cudnn                     8.0.5.39      h69e801d_2            conda-forge
cupy                      9.3.0         py39h194685b_0        conda-forge
fastrlock                 0.6           py39had50986_1        conda-forge
ld_impl_linux-ppc64le     2.36.1        ha35d02b_2            conda-forge
libblas                   3.9.0         10_openblas           conda-forge
libcblas                  3.9.0         10_openblas           conda-forge
libffi                    3.3           hea85c5d_2            conda-forge
libgcc-ng                 11.1.0        h16e2c27_8            conda-forge
libgfortran-ng            11.1.0        hfdc3801_8            conda-forge
libgfortran5              11.1.0        h24cf76c_8            conda-forge
libgomp                   11.1.0        h16e2c27_8            conda-forge
liblapack                 3.9.0         10_openblas           conda-forge
libopenblas               0.3.17        pthreads_h486567c_1   conda-forge
libstdcxx-ng              11.1.0        h8186cfa_8            conda-forge
ncurses                   6.2           hea85c5d_4            conda-forge
numpy                     1.21.1        py39he089932_0        conda-forge
openssl                   1.1.1k        h4e0d66e_0            conda-forge
pip                       21.2.3        pyhd8ed1ab_0          conda-forge
python                    3.9.6         h82ac395_1_cpython    conda-forge
python_abi                3.9           2_cp39                conda-forge
readline                  8.1           h5c45dff_0            conda-forge
setuptools                49.6.0        py39hc1b9086_3        conda-forge
sqlite                    3.36.0        h4e2196e_0            conda-forge
tk                        8.6.10        h38e1d09_1            conda-forge
tzdata                    2021a         he74cb21_1            conda-forge
wheel                     0.36.2        pyhd3deb0d_0          conda-forge
xz                        5.2.5         h6eb9509_1            conda-forge
zlib                      1.2.11        h6eb9509_1010         conda-forge
```
Thanks, @njzjz. Let's check what shared libraries are loaded:
```
LD_DEBUG=libs python -c "from cupy import cudnn" > debug.out 2>&1
```

Could you please share the content of `debug.out`?
Thanks, @njzjz. I have a theory that I'd like to test. Will ping you when it's ready!
Hi @njzjz, I have looked into it, and my theory is that `patchelf` was buggy. To verify this, could you kindly help me do two more tests (which I hope are enough)? The first one is to execute the script below in the conda env in which the broken cudnn is installed and show me the output:
```python
from ctypes import cdll
import os
import sys
from subprocess import check_output

def check_binary(binary):
    cmd = [sys.executable, '-c', f'from ctypes import cdll; cdll.LoadLibrary("{binary}")']
    return check_output(cmd, env=os.environ)

print(check_binary(f"{os.environ['CONDA_PREFIX']}/lib/libcudnn.so"))
```
I expect this to fail. If this is confirmed I'll prepare the 2nd test, thanks!
By the way, @njzjz, as part of the ELF debugging, do you mind testing another package for us?
```
conda create -n my_test_env2 -c conda-forge python cupy cutensor  # or just install cutensor to the previous env you set up
conda activate my_test_env2
python -c "import cupy; from cupy import cutensor"
```
cuTENSOR is also available on ppc64le, and I am now worried that it suffers from the same issue... Thanks!
```
(/scratch/07349/njzjz/my_test_env) login2.longhorn(1008)$ python test.py
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/scratch/07349/njzjz/my_test_env/lib/python3.9/ctypes/__init__.py", line 460, in LoadLibrary
    return self._dlltype(name)
  File "/scratch/07349/njzjz/my_test_env/lib/python3.9/ctypes/__init__.py", line 382, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /scratch/07349/njzjz/my_test_env/lib/libcudnn.so: ELF load command alignment not page-aligned

Traceback (most recent call last):
  File "/scratch/07349/njzjz/test.py", line 11, in <module>
    print(check_binary(f"{os.environ['CONDA_PREFIX']}/lib/libcudnn.so"))
  File "/scratch/07349/njzjz/test.py", line 9, in check_binary
    return check_output(cmd, env=os.environ)
  File "/scratch/07349/njzjz/my_test_env/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/scratch/07349/njzjz/my_test_env/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/scratch/07349/njzjz/my_test_env/bin/python', '-c', 'from ctypes import cdll; cdll.LoadLibrary("/scratch/07349/njzjz/my_test_env/lib/libcudnn.so")']' returned non-zero exit status 1.
```

And for cuTENSOR:

```
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cupy/cutensor.pyx", line 1, in init cupy.cutensor
ImportError: libcutensor.so.1: ELF load command alignment not page-aligned
```
Thanks a lot, @njzjz! Either https://github.com/conda-forge/patchelf-feedstock/pull/20 or #32 will fix this issue, but the former is out of my hand so I can't give you an ETA (though I hope soon).
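Once the rebuilt packages land, the `Align` values can also be checked programmatically rather than by eyeballing `readelf` output. Below is a minimal illustration-only parser for little-endian ELF64 files (no error handling; prefer `readelf` or pyelftools for real use) that lists `p_align` for every PT_LOAD segment:

```python
import struct
import sys

PT_LOAD = 1

def load_aligns(path):
    """Return the p_align of every PT_LOAD segment in a little-endian
    ELF64 file. Minimal parser for illustration only."""
    with open(path, "rb") as f:
        ident = f.read(16)
        assert ident[:4] == b"\x7fELF" and ident[4] == 2, "not a 64-bit ELF"
        # Fields after e_ident: type, machine, version, entry, phoff, shoff,
        # flags, ehsize, phentsize, phnum, shentsize, shnum, shstrndx
        hdr = struct.unpack("<HHIQQQIHHHHHH", f.read(48))
        e_phoff, e_phentsize, e_phnum = hdr[4], hdr[8], hdr[9]
        aligns = []
        for i in range(e_phnum):
            f.seek(e_phoff + i * e_phentsize)
            p_type, _, _, _, _, _, _, p_align = struct.unpack("<IIQQQQQQ", f.read(56))
            if p_type == PT_LOAD:
                aligns.append(p_align)
        return aligns

if __name__ == "__main__":
    # Example: inspect the running Python interpreter; on the cluster,
    # substitute e.g. $CONDA_PREFIX/lib/libcudnn.so.
    print([hex(a) for a in load_aligns(sys.executable)])
```

On a fixed package, every value printed for `libcudnn.so` should be a multiple of the 64 KiB ppc64le page size.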
Hi @njzjz Thank you for your help and patience. The fix is finally done. In about an hour or two, both cuDNN and cuTENSOR will be available for installation. Could you kindly redo the test for me later today please?
```
conda create -n my_env3 -c conda-forge python cupy cutensor cudnn
conda activate my_env3
python -c "from cupy import cudnn; from cupy import cutensor;"
```
@leofang
```
(/scratch/07349/njzjz/my_test_env3) login2.longhorn(1006)$ python -c "from cupy import cudnn; from cupy import cutensor;"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "cupy/cutensor.pyx", line 1, in init cupy.cutensor
ImportError: /scratch/07349/njzjz/my_test_env3/lib/python3.9/site-packages/cupy_backends/cuda/libs/../../../../../libcutensor.so.1: undefined symbol: cudaMemsetAsync
```
cuTENSOR has an undefined symbol. However, I think cuDNN has no problem now. So I'll close this issue.
Thanks @njzjz. Yes, let's move the discussion to https://github.com/conda-forge/cutensor-feedstock/issues/16.
Issue: When I built, installed, and ran TensorFlow along with conda-forge's cuDNN 8.0.5 on a linux-ppc64le supercomputer (Longhorn), I got the following error:

After I downloaded the same version from NVIDIA's official website and overrode conda-forge's `libcudnn.so.8.0.5`, it worked. So I believe this is an issue on conda-forge's side.

Environment (`conda list`):

Details about `conda` and system (`conda info`):