IBM / powerai

This repo contains ancillary information used to assist users of IBM Watson Machine Learning Community Edition. This repo will contain How To's, Readme's, Dockerfiles, etc. that can be consumed by users looking to get started.
BSD 2-Clause "Simplified" License

cupy and horovod cannot be installed into the same powerai 1.6 environment due to compiler incompatibilities #19

Open den-run-ai opened 5 years ago

den-run-ai commented 5 years ago

I'm getting this error for pip install cupy with powerai 1.6:

Collecting cupy
  Using cached https://files.pythonhosted.org/packages/cd/d6/532e5da87f3b513cd0b98bcbf9a58fb6758598039944c42cb93d13b71a5f/cupy-5.4.0.tar.gz
    Complete output from command python setup.py egg_info:
    Options: {'package_name': 'cupy', 'long_description': None, 'wheel_libs': [], 'no_rpath': False, 'profile': False, 'linetrace': False, 'annotate': False, 'no_cuda': False}

    -------- Configuring Module: cuda --------
    cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
    cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
    /data/gpfs/Users/j0541825/anaconda3/envs/powerai.1.6/bin/../lib/gcc/powerpc64le-conda_cos7-linux-gnu/7.3.0/../../../../powerpc64le-conda_cos7-linux-gnu/bin/ld: cannot find -lcuda
    collect2: error: ld returned 1 exit status
    Cannot build a stub file.
    Original error: command '/data/gpfs/Users/j0541825/anaconda3/envs/powerai.1.6/bin/powerpc64le-conda_cos7-linux-gnu-c++' failed with exit status 1

    ************************************************************
    * CuPy Configuration Summary                               *
    ************************************************************

    Build Environment:
      Include directories: ['/data/gpfs/Users/j0541825/anaconda3/envs/powerai.1.6/include']
      Library directories: ['/data/gpfs/Users/j0541825/anaconda3/envs/powerai.1.6/lib64', '/data/gpfs/Users/j0541825/anaconda3/envs/powerai.1.6/lib']
      nvcc command       : ['/data/gpfs/Users/j0541825/anaconda3/envs/powerai.1.6/bin/nvcc']

    Environment Variables:
      CFLAGS          : -mcpu=power8 -mtune=power8 -mpower8-fusion -mpower8-vector -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe
      LDFLAGS         : -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now
      LIBRARY_PATH    : (none)
      CUDA_PATH       : (none)
      NVTOOLSEXT_PATH : (none)
      NVCC            : (none)

    Modules:
      cuda      : No
        -> Cannot link libraries: ['cublas', 'cuda', 'cudart', 'cufft', 'curand', 'cusparse', 'nvrtc']
        -> Check your LDFLAGS environment variable.

    ERROR: CUDA could not be found on your system.
    Please refer to the Installation Guide for details:
    https://docs-cupy.chainer.org/en/stable/install.html

    ************************************************************

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-f62gddrq/cupy/setup.py", line 120, in <module>
        ext_modules = cupy_setup_build.get_ext_modules()
      File "/tmp/pip-install-f62gddrq/cupy/cupy_setup_build.py", line 588, in get_ext_modules
        extensions = make_extensions(arg_options, compiler, use_cython)
      File "/tmp/pip-install-f62gddrq/cupy/cupy_setup_build.py", line 384, in make_extensions
        raise Exception('Your CUDA environment is invalid. '
    Exception: Your CUDA environment is invalid. Please check above error log.

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-f62gddrq/cupy/

hartb commented 5 years ago

Looks like cupy mostly found CUDA OK (note it found nvcc), and most of those CUDA libraries in the PowerAI packaging will be under lib64 in your environment (i.e. ..../powerai.1.6/lib64/).

But the one problem library:

    /data/gpfs/Users/j0541825/anaconda3/envs/powerai.1.6/bin/../lib/gcc/powerpc64le-conda_cos7-linux-gnu/7.3.0/../../../../powerpc64le-conda_cos7-linux-gnu/bin/ld: cannot find -lcuda
    collect2: error: ld returned 1 exit status

is a bit different. libcuda is kind of part of the GPU driver stack, rather than part of the CUDA Toolkit proper.

The Toolkit does include a stub copy of that library, suitable for building apps against. You should find that in $CONDA_PREFIX/lib/stubs/. You'll probably need to inform cupy about its location (possibly via LDFLAGS or LIBRARY_PATH).

For running the app, you'll need the "real" libcuda.so. That should be installed in some default system search location as part of the GPU driver installation.
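A minimal sketch of the stub-library suggestion above. The helper name `find_libcuda_stub_dir` and the `lib64/stubs` fallback are assumptions made for illustration; they are not part of PowerAI or cupy. The helper only prints the directory that holds a stub libcuda.so, and the commented usage passes it to the linker through LDFLAGS.

```shell
# Hypothetical helper: print the first directory under the given prefix
# that contains a stub libcuda.so. The two candidate subdirectories are
# assumptions; adjust them to match your environment.
find_libcuda_stub_dir() {
    for dir in "$1/lib/stubs" "$1/lib64/stubs"; do
        if [ -e "$dir/libcuda.so" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Possible usage (not executed here):
#   STUB_DIR=$(find_libcuda_stub_dir "$CONDA_PREFIX") &&
#       LDFLAGS="$LDFLAGS -L$STUB_DIR" pip install cupy
```

The stub is only meant for linking at build time; at run time the driver's own libcuda.so still has to be found on the system.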

den-run-ai commented 5 years ago

@hartb the issue is actually much simpler. The Horovod installation instructions for powerai 1.6 from @nvcastet require a compiler toolchain installed as conda packages (see link below). This interferes with how libraries are set up and searched on the machine. pip install cupy works out of the box in a python 3.6 powerpc anaconda environment, even without powerai installed. Note that once the toolchain for Horovod is installed, conda uninstall does not remove the compilers, so the environment stays messed up. So don't mix cupy and horovod :(

https://github.com/horovod/horovod/pull/847#issuecomment-475767360
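One way to see what a conda compiler toolchain has changed is to print the build-related variables before and after activating the environment. This is only a sketch: the function name `show_build_env` and the choice of variables are assumptions, picked to mirror the "Environment Variables" part of the cupy configuration summary in the log above.

```shell
# Hypothetical diagnostic: print build-related environment variables,
# showing "(none)" for any that are unset or empty.
show_build_env() {
    for name in CC CXX CFLAGS LDFLAGS LIBRARY_PATH CUDA_PATH; do
        # The names come from the fixed list above, so eval is safe here.
        eval "value=\${$name-}"
        printf '%-12s: %s\n' "$name" "${value:-(none)}"
    done
}
```

Comparing the output in a fresh shell with the output inside the activated environment should show which variables the conda compiler packages set, which is likely where the changed library search behaviour comes from.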

den-run-ai commented 5 years ago

@nvcastet with powerai 1.6.1 is horovod still not provided as a conda package? How can I resolve this issue above?

hartb commented 5 years ago

@denfromufa If I can speak for @nvcastet... I'm afraid horovod is still not included as a PowerAI (now Watson Machine Learning Community Edition (WML CE)) 1.6.1 package.

But I was able to build horovod and cupy together in a PowerAI 1.6.0 container with the steps below. Maybe they'll work for you. I'm not familiar enough with horovod to exercise the components together, but at least I can get the build to work:

# Install compiler as from @nvcastet's blog, but be sure to install
# gcc / g++ v7, rather than the default (v8).

conda install gxx_linux-ppc64le=7 cffi cudatoolkit-dev

# Ensure that the Anaconda compilers are visible in the path as "gcc" and "g++".
# Needed for both horovod and cupy build.
#
# nvcc tries to execute the compilers by those names specifically. It doesn't
# honor typical environment variables (e.g. CC, GCC) that would point to
# the compiler. And it invokes the compilers in a way that's not informed
# by shell aliases.

mkdir -p "$HOME/bin"
ln -s "$CONDA_PREFIX"/bin/*-gcc "$HOME/bin/gcc"
ln -s "$CONDA_PREFIX"/bin/*-g++ "$HOME/bin/g++"
export PATH="$PATH:$HOME/bin"
which gcc g++

# Build horovod as described by @nvcastet

HOROVOD_CUDA_HOME=$CONDA_PREFIX HOROVOD_GPU_ALLREDUCE=DDL pip install horovod --no-cache-dir

# Set up variable to help cupy build find libcuda.so

LIBCUDA_DIR=$(find /usr -name "libcuda.so" -printf "%h" -quit)
echo $LIBCUDA_DIR

# Kick off the build

LDFLAGS="$LDFLAGS -L$LIBCUDA_DIR" pip install cupy
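The symlink step above relies on each glob matching exactly one compiler; if it matches nothing, `ln -s` leaves a dangling link that nvcc cannot run. A small check like the following can catch that before starting the builds. The function name `check_compiler_links` is hypothetical, and the sketch only tests that `gcc` and `g++` in the given directory resolve to something executable.

```shell
# Hypothetical check: confirm that gcc and g++ in the given directory
# exist and resolve to executable files (dangling symlinks fail -x).
check_compiler_links() {
    status=0
    for name in gcc g++; do
        if [ -x "$1/$name" ]; then
            echo "ok: $1/$name"
        else
            echo "missing or broken: $1/$name" >&2
            status=1
        fi
    done
    return "$status"
}

# Possible usage (not executed here):
#   check_compiler_links "$HOME/bin"
```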

smatzek commented 5 years ago

On using OpenMPI on Power systems: I recently needed to run a model written with Horovod and mpi4py. I could not get horovodrun to work with the NCCL backend with Spectrum MPI, so I eventually used openmpi. Trying to compile and run Horovod against the openmpi installed as an RPM with PowerAI's TensorFlow did not work for various reasons. What eventually worked, and worked well, was:

  1. Build openmpi as a conda package
  2. Install openmpi
  3. Install compilers in the conda env
  4. pip install/build Horovod
  5. pip install/build mpi4py
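
The five steps above might look roughly like the following. This is a dry-run sketch, not the commands smatzek actually ran: the recipe directory is a hypothetical placeholder (the thread does not say which recipe was used), the compiler package is borrowed from hartb's steps earlier in the thread, and no Horovod build flags are set because none were given. By default the script only prints each command; set DRY_RUN=0 to execute them.

```shell
# Run a command, or only print it when DRY_RUN=1 (the default here).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: path to an openmpi conda recipe (hypothetical, not from the thread).
install_horovod_with_openmpi() {
    run conda build "$1" &&                       # 1. build openmpi package
    run conda install -y --use-local openmpi &&   # 2. install it
    run conda install -y gxx_linux-ppc64le=7 &&   # 3. compilers (assumed package)
    run pip install --no-cache-dir horovod &&     # 4. build Horovod
    run pip install mpi4py                        # 5. build mpi4py
}

# Possible usage (prints the commands without running them):
#   install_horovod_with_openmpi /path/to/openmpi-recipe
```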

den-run-ai commented 5 years ago

Ok, guys - let me test this out in the next few days :) @smatzek how did you build openmpi as a conda package, which recipe did you use?