Open den-run-ai opened 5 years ago
Looks like cupy mainly found CUDA OK (note it found nvcc), and most of those CUDA libraries in the PowerAI packaging will be under lib64 in your environment (i.e. ..../powerai.1.6/lib64/).
But the one problem library:
/data/gpfs/Users/j0541825/anaconda3/envs/powerai.1.6/bin/../lib/gcc/powerpc64le-conda_cos7-linux-gnu/7.3.0/../../../../powerpc64le-conda_cos7-linux-gnu/bin/ld: cannot find -lcuda
collect2: error: ld returned 1 exit status
is a bit different. libcuda is kind of part of the GPU driver stack, rather than part of the CUDA Toolkit proper.
The Toolkit does include a stub copy of that library, suitable for building apps against. You should find that in $CONDA_PREFIX/lib/stubs/. You'll probably need to inform cupy about its location (possibly via LDFLAGS or LIBRARY_PATH).
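A minimal sketch of that idea, assuming the stub sits under $CONDA_PREFIX/lib/stubs/ as described above. The helper name find_libcuda_stub is hypothetical, and whether the cupy build picks up LIBRARY_PATH in a given setup is an assumption worth verifying:

```shell
# Hypothetical helper: print <prefix>/lib/stubs if it holds a libcuda.so stub.
find_libcuda_stub() {
    stub_dir="$1/lib/stubs"
    if [ -e "$stub_dir/libcuda.so" ]; then
        printf '%s\n' "$stub_dir"
    else
        echo "no libcuda.so stub under $stub_dir" >&2
        return 1
    fi
}

# Example use (commented out so that sourcing this file has no side effects):
# stub_dir=$(find_libcuda_stub "$CONDA_PREFIX") &&
#     LIBRARY_PATH="$stub_dir${LIBRARY_PATH:+:$LIBRARY_PATH}" pip install cupy
```

The stub is only meant for linking at build time; it should not be on the library search path when the app actually runs.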
For running the app, you'll need the "real" libcuda.so. That should be installed on your system in some default search location as part of the GPU driver installation.
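One way to check for the driver's copy is to look it up in the dynamic linker cache. This is a rough sketch: the helper name real_libcuda_path is hypothetical, and it assumes the driver registers the library as libcuda.so.1 in the cache listed by ldconfig -p:

```shell
# Hypothetical helper: read `ldconfig -p` style output on stdin and print the
# path of the first libcuda.so.1 entry; return non-zero if there is none.
real_libcuda_path() {
    awk '$1 == "libcuda.so.1" { print $NF; found = 1; exit } END { exit !found }'
}

# Example use on a machine with the GPU driver installed:
# ldconfig -p | real_libcuda_path
```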
@hartb the issue is actually much simpler. Horovod installation instructions for powerai 1.6 from @nvcastet require a compiler toolchain installed as conda packages (see link below). This interferes with the way the libraries are set up and searched on the machine. pip install cupy works out of the box, even without a powerai installation, in a python 3.6 powerpc anaconda environment. Note that after the toolchain for Horovod is installed, the compilers are not removed as part of conda uninstall, so the environment gets messed up. So don't mix cupy and horovod :(
https://github.com/horovod/horovod/pull/847#issuecomment-475767360
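To see whether compiler packages are still lingering in an environment, something like the following could help. It is only a sketch: the helper name leftover_toolchain_pkgs is hypothetical, and the package-name prefixes it matches are an assumption about how the Anaconda toolchain packages are named:

```shell
# Hypothetical helper: read `conda list` output on stdin and print the names
# of packages that look like Anaconda compiler toolchain pieces.
# The name prefixes below are an assumption about how those packages are named.
leftover_toolchain_pkgs() {
    awk '$1 ~ /^(gcc|gxx|binutils)_/ { print $1 }'
}

# Example use inside the affected environment:
# conda list | leftover_toolchain_pkgs
```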
@nvcastet with powerai 1.6.1 is horovod still not provided as a conda package? How can I resolve this issue above?
@denfromufa If I can speak for @nvcastet... I'm afraid horovod is still not included as a PowerAI (now Watson Machine Learning Community Edition (WML CE)) 1.6.1 package.
But I was able to build horovod and cupy together in a PowerAI 1.6.0 container with the steps below. Maybe they'll work for you. I'm not familiar enough with horovod to exercise the components together, but at least I can get the build to work:
# Install the compiler as described in @nvcastet's blog, but be sure to install
# gcc / g++ v7, rather than the default (v8).
conda install gxx_linux-ppc64le=7 cffi cudatoolkit-dev
# Ensure that the Anaconda compilers are visible in the path as "gcc" and "g++".
# Needed for both horovod and cupy build.
#
# nvcc tries to execute the compilers by those names specifically. It doesn't
# honor typical environment variables (e.g. CC, GCC) that would point to
# the compiler. And it invokes the compilers in a way that's not informed
# by shell aliases.
mkdir -p "$HOME/bin"
ln -s $CONDA_PREFIX/bin/*-gcc $HOME/bin/gcc
ln -s $CONDA_PREFIX/bin/*-g++ $HOME/bin/g++
export PATH="$PATH:$HOME/bin"
which gcc g++
# Build horovod as described by @nvcastet
HOROVOD_CUDA_HOME=$CONDA_PREFIX HOROVOD_GPU_ALLREDUCE=DDL pip install horovod --no-cache-dir
# Set up variable to help cupy build find libcuda.so
LIBCUDA_DIR=$(find /usr -name "libcuda.so" -printf "%h" -quit)
echo $LIBCUDA_DIR
# Kick off the build
LDFLAGS="$LDFLAGS -L$LIBCUDA_DIR" pip install cupy
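Before kicking off either build, it may be worth confirming that the gcc found on the PATH really is v7, as the steps above call for. A small sketch, with hypothetical helper names gcc_major and is_gcc7:

```shell
# Hypothetical helper: extract the major version from a version string
# such as the one printed by `gcc -dumpversion`.
gcc_major() {
    printf '%s\n' "${1%%.*}"
}

# Hypothetical check: succeed only if the given version string is a 7.x release.
is_gcc7() {
    [ "$(gcc_major "$1")" = "7" ]
}

# Example use after the symlinks above are in place:
# is_gcc7 "$(gcc -dumpversion)" || echo "gcc on PATH is not v7" >&2
```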
On using OpenMPI on Power systems: I recently needed to run a model written with Horovod and mpi4py. I could not get horovodrun to work with the NCCL backend with Spectrum MPI, so I eventually used OpenMPI. Trying to compile and run Horovod against the OpenMPI installed as an RPM with PowerAI's TensorFlow did not work for various reasons. What eventually worked, and worked well, was:
Ok, guys - let me test this out in the next few days :) @smatzek how did you build openmpi as a conda package, which recipe did you use?
I'm getting this error for pip install cupy with powerai 1.6: