mxmlnkn opened 5 years ago
Ok, I got a working singularity definition file now. At least it compiles. It builds with the base images 10.0-cudnn7-devel-ubuntu18.04 as well as 9.0-cudnn7-devel-ubuntu16.04 from the CUDA docker hub, but the CUDA 9 build currently segfaults for some reason.
Bootstrap: docker
From: nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
%environment
if test -d /opt/Aluminum; then export ALUMINUM_DIR=/opt/Aluminum; fi
export CEREAL_DIR=/opt/cereal
export CNPY_DIR=/opt/cnpy
export HWLOC_DIR=/opt/hwloc
export HYDROGEN_DIR=/opt/Elemental
export LBANN_DIR=/opt/lbann
export OPENCV_DIR=/opt/opencv
export CUB_DIR=$( cd /opt/cub-* && pwd )
export PATH=$LBANN_DIR/bin:$PATH
%post
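# %post may be interpreted by /bin/sh (dash) instead of bash;
# if so, re-run the script under bash so the bashisms below work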
if test "$0" = "/bin/sh"; then
echo "Elevating script to bash"
sed 's|.*\x0||' /proc/$$/cmdline | /bin/bash -ve
exit $?
fi
umask 022
fixCMakeCudaMPIPthreadBug()
{
# fixes https://gitlab.kitware.com/cmake/cmake/issues/17929#note_514823
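# nvcc does not understand the -pthread flag that CMake's FindMPI adds, so strip it from the generated build rule files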
find . -type f | xargs -I{} bash -c '
if grep -q -E "(nvcc|CUDA_FLAG).* -pthread" "$0"; then
sed -i -r "/nvcc.* -pthread/{s: -pthread( |$): :g}" "$0";
fi' {}
}
export PATH=/usr/local/cuda/bin:$PATH
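# link against the CUDA stub libraries (libcuda.so, libnvidia-ml.so) so that linking works without the NVIDIA driver inside the container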
export LIBRARY_PATH="$( which nvcc | sed 's|/bin/nvcc||' )/lib64/stubs"
echo "$LIBRARY_PATH" > /etc/ld.so.conf.d/cuda-10-stub.conf && ldconfig
apt-get -y update &&
apt-get -y install gcc g++ gfortran git curl python tar make zlib*-dev \
libopenblas-dev libopenmpi-dev libprotobuf-dev protobuf-compiler liblapack-dev
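# emulate unzip with Python's zipfile module so the unzip package is not needed (used for the cnpy download below)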
unzip(){ python -c "from zipfile import PyZipFile; PyZipFile( '''$1''' ).extractall()"; }
# Need:
# - CMake >3.12.2 because of https://github.com/clab/dynet/issues/1457
# - CMake >3.13.0 because of https://gitlab.kitware.com/cmake/cmake/issues/17929
# - CMake >3.??.? because of https://gitlab.kitware.com/cmake/cmake/issues/18897
cd /opt &&
curl -L https://github.com/Kitware/CMake/releases/download/v3.14.0-rc1/cmake-3.14.0-rc1-Linux-x86_64.tar.gz |
tar -xz && mv cmake-* /opt/cmake && ln -s /opt/cmake/bin/cmake /usr/bin/cmake
cd /opt &&
curl -L https://download.open-mpi.org/release/hwloc/v2.0/hwloc-2.0.3.tar.gz | tar -xz &&
cd hwloc-* && export HWLOC_DIR=/opt/hwloc &&
./configure --prefix="$HWLOC_DIR" && make -j $( nproc ) install
cd /opt && curl -L https://github.com/NVlabs/cub/archive/v1.8.0.tar.gz | tar -xz &&
cd cub-* && export CUB_DIR=$( pwd )
cd /opt && curl -L https://github.com/USCiLab/cereal/archive/v1.2.2.tar.gz | tar -xz &&
cd cereal-* && mkdir build && cd $_ && export CEREAL_DIR=/opt/cereal &&
cmake -DCMAKE_INSTALL_PREFIX=/opt/cereal -DJUST_INSTALL_CEREAL=ON .. && make -j $( nproc ) install
# commit 4e8810b1a8637695171ed346ce68f6984e585ef4 to be exact, but cnpy has no tagged release and only one commit in the last year
cd /opt && curl -L https://github.com/rogersce/cnpy/archive/master.zip -o master.zip && unzip $_ && rm $_ &&
cd cnpy-* && mkdir build && cd $_ && export CNPY_DIR=/opt/cnpy &&
cmake -DCMAKE_INSTALL_PREFIX="$CNPY_DIR" .. && make -j $( nproc ) install
# Don't manually install openmpi if the package manager version is recent enough.
# E.g., Ubuntu 18.04 ships with recent enough libopenmpi-dev 3.0.0 but Ubuntu 16.04 only ships openmpi 1.10
if test "$( dpkg-query --showformat='${Version}' --show libopenmpi-dev | sed 's|\..*||' )" -lt 3; then
apt-get -y purge libopenmpi-dev
cd /opt &&
curl -L https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.3.tar.bz2 |
tar -xj && cd openmpi-* && ./configure && make -j $( nproc ) install && ldconfig
fi
if test "$( dpkg-query --showformat='${Version}' --show libprotobuf-dev | sed 's|\..*||' )" -lt 3; then
apt-get -y purge libprotobuf-dev protobuf-compiler
cd /opt &&
curl -L https://github.com/protocolbuffers/protobuf/releases/download/v3.6.1/protobuf-cpp-3.6.1.tar.gz |
tar -xz && cd protobuf-* && ./configure && make -j $( nproc ) install && ldconfig
fi
# allow Aluminum build to fail (requires at least CUDA 9 because it uses CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS)
cd /opt && curl -L https://github.com/LLNL/Aluminum/archive/v0.2.tar.gz | tar -xz &&
cd Aluminum-* && mkdir build && cd $_ && export ALUMINUM_DIR=/opt/Aluminum &&
cmake -DCMAKE_INSTALL_PREFIX="$ALUMINUM_DIR" \
-DCMAKE_LIBRARY_PATH="$LIBRARY_PATH" \
-DALUMINUM_ENABLE_CUDA=ON \
-DALUMINUM_ENABLE_MPI_CUDA=ON \
-DALUMINUM_ENABLE_NCCL=ON .. &&
fixCMakeCudaMPIPthreadBug && make -j $( nproc ) install || true
# Needs at least CUDA 7.5 because it uses cuda_fp16.h even though Hydrogen_ENABLE_HALF=OFF Oo
cd /opt && curl -L https://github.com/LLNL/Elemental/archive/v1.1.0.tar.gz | tar -xz &&
cd Elemental-* && mkdir build && cd $_ && export HYDROGEN_DIR=/opt/Elemental &&
cmake -DHydrogen_USE_64BIT_INTS=ON \
-DHydrogen_ENABLE_OPENMP=ON \
-DBUILD_SHARED_LIBS=ON \
-DHydrogen_ENABLE_ALUMINUM=ON \
-DHydrogen_ENABLE_CUDA=ON \
-DCMAKE_INSTALL_PREFIX="$HYDROGEN_DIR" \
-DCMAKE_LIBRARY_PATH="$LIBRARY_PATH" \
-DHydrogen_AVOID_CUDA_AWARE_MPI=ON .. &&
fixCMakeCudaMPIPthreadBug && make -j $( nproc ) install
cd /opt && curl -L https://github.com/opencv/opencv/archive/3.4.3.tar.gz | tar -xz &&
cd opencv-* && mkdir build && cd $_ && export OPENCV_DIR=/opt/opencv &&
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX="$OPENCV_DIR" \
-DWITH_{JPEG,PNG,TIFF}=ON \
-DWITH_{CUDA,JASPER}=OFF \
-DBUILD_SHARED_LIBS=ON \
-DBUILD_JAVA=OFF \
-DBUILD_opencv_{calib3d,cuda,dnn,features2d,flann,java,{java,python}_bindings_generator,ml,python{2,3},stitching,ts,superres,video{,io,stab}}=OFF .. &&
make -j $( nproc ) install
cd /opt && curl -L https://github.com/LLNL/lbann/archive/v0.98.tar.gz | tar -xz &&
cd lbann-* && mkdir build && cd $_ &&
cmake -DCMAKE_INSTALL_PREFIX:PATH=/opt/lbann \
-DCMAKE_LIBRARY_PATH="$LIBRARY_PATH" \
-DLBANN_WITH_ALUMINUM:BOOL=True \
-DLBANN_WITH_CUDA=ON \
-DLBANN_USE_PROTOBUF_MODULE=ON .. &&
fixCMakeCudaMPIPthreadBug &&
# fix https://github.com/LLNL/lbann/issues/871
sed -i '0,/^#include /{ s|^#include |#include "lbann/utils/file_utils.hpp"\n#include | }' \
../src/proto/proto_common.cpp &&
make -j $( nproc ) install
%test
/opt/lbann/bin/lbann --version || true # don't let the whole container creation fail if it is built on a system without CUDA and the --notest argument was forgotten
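For reference, this is roughly how the image can be built and tested (a sketch assuming Singularity 3.x; the file names lbann.def and lbann.sif are placeholders):

```bash
# build the image from the definition file above
sudo singularity build lbann.sif lbann.def
# run the test command with the host's NVIDIA driver libraries bound into the container
singularity exec --nv lbann.sif lbann --version
```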
I hope it helps someone else save the several days it took me. In hindsight, the few build commands don't look very difficult, but there were so many problems, bifurcations, dead ends, mismatching versions, and even CMake bugs up until the end ...
Actually, both the CUDA 9 and CUDA 10 containers build, but when running `lbann --version`, I get for the CUDA 10 container:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
terminate called after throwing an instance of 'std::runtime_error'
what(): NVML error
Aborted
and for the CUDA 9 container:
Segmentation fault
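The warning suggests that the stub libnvidia-ml.so (installed under the CUDA stubs directory for link-time use only) is being resolved at run time instead of the driver's real library. A quick way to check which copies the loader picks inside the container (again assuming the image is called lbann.sif and is run with --nv):

```bash
# show which libnvidia-ml / libcuda the dynamic linker resolves for the lbann binary
singularity exec --nv lbann.sif ldd /opt/lbann/bin/lbann | grep -iE 'nvidia-ml|libcuda'
# list the library cache entries for libnvidia-ml inside the container
singularity exec --nv lbann.sif ldconfig -p | grep nvidia-ml
```

Note that the definition file above adds the stubs directory to /etc/ld.so.conf.d, so it remains on the loader's search path at run time.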
I'm trying to build the singularity image with the latest cloned lbann repo on commit 8ec2d00f09565f2f14852b058ec572703b1cf28b.
The singularity version is:
When running the command specified here:
I get the usage information because there does not seem to be any `--writable` option anymore:

When running the command:
It takes a while to set up the image but then it fails with:
So I added some debug output before the `. /spack/share/spack/setup-env.sh` line:

And I got:
For some reason, the shell is not bash but dash. Therefore, the bashism keyword `function` is not known and setting up the spack environment fails ...
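For illustration, the incompatibility boils down to dash not accepting the `function` keyword that bash allows:

```bash
# bash accepts both function definition styles
bash -c 'function greet { echo hello; }; greet'   # works
bash -c 'greet() { echo hello; }; greet'          # works
# dash (Ubuntu's /bin/sh) only accepts the POSIX style
dash -c 'greet() { echo hello; }; greet'          # works
dash -c 'function greet { echo hello; }; greet'   # fails: Syntax error
```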
Hacky workaround:
After applying the workaround and the arduous, hours-long task of building gcc 4.9.3 from scratch with spack, the singularity build fails again, deleting all the compiled work up to this point...
Trying further in a very unorthodox custom sandbox, building without specifying `^elemental@hydrogen-develop` fails with:

So I also removed the `^cmake@3.9.0`, which then tries to install the default CMake 3.13 version, which correctly finds `libuv` and builds, but then the Aluminum build fails because the .def does not install any CUDA header: