LLNL / lbann

Livermore Big Artificial Neural Network Toolkit
http://software.llnl.gov/lbann/

Singularity build error #855

Open mxmlnkn opened 5 years ago

mxmlnkn commented 5 years ago

I'm trying to build the singularity image with the latest cloned lbann repo on commit 8ec2d00f09565f2f14852b058ec572703b1cf28b.

The singularity version is:

singularity --version
    singularity version 3.0.2-119.ged79a2e1

When running the command specified here:

sudo singularity build --writable lbann.img lbann.def

I get the usage information because there does not seem to be any --writable option anymore:

Usage:
  singularity [global options...] build [local options...] <IMAGE PATH> <BUILD SPEC>
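
For what it's worth, the 3.x build command has a --sandbox flag that produces a writable directory tree, which may be the closest replacement for what --writable did in 2.x. A minimal sketch, assuming that mapping is what is wanted; build_flag_for is a hypothetical helper name, not part of Singularity:

```shell
# Hypothetical helper: choose a build flag from the output of `singularity --version`.
# Assumes 2.x prints a bare version ("2.x.y-...") and 3.x prefixes it with
# "singularity version ", as in the output quoted above.
build_flag_for() {
    # Drop the optional prefix, then keep only the major version number.
    major=$(printf '%s\n' "$1" | sed -e 's/^singularity version //' -e 's/\..*$//')
    case "$major" in
        2) echo "--writable" ;;  # 2.x build: writable image file
        *) echo "--sandbox"  ;;  # 3.x build: writable directory tree
    esac
}

build_flag_for "singularity version 3.0.2-119.ged79a2e1"
```

With --sandbox the build target is a directory rather than an image file, so the invocation would look like: sudo singularity build --sandbox lbann/ lbann.def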

When running the command:

sudo singularity build lbann.img lbann.def

It takes a while to set up the image but then it fails with:

+ cd spack
+ cd ..
+ export SPACK_ROOT=/spack
+ . /spack/share/spack/setup-env.sh
+ function spack {
/bin/sh: 42: /spack/share/spack/setup-env.sh: function: not found
FATAL:   post proc: exit status 127
FATAL:   While performing build: while running engine: exit status 255

So I added some debug output before the . /spack/share/spack/setup-env.sh line:

dpkg -S $0

And I got:

+ dpkg -S /bin/sh
diversion by dash from: /bin/sh
diversion by dash to: /bin/sh.distrib

For some reason, the shell is dash rather than bash. Therefore the bash-only keyword function is not recognized, and setting up the Spack environment fails.
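
To illustrate the difference, here is a minimal sketch of the two ways to define a shell function. The greet function is a made-up example. It does not make setup-env.sh itself portable; it only shows why dash stops at that line:

```shell
# Bash-only spelling; dash fails on it with "function: not found":
#   function greet { echo "hello $1"; }

# POSIX spelling, accepted by both dash and bash:
greet() {
    echo "hello $1"
}

greet spack
```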

Hacky workaround:

%post
    if test "$0" = "/bin/sh"; then
        echo "Elevating script to bash"
        sed 's|.*\x0||' /proc/$$/cmdline | /bin/bash -v
        exit $?
    fi

    echo "Running post section"

After applying the workaround and sitting through the arduous, hours-long task of building gcc 4.9.3 from scratch with Spack, the Singularity build fails again, deleting all the work compiled up to that point:

    spack -k setup lbann@local %gcc@4.9.3 build_type=Release  cflags="-O3 -g -march=ivybridge -mtune=ivybridge" cxxflags="-O3 -g -march=ivybridge -mtune=ivybridge" fflags="-O3 -g -march=ivybridge -mtune=ivybridge" ^elemental@hydrogen-develop ^openmpi@2.0.2 ^cmake@3.9.0
==> Warning: You asked for --insecure. Will NOT check SSL certificates.
==> Error: lbann does not depend on elemental
FATAL:   post proc: exit status 1
FATAL:   While performing build: while running engine: exit status 255

Trying further in a very unorthodox custom sandbox without specifying ^elemental@hydrogen-develop fails with:

==> Installing cmake
==> Searching for binary cache of cmake
==> Warning: No Spack mirrors are currently configured
==> No binary for cmake found: installing from source
==> Fetching https://github.com/Kitware/CMake/releases/download/v3.9.0/cmake-3.9.0.tar.gz
##################################################################################################################################### 100.0%
==> Staging archive: /spack/var/spack/stage/cmake-3.9.0-42yae6q6747ueunhbadaobkxw5rfffr2/cmake-3.9.0.tar.gz
==> Created stage in /spack/var/spack/stage/cmake-3.9.0-42yae6q6747ueunhbadaobkxw5rfffr2
==> Applied patch /spack/var/spack/repos/builtin/packages/cmake/nag-response-files.patch
==> Building cmake [Package]
==> Executing phase: 'bootstrap'
==> Error: ProcessError: Command exited with status 11:
    './bootstrap' '--prefix=/spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/cmake-3.9.0-42yae6q6747ueunhbadaobkxw5rfffr2' '--parallel=4' '--system-libs' '--no-system-jsoncpp' '--no-qt-gui' '--' '-DCMAKE_USE_OPENSSL=ON'

2 errors found in build log:
     307    -- Found LibRHash: /spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/rhash-1.3.5-7xgdtmsc2tcczlokpxkjhc7iysdmrcwn/lib/librhash.a
     308    -- Found ZLIB: /spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/zlib-1.2.11-7j5ttquyz3cjss5d5rbxkpk7rjxuzkqu/lib/libz.so (found
            version "1.2.11")
     309    -- Found CURL: /spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/curl-7.60.0-gr7vsc6hw7dczh7bnfuxkh4rcnenl6zu/lib/libcurl.so (fou
            nd version "7.60.0")
     310    -- Found EXPAT: /spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/expat-2.2.5-ouqt3ryixxk5dfg2abbuud52llw65xmx/lib/libexpat.so (f
            ound version "2.2.5")
     311    -- Found LibArchive: /spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/libarchive-3.3.2-4rholf423y7hmqxf2ylw4ignjf4wlbgn/lib/liba
            rchive.so (found suitable version "3.3.2", minimum required is "3.0.0")
     312    -- Could NOT find LibUV: Found unsuitable version "", but required is at least "1.0.0" (found /spack/opt/spack/linux-debian9-x8
            6_64/gcc-6.3.0/libuv-1.25.0-p33eoj36taxbc3ufznxwmvq53zue7iii/lib/libuv.so)
  >> 313    CMake Error at CMakeLists.txt:552 (message):
     314      CMAKE_USE_SYSTEM_LIBUV is ON but a libuv is not found!
     315    Call Stack (most recent call first):
     316      CMakeLists.txt:686 (CMAKE_BUILD_UTILITIES)
     317
     318
     319    -- Configuring incomplete, errors occurred!
     320    See also "/tmp/root/spack-stage/spack-stage-bdKwDF/cmake-3.9.0/CMakeFiles/CMakeOutput.log".
     321    See also "/tmp/root/spack-stage/spack-stage-bdKwDF/cmake-3.9.0/CMakeFiles/CMakeError.log".
     322    ---------------------------------------------
  >> 323    Error when bootstrapping CMake:
     324    Problem while running initial CMake
     325    ---------------------------------------------

See build log for details:
  /spack/var/spack/stage/cmake-3.9.0-42yae6q6747ueunhbadaobkxw5rfffr2/cmake-3.9.0/spack-build.out

So I also removed ^cmake@3.9.0, which makes Spack install its default CMake 3.13 instead. That version correctly finds libuv and builds, but then the Aluminum build fails because the .def does not install any CUDA headers:

==> Installing aluminum
==> Searching for binary cache of aluminum
==> Warning: No Spack mirrors are currently configured
==> No binary for aluminum found: installing from source
==> Cloning git repository: https://github.com/LLNL/Aluminum.git on branch master
==> No checksum needed when fetching with git
==> Already staged aluminum-master-ua5qdxsd2bv2stwbbshfzrx7ltzhlk2f in /spack/var/spack/stage/aluminum-master-ua5qdxsd2bv2stwbbshfzrx7ltzhlk2f
==> No patches needed for aluminum
==> Building aluminum [CMakePackage]
==> Executing phase: 'cmake'
==> Executing phase: 'build'
==> Error: ProcessError: Command exited with status 1:
    'ninja' '-j4'

1 error found in build log:
     32    [5/39] Linking CXX shared library src/libAl.so
     33    [6/39] Building CXX object benchmark/CMakeFiles/benchmark_reductions.exe.dir/benchmark_reductions.cpp.o
     34    [7/39] Linking CXX executable benchmark/benchmark_reductions.exe
     35    [8/39] Building CXX object benchmark/CMakeFiles/benchmark_events.exe.dir/benchmark_events.cpp.o
     36    FAILED: benchmark/CMakeFiles/benchmark_events.exe.dir/benchmark_events.cpp.o
     37    /spack/lib/spack/env/gcc/g++   -I/spack/var/spack/stage/aluminum-master-ua5qdxsd2bv2stwbbshfzrx7ltzhlk2f/Aluminum/test -I/spack/
           var/spack/stage/aluminum-master-ua5qdxsd2bv2stwbbshfzrx7ltzhlk2f/Aluminum/src -I. -isystem /spack/opt/spack/linux-debian9-x86_64
           /gcc-6.3.0/openmpi-2.0.2-h2jaiybovrbegf4d5q3nf26qqwl44x3r/include -isystem /spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/hwloc
           -1.11.11-g5vd3be7dirumozxpwtejpor7vcddaxt/include -Wall -Wextra -pedantic -g -O2 -g -DNDEBUG -fPIE   -fexceptions -pthread -fope
           nmp -std=gnu++11 -MD -MT benchmark/CMakeFiles/benchmark_events.exe.dir/benchmark_events.cpp.o -MF benchmark/CMakeFiles/benchmark
           _events.exe.dir/benchmark_events.cpp.o.d -o benchmark/CMakeFiles/benchmark_events.exe.dir/benchmark_events.cpp.o -c /spack/var/s
           pack/stage/aluminum-master-ua5qdxsd2bv2stwbbshfzrx7ltzhlk2f/Aluminum/benchmark/benchmark_events.cpp
  >> 38    /spack/var/spack/stage/aluminum-master-ua5qdxsd2bv2stwbbshfzrx7ltzhlk2f/Aluminum/benchmark/benchmark_events.cpp:4:18: fatal erro
           r: cuda.h: No such file or directory
     39     #include <cuda.h>
     40                      ^
     41    compilation terminated.
     42    [9/39] Building CXX object benchmark/CMakeFiles/benchmark_pingpong.exe.dir/benchmark_pingpong.cpp.o
     43    /spack/var/spack/stage/aluminum-master-ua5qdxsd2bv2stwbbshfzrx7ltzhlk2f/Aluminum/benchmark/benchmark_pingpong.cpp: In function '
           int main(int, char**)':
     44    /spack/var/spack/stage/aluminum-master-ua5qdxsd2bv2stwbbshfzrx7ltzhlk2f/Aluminum/benchmark/benchmark_pingpong.cpp:128:14: warnin
           g: unused parameter 'argc' [-Wunused-parameter]

See build log for details:
  /spack/var/spack/stage/aluminum-master-ua5qdxsd2bv2stwbbshfzrx7ltzhlk2f/Aluminum/spack-build.out

mxmlnkn commented 5 years ago

Ok, I got a working singularity definition file now. At least it compiles. It builds with base image 10.0-cudnn7-devel-ubuntu18.04 as well as 9.0-cudnn7-devel-ubuntu16.04 from the CUDA docker hub but the CUDA 9 build currently segfaults for some reason.

Bootstrap: docker
From: nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04

%environment
    if test -d /opt/Aluminum; then export ALUMINUM_DIR=/opt/Aluminum; fi
    export CEREAL_DIR=/opt/cereal
    export CNPY_DIR=/opt/cnpy
    export HWLOC_DIR=/opt/hwloc
    export HYDROGEN_DIR=/opt/Elemental
    export LBANN_DIR=/opt/lbann
    export OPENCV_DIR=/opt/opencv
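    # resolve the versioned directory in a command substitution; an export inside ( ... ) would be lost with the subshell
    export CUB_DIR=$( cd /opt/cub-* && pwd )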
    export PATH=$LBANN_DIR/bin:$PATH

%post
    if test "$0" = "/bin/sh"; then
        echo "Elevating script to bash"
        sed 's|.*\x0||' /proc/$$/cmdline | /bin/bash -ve
        exit $?
    fi

    umask 022

    fixCMakeCudaMPIPthreadBug()
    {
        # fixes https://gitlab.kitware.com/cmake/cmake/issues/17929#note_514823
        find . -type f | xargs -I{} bash -c '
            if grep -q -E "(nvcc|CUDA_FLAG).* -pthread" "$0"; then
                sed -i -r "/nvcc.* -pthread/{s: -pthread( |$): :g}" "$0";
            fi' {}
    }

    export PATH=/usr/local/cuda/bin:$PATH
    export LIBRARY_PATH="$( which nvcc | sed 's|/bin/nvcc||' )/lib64/stubs"
    echo "$LIBRARY_PATH" > /etc/ld.so.conf.d/cuda-10-stub.conf && ldconfig

    apt-get -y update &&
    apt-get -y install gcc g++ gfortran git curl python tar make zlib*-dev \
                       libopenblas-dev libopenmpi-dev libprotobuf-dev protobuf-compiler liblapack-dev

    unzip(){ python -c "from zipfile import PyZipFile; PyZipFile( '''$1''' ).extractall()"; }

    # Need:
    #  - CMake >3.12.2 because of https://github.com/clab/dynet/issues/1457
    #  - CMake >3.13.0 because of https://gitlab.kitware.com/cmake/cmake/issues/17929
    #  - CMake >3.??.? because of https://gitlab.kitware.com/cmake/cmake/issues/18897
    cd /opt &&
    curl -L https://github.com/Kitware/CMake/releases/download/v3.14.0-rc1/cmake-3.14.0-rc1-Linux-x86_64.tar.gz |
        tar -xz && mv cmake-* /opt/cmake && ln -s /opt/cmake/bin/cmake /usr/bin/cmake

    cd /opt &&
    curl -L https://download.open-mpi.org/release/hwloc/v2.0/hwloc-2.0.3.tar.gz | tar -xz &&
    cd hwloc-* && export HWLOC_DIR=/opt/hwloc &&
    ./configure --prefix="$HWLOC_DIR" && make -j $( nproc ) install

    cd /opt && curl -L https://github.com/NVlabs/cub/archive/v1.8.0.tar.gz | tar -xz &&
    cd cub-* && export CUB_DIR=$( pwd )

    cd /opt && curl -L https://github.com/USCiLab/cereal/archive/v1.2.2.tar.gz | tar -xz &&
    cd cereal-* && mkdir build && cd $_ && export CEREAL_DIR=/opt/cereal &&
    cmake -DCMAKE_INSTALL_PREFIX=/opt/cereal -DJUST_INSTALL_CEREAL=ON .. && make -j $( nproc ) install

    # commit 4e8810b1a8637695171ed346ce68f6984e585ef4 to be exact but has no release and only 1 commit in last year
    cd /opt && curl -L https://github.com/rogersce/cnpy/archive/master.zip -o master.zip && unzip $_ && rm $_ &&
    cd cnpy-* && mkdir build && cd $_ && export CNPY_DIR=/opt/cnpy &&
    cmake -DCMAKE_INSTALL_PREFIX="$CNPY_DIR" .. && make -j $( nproc ) install

    # Don't manually install openmpi if the package manager version is recent enough.
    # E.g., Ubuntu 18.04 ships with recent enough libopenmpi-dev 3.0.0 but Ubuntu 16.04 only ships openmpi 1.10
    if test "$( dpkg-query --showformat='${Version}' --show libopenmpi-dev | sed 's|\..*||' )" -lt 3; then
        apt-get -y purge libopenmpi-dev
        cd /opt &&
        curl -L https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.3.tar.bz2 |
            tar -xj && cd openmpi-* && ./configure && make -j $( nproc ) install && ldconfig
    fi

    if test "$( dpkg-query --showformat='${Version}' --show libprotobuf-dev | sed 's|\..*||' )" -lt 3; then
        apt-get -y purge libprotobuf-dev protobuf-compiler
        cd /opt &&
        curl -L https://github.com/protocolbuffers/protobuf/releases/download/v3.6.1/protobuf-cpp-3.6.1.tar.gz |
            tar -xz && cd protobuf-* && ./configure && make -j $( nproc ) install && ldconfig
    fi

    # allow Aluminum build to fail (requires at least CUDA 9 because it uses CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS)
    cd /opt && curl -L https://github.com/LLNL/Aluminum/archive/v0.2.tar.gz | tar -xz &&
    cd Aluminum-* && mkdir build && cd $_ && export ALUMINUM_DIR=/opt/Aluminum &&
    cmake -DCMAKE_INSTALL_PREFIX="$ALUMINUM_DIR" \
          -DCMAKE_LIBRARY_PATH="$LIBRARY_PATH"   \
          -DALUMINUM_ENABLE_CUDA=ON              \
          -DALUMINUM_ENABLE_MPI_CUDA=ON          \
          -DALUMINUM_ENABLE_NCCL=ON .. &&
    fixCMakeCudaMPIPthreadBug && make -j $( nproc ) install || true

    # Needs at least CUDA 7.5 because it uses cuda_fp16.h even though Hydrogen_ENABLE_HALF=OFF Oo
    cd /opt && curl -L https://github.com/LLNL/Elemental/archive/v1.1.0.tar.gz | tar -xz &&
    cd Elemental-* && mkdir build && cd $_ && export HYDROGEN_DIR=/opt/Elemental &&
    cmake -DHydrogen_USE_64BIT_INTS=ON           \
          -DHydrogen_ENABLE_OPENMP=ON            \
          -DBUILD_SHARED_LIBS=ON                 \
          -DHydrogen_ENABLE_ALUMINUM=ON          \
          -DHydrogen_ENABLE_CUDA=ON              \
          -DCMAKE_INSTALL_PREFIX="$HYDROGEN_DIR" \
          -DCMAKE_LIBRARY_PATH="$LIBRARY_PATH"   \
          -DHydrogen_AVOID_CUDA_AWARE_MPI=ON .. &&
    fixCMakeCudaMPIPthreadBug && make -j $( nproc ) install

    cd /opt && curl -L https://github.com/opencv/opencv/archive/3.4.3.tar.gz | tar -xz &&
    cd opencv-* && mkdir build && cd $_ && export OPENCV_DIR=/opt/opencv &&
    cmake -DCMAKE_BUILD_TYPE=Release           \
          -DCMAKE_INSTALL_PREFIX="$OPENCV_DIR" \
          -DWITH_{JPEG,PNG,TIFF}=ON            \
          -DWITH_{CUDA,JASPER}=OFF             \
          -DBUILD_SHARED_LIBS=ON               \
          -DBUILD_JAVA=OFF                     \
          -DBUILD_opencv_{calib3d,cuda,dnn,features2d,flann,java,{java,python}_bindings_generator,ml,python{2,3},stitching,ts,superres,video{,io,stab}}=OFF .. &&
    make -j $( nproc ) install

    cd /opt && curl -L https://github.com/LLNL/lbann/archive/v0.98.tar.gz | tar -xz &&
    cd lbann-* && mkdir build && cd $_ &&
    cmake -DCMAKE_INSTALL_PREFIX:PATH=/opt/lbann \
          -DCMAKE_LIBRARY_PATH="$LIBRARY_PATH"   \
          -DLBANN_WITH_ALUMINUM:BOOL=True        \
          -DLBANN_WITH_CUDA=ON                   \
          -DLBANN_USE_PROTOBUF_MODULE=ON  .. &&
    fixCMakeCudaMPIPthreadBug &&
    # fix https://github.com/LLNL/lbann/issues/871
    sed -i '0,/^#include /{ s|^#include |#include "lbann/utils/file_utils.hpp"\n#include | }' \
        ../src/proto/proto_common.cpp &&
    make -j $( nproc ) install

%test
    /opt/lbann/bin/lbann --version || true # don't let the whole container creation fail if built on system without CUDA and --notest argument was forgotten
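
One caveat about the two dpkg-query checks in %post: sed 's|\..*||' keeps everything before the first dot, so a Debian version string with an epoch prefix (for example "1:2.6.1-1") would yield "1:2" and break the numeric -lt comparison. A minimal sketch of a more tolerant extraction; deb_major is a hypothetical helper name and the version strings are illustrative inputs, not taken from a real package:

```shell
# Hypothetical helper: major component of a Debian version string,
# ignoring an optional "epoch:" prefix.
deb_major() {
    # First strip a leading "<digits>:", then cut at the first non-digit.
    printf '%s\n' "$1" | sed -e 's/^[0-9]*://' -e 's/[^0-9].*$//'
}

deb_major "3.0.0-1"      # illustrative input without epoch
deb_major "1:2.6.1-1"    # illustrative input with epoch "1:"
```

In the definition file it could replace the sed call, for example: test "$( deb_major "$( dpkg-query --showformat='${Version}' --show libopenmpi-dev )" )" -lt 3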

I hope it helps someone else save the several days it took me. In hindsight the few build commands don't look very difficult. But there were so many problems and bifurcations and dead ends and mismatching versions and even CMake bugs up until the end ...

Actually, both CUDA 9 and CUDA 10 build, but when running lbann --version, I get for the CUDA 10 container:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
terminate called after throwing an instance of 'std::runtime_error'
  what():  NVML error

Aborted

and for the CUDA 9 container:

Segmentation fault
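
The NVML warning says a stub libnvidia-ml.so was loaded. The %post section registers the CUDA stubs directory with ldconfig, so that may be where the stub comes from, but this is a guess that I have not verified. A minimal sketch for checking it inside the container; is_stub_library is a hypothetical helper name and the paths in the example calls are illustrative:

```shell
# Hypothetical helper: succeed if a library path lies inside a CUDA "stubs" directory.
is_stub_library() {
    case "$1" in
        */stubs/*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Inside the container, list what the loader cache knows about NVML:
#   ldconfig -p | grep libnvidia-ml
# then pass the path printed after "=>" to the helper, for example:
if is_stub_library "/usr/local/cuda/lib64/stubs/libnvidia-ml.so"; then
    echo "stub library: only meant for linking at build time"
fi
```

If the cache does list the stub, one thing to try is removing /etc/ld.so.conf.d/cuda-10-stub.conf and rerunning ldconfig at the end of %post, so that the stubs are only used while building.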