Open mxmlnkn opened 5 years ago
This looks like the hang described in #879. I don't understand why El::Matrix::do_get_device_()
would have a problem since it is a one-line function that just returns an enum (https://github.com/LLNL/Elemental/blob/3b784761bfb587f52ea1d6f882cb6905083bf795/include/El/core/Matrix/impl_cpu.hpp#L230). This will require some digging.
Thanks for the answer. I updated LBANN to commit 10b84da933a7b62e63120c6f9067df17cd9ba5f3 and Elemental to 6d4bc32515087ed7c8c1dd2687dd2cc771c139d3 and now it works ... sometimes. Let's say 20% of the time, LBANN hangs after an epoch ended. It looks to me like this is a race condition.
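A rough way to put a number on "sometimes" is to repeat the run under a time limit and count how often the limit is hit. The sketch below assumes GNU `timeout`, which exits with status 124 when the limit expires. `count_hangs` is a hypothetical helper, not part of LBANN, and the limit must be chosen well above the normal runtime.

```shell
# Hypothetical helper (not part of LBANN): run a command several times and
# count how many runs exceed a time limit, i.e. presumably hang.
# Usage: count_hangs <runs> <timeout-seconds> <command> [args...]
count_hangs() {
    local runs=$1
    local limit=$2
    local hangs=0
    local i=0
    local status
    shift 2
    while test "$i" -lt "$runs"; do
        status=0
        # GNU timeout exits with status 124 when the time limit was reached.
        timeout "${limit}s" "$@" >/dev/null 2>&1 || status=$?
        if test "$status" -eq 124; then
            hangs=$(( hangs + 1 ))
        fi
        i=$(( i + 1 ))
    done
    printf '%s\n' "$hangs"
}
```

For example, `count_hangs 10 600 ./run_lbann.sh` would print how many of ten runs were still going after ten minutes, where `run_lbann.sh` stands for whatever wrapper script starts the training.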
I tried the same thing with perf, but seeing that the CPU is not busy, I already suspected that it wouldn't work. I only get:
sudo timeout 20s perf record -a --call-graph dwarf -p $( pgrep lbann2 | tail -1 )
Warning:
PID/TID switch overriding SYSTEM
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.750 MB perf.data ]
/usr/bin/perf report --no-children -T -i perf.data | c++filt
Error:
The perf.data file has no samples!
# To display the perf.data header info, please use --header/--header-only options.
#
Doing a backtrace from outside the container with `sudo gdb "$lbannBinPath" "$( pidof "$lbannBinPath" )"` and then `bt`, I get:
#0 0x00007f31ed745289 in syscall () from target:/lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f31eda68b48 in std::__atomic_futex_unsigned_base::_M_futex_wait_until(unsigned int*, unsigned int, bool, std::chrono::duration<long, std::ratio<1l, 1l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from target:/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007f31f59cb799 in std::__atomic_futex_unsigned<2147483648u>::_M_load_and_test_until (__ns=..., __s=...,
__has_timeout=<optimized out>, __mo=<optimized out>, __equal=<optimized out>, __operand=<optimized out>, __assumed=<optimized out>,
this=<optimized out>) at /usr/include/c++/8/bits/atomic_base.h:542
#3 std::__atomic_futex_unsigned<2147483648u>::_M_load_and_test (__mo=<optimized out>, __equal=<optimized out>, __operand=<optimized out>,
__assumed=<optimized out>, this=<optimized out>) at /usr/include/c++/8/bits/atomic_futex.h:122
#4 std::__atomic_futex_unsigned<2147483648u>::_M_load_when_equal (__mo=std::memory_order_acquire, __val=1, this=0x55e5fe56e8d0)
at /usr/include/c++/8/bits/atomic_futex.h:162
#5 std::__future_base::_State_baseV2::wait (this=0x55e5fe56e8c0) at /usr/include/c++/8/future:337
#6 std::__basic_future<void>::_M_get_result (this=0x7ffc834390f0) at /usr/include/c++/8/future:717
#7 std::future<void>::get (this=0x7ffc834390f0) at /usr/include/c++/8/future:882
#8 lbann::generic_input_layer::fp_compute (this=0x55e5fe3a8710)
at /opt/lbann/src/lbann-10b84da933a7b62e63120c6f9067df17cd9ba5f3/include/lbann/layers/io/input/generic_input_layer.hpp:268
#9 0x00007f31f5954a37 in lbann::Layer::forward_prop (this=0x55e5fe3a8710)
at /opt/lbann/src/lbann-10b84da933a7b62e63120c6f9067df17cd9ba5f3/src/layers/layer.cpp:263
#10 0x00007f31f596a49e in lbann::model::forward_prop (this=0x55e5fe4bd4b0, mode=lbann::execution_mode::testing)
at /opt/lbann/src/lbann-10b84da933a7b62e63120c6f9067df17cd9ba5f3/src/models/model.cpp:1114
#11 0x00007f31f596aef6 in lbann::model::evaluate_mini_batch(lbann::execution_mode) ()
at /opt/lbann/src/lbann-10b84da933a7b62e63120c6f9067df17cd9ba5f3/src/models/model.cpp:1035
#12 0x00007f31f596f77f in lbann::model::evaluate(lbann::execution_mode, int) ()
at /opt/lbann/src/lbann-10b84da933a7b62e63120c6f9067df17cd9ba5f3/src/models/model.cpp:957
#13 0x000055e5fd9b19be in main () at /usr/include/c++/8/bits/unique_ptr.h:342
#14 0x00007f31ed67509b in __libc_start_main () from target:/lib/x86_64-linux-gnu/libc.so.6
#15 0x000055e5fd9b22ba in _start () at /usr/include/c++/8/ext/atomicity.h:69
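The trace above only covers the thread that is blocked in `std::future<void>::get` (frames #5 to #8). What the other threads are doing at that moment, in particular whichever one is supposed to fulfil the future, would be at least as interesting. A minimal sketch for dumping all threads non-interactively follows. `backtrace_all_threads` and the `DRY_RUN` switch are hypothetical names; the gdb options `-p`, `-batch`, `-ex` and the command `thread apply all bt` are standard gdb.

```shell
# Hypothetical helper: print a backtrace of every thread of a running process.
# With DRY_RUN set, it only prints the gdb command line instead of running it.
# Usage: backtrace_all_threads <pid>
backtrace_all_threads() {
    local pid=$1
    if test -n "${DRY_RUN:-}"; then
        printf '%s\n' "gdb -p $pid -batch -ex 'thread apply all bt'"
        return 0
    fi
    # -batch makes gdb exit (and detach) after running the given command.
    gdb -p "$pid" -batch -ex 'thread apply all bt'
}
```

Attaching usually needs root, so from outside the container the call would be something like `sudo gdb -p "$( pidof "$lbannBinPath" )" -batch -ex 'thread apply all bt'`.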
Should this be a separate issue, or should I just change the issue title?
I also ran perf on the Power machine, still with LBANN 0.98.1, and I get:
#
# Total Lost Samples: 0
#
# Samples: 49K of event 'cycles:uppp'
# Event count (approx.): 46434771680
#
# Overhead Command Shared Object Symbol
# ........ .............. .................. ....................................................
#
49.86% lbann2 libAl.so [.] Al::internal::ProgressEngine::engine
33.03% lbann2 liblbann.so [.] lbann::lbann_image_preprocessor::unit_scale
17.09% lbann2 libHydrogen.so [.] El::Matrix<float, (El::Device)0>::do_get_device_
0.00% lbann2 libucs.so.0.0.0 [.] _init
0.00% cuda-EvtHandlr libcuda.so.396.37 [.] 0x00000000002324f4
0.00% cuda-EvtHandlr libcuda.so.396.37 [.] 0x00000000000b5c34
0.00% cuda-EvtHandlr libcuda.so.396.37 [.] 0x00000000002195b4
0.00% cuda-EvtHandlr libcuda.so.396.37 [.] 0x0000000000232fd0
0.00% lbann2 libmlx5.so.1.0.0 [.] 0x000000000002c194
0.00% lbann2 libucs.so.0.0.0 [.] __ucs_twheel_sweep
0.00% lbann2 libuct.so.0.0.0 [.] 0x000000000005a5d4
0.00% lbann2 libuct.so.0.0.0 [.] 0x00000000000644fc
0.00% lbann2 libpthread-2.17.so [.] pthread_spin_lock
0.00% lbann2 libmlx5.so.1.0.0 [.] 0x000000000002c080
0.00% cuda-EvtHandlr libc-2.17.so [.] __libc_enable_asynccancel
0.00% cuda-EvtHandlr libcuda.so.396.37 [.] 0x000000000010af50
0.00% lbann2 libuct.so.0.0.0 [.] _init
0.00% lbann2 libucs.so.0.0.0 [.] 0x0000000000037e74
0.00% cuda-EvtHandlr libcuda.so.396.37 [.] 0x00000000000b5c30
0.00% lbann2 libc-2.17.so [.] __libc_disable_asynccancel
0.00% cuda-EvtHandlr libc-2.17.so [.] __libc_disable_asynccancel
0.00% lbann2 libc-2.17.so [.] epoll_wait
0.00% cuda-EvtHandlr libc-2.17.so [.] __GI___libc_poll
0.00% lbann2 [kernel.kallsyms] [k] doorbell_exception
#
# (Tip: To count events in every 1000 msec: perf stat -I 1000)
#
# PID TID
I guess it is indeed the same problem as the one solved by #879, because do_get_device_ also shows up here (albeit with a much smaller share). I will try out the same commits that I already tried on my notebook to confirm.
When trying out LBANN with a simple MNIST example (one hidden fc layer with 20 neurons, trained for one epoch), LBANN seems to hang, i.e., it takes at least several minutes even though TensorFlow would have been much faster. I still have problems running perf with the version I built for a Power architecture and graphics cards, but I tried a kind of minimal LBANN by building it without Aluminum and without CUDA for my notebook, and there it also hangs for at least 35 minutes at this message:
LBANN call:
System setup (singularity container on a system with no GPU, linux kernel 4.15.0 and i7-4600 CPU):
Singularity File:
```shell
Bootstrap: docker
From: debian:buster-slim

%environment
    # elevating this to /bin/bash is not possible. Therefore should on ubuntu also be runnable in /bin/dash -.-
    PREFIX=/opt/lbann

    exportPath() {
        if test -d "$2"; then
            export "$1"="$2"
            printf "\e[37mExported existing path '$2' into environment variable '$1'\e[0m\n"
        else
            printf "\e[31m[Warning] '$2' is not a directory. Won't export it\e[0m\n"
        fi
    }

    add2path() {
        local targetVar=PATH
        if test "$#" -gt 1; then
            targetVar=$1
            shift 1
        fi
        local targetContent=$( eval echo \$$targetVar )
        local oldContent=$targetContent
        while test "$#" -gt 0; do
            if test -d "$1"; then
                case ":$targetContent:" in
                    *:"$1":*)
                        printf "\e[37m[Info] Path '$1' already exists in \$$targetVar. Won't add it.\e[0m\n"
                        ;;
                    *)
                        targetContent=$1:$targetContent
                        ;;
                esac
            else
                printf "\e[33m[Warning] '$1' is not a directory. Won't append to \$$targetVar variable.\e[0m\n"
            fi
            shift 1
        done
        if test "${#targetContent}" -gt "${#oldContent}"; then
            export $targetVar=$targetContent
            printf "\e[37mExporting new \$$targetVar: $targetContent\e[0m\n"
        elif test "${#targetContent}" -lt "${#oldContent}"; then
            printf "\e[31m[Error] After adding paths, the variable is erroneously shorter (${#targetContent}) than before (${#oldContent})"'!'"\e[0m\n"
        fi
    }

    findPath() {
        local fileName=$1
        local searchPath=$2
        if test "$( find "$searchPath" -xtype f -name "$fileName" | head -2 | wc -l )" -gt 1; then
            printf "\e[33m[Warning] Found more than one matching sub path in the searchPath '$searchPath'.\e[0m\n" 1>&2
            printf "\e[37mMatches:\n" 1>&2
            find "$searchPath" -xtype f -name "$fileName" 1>&2
            printf "\e[0m\n" 1>&2
        fi
        local matchingPath=$( find "$searchPath" -xtype f -name "$fileName" | head -1 )
        printf '%s' "${matchingPath%$fileName}"
    }

    exportPath ALUMINUM_DIR "$PREFIX/Aluminum"
    exportPath CEREAL_DIR "$PREFIX/cereal"
    exportPath CNPY_DIR "$PREFIX/cnpy"
    exportPath CUB_DIR "$PREFIX"/cub-*/
    exportPath HWLOC_DIR "$PREFIX/hwloc"
    exportPath HYDROGEN_DIR "$PREFIX/Elemental"
    exportPath OPENCV_DIR "$PREFIX/opencv"
    exportPath PROTOBUF_ROOT "$PREFIX/protobuf"
    if test -d "$PROTOBUF_ROOT"; then
        add2path 'PATH' "$PROTOBUF_ROOT/bin"
        add2path 'CMAKE_PREFIX_PATH' "$PROTOBUF_ROOT"
        PROTOBUF_LIB=$( find "$PROTOBUF_ROOT" -mindepth 1 -maxdepth 1 -type d -name 'lib*' | head -1 ) &&
        add2path 'LIBRARY_PATH' "$PROTOBUF_LIB"
        add2path 'LD_LIBRARY_PATH' "$PROTOBUF_LIB"
    fi
    exportPath LBANN_DIR "$PREFIX/lbann"
    if test -d "$LBANN_DIR"; then
        add2path 'PATH' "$LBANN_DIR/bin"
        add2path 'CMAKE_PREFIX_PATH' "$LBANN_DIR"
        add2path 'LIBRARY_PATH' "$LBANN_DIR/lib"
        add2path 'LD_LIBRARY_PATH' "$LBANN_DIR/lib"
    fi
    add2path 'CMAKE_PREFIX_PATH' "$OPENCV_DIR" "$HYDROGEN_DIR" "$ALUMINUM_DIR"
    add2path 'PATH' "$PREFIX/cmake/bin"

%post
    if test "$0" = "/bin/sh"; then
        echo "Elevating script to bash"
        sed -n -z '$p' "/proc/$$/cmdline" | sed 's/\x00/\n/g' | /bin/bash -ve
        exit $?
    fi

    apt-get -y update && apt-get -y install --no-install-recommends \
        findutils sed grep coreutils curl ca-certificates tar dpkg wget cmake \
        gcc g++ gfortran python make zlib*-dev libopenblas-dev libopenmpi-dev libprotobuf-dev protobuf-compiler liblapack-dev

    PREFIX="/opt/lbann"
    mkdir -p -- "$PREFIX/src"

    version-ge() { test "$1" = "$( printf '%s\n%s' "$1" "$2" | sort -V | tail -n 1 )"; }
    commandExists() { command -v "$@" &>/dev/null; }
    unzip(){ python -c "from zipfile import PyZipFile; PyZipFile( '''$1''' ).extractall()"; }

    remoteExtract() {
        local compression=
        local url="${@: -1}"
        local ext="$( printf '%s' "$url" | sed 's/\?.*//; s/.*\.//;' )"
        local iTry=5
        for (( ; iTry > 0; iTry )); do
            case "$ext" in
                tgz|gz) compression=--gzip ;;
                xz) compression=--xz ;;
                tbz2|bz2) compression=--bzip2 ;;
            esac
            (
                if command -v wget &>/dev/null; then
                    wget -O- \
                        --retry-connrefused \
                        --timeout=5 \
                        --tries=5 \
                        --waitretry=5 \
                        --read-timeout=20 \
                        "$@" | tar -x $compression
                fi ||
                if command -v curl &>/dev/null; then
                    curl -L \
                        --connect-timeout 5 \
                        --max-time 20 \
                        --retry 5 \
                        --retry-delay 5 \
                        --retry-max-time 60 \
                        "$@" | tar -x $compression
                fi || false
            ) && break
        done
    }

    setupCub() {
        cd -- "$PREFIX" &&
        if ! test -d cub-*; then
            remoteExtract 'https://github.com/NVlabs/cub/archive/v1.8.0.tar.gz'
        fi &&
        cd cub-* &&
        export CUB_DIR=$( pwd )
    }

    setupCereal() {
        export CEREAL_DIR="$PREFIX"/cereal &&
        if ! test -d "$CEREAL_DIR"; then
            cd -- "$PREFIX/src" &&
            remoteExtract 'https://github.com/USCiLab/cereal/archive/v1.2.2.tar.gz' &&
            cd cereal-* &&
            mkdir -p build && cd -- "$_" &&
            cmake -Wno-dev -DCMAKE_INSTALL_PREFIX="$PREFIX"/cereal -DJUST_INSTALL_CEREAL=ON .. &&
            make -j "$( nproc )" install
        fi
    }

    setupCnpy() {
        # commit 4e8810b1a8637695171ed346ce68f6984e585ef4 to be exact but has no release and only 1 commit in last year
        export CNPY_DIR="$PREFIX"/cnpy &&
        if ! test -d "$CNPY_DIR"; then
            cd -- "$PREFIX/src" &&
            curl -L 'https://github.com/rogersce/cnpy/archive/master.zip' -o master.zip &&
            unzip "$_" && command rm -f "$_" &&
            cd cnpy-* &&
            mkdir -p build && cd -- "$_" &&
            cmake -Wno-dev -DCMAKE_INSTALL_PREFIX="$CNPY_DIR" .. &&
            make -j "$( nproc )" install
        fi
    }

    buildAluminum() {
        # allow Aluminum build to fail (requires at least CUDA 9 because it uses CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS)
        cd -- "$PREFIX/src" &&
        remoteExtract 'https://github.com/LLNL/Aluminum/archive/v0.2.tar.gz' &&
        cd Aluminum-* &&
        mkdir -p build && cd "$_" &&
        cmake -Wno-dev \
            -DCMAKE_BUILD_TYPE=Release \
            -DCMAKE_INSTALL_PREFIX="$ALUMINUM_DIR" \
            -DCMAKE_LIBRARY_PATH="$LIBRARY_PATH" \
            .. &&
        make -j "$( nproc )" VERBOSE=1 install || true
    }

    buildHydrogen() {
        # Needs at least CUDA 7.5 because it uses cuda_fp16.h even though Hydrogen_ENABLE_HALF=OFF Oo
        cd -- "$PREFIX/src" &&
        remoteExtract 'https://github.com/LLNL/Elemental/archive/v1.1.0.tar.gz' &&
        cd Elemental-* &&
        mkdir -p build && cd -- "$_" &&
        cmake -Wno-dev \
            -DCMAKE_BUILD_TYPE=Release \
            -DCMAKE_INSTALL_PREFIX="$HYDROGEN_DIR" \
            -DCMAKE_LIBRARY_PATH="$LIBRARY_PATH" \
            -DHydrogen_USE_64BIT_INTS=ON \
            -DHydrogen_ENABLE_OPENMP=ON \
            -DBUILD_SHARED_LIBS=ON \
            -DHydrogen_ENABLE_ALUMINUM=OFF \
            .. &&
        make -j "$( nproc )" VERBOSE=1 install
    }

    buildOpenCV() {
        cd -- "$PREFIX/src" &&
        remoteExtract 'https://github.com/opencv/opencv/archive/3.4.3.tar.gz' &&
        cd opencv-* &&
        mkdir -p build && cd -- "$_" &&
        cmake -Wno-dev \
            -DCMAKE_BUILD_TYPE=Release \
            -DCMAKE_INSTALL_PREFIX="$OPENCV_DIR" \
            -DWITH_{JPEG,PNG,TIFF}=ON \
            -DWITH_{CUDA,JASPER}=OFF \
            -DBUILD_SHARED_LIBS=ON \
            -DBUILD_JAVA=OFF \
            -DBUILD_opencv_{calib3d,cuda,dnn,features2d,flann,java,{java,python}_bindings_generator,ml,python{2,3},stitching,ts,superres,video{,io,stab}}=OFF .. &&
        make -j "$( nproc )" install
    }

    buildLBANN() {
        fixLibZBug() {
            find . -type f -execdir bash -c '
                if grep "g++.*libcnpy\.so" "$0" | grep -q -v " -lz"; then
                    sed -i -r "/g\+\+ .*libcnpy\.so( |$)/{ s:(libcnpy\.so |$):\1-lz : }" "$0";
                fi' {} \;
        }

        cd -- "$PREFIX/src" &&
        remoteExtract 'https://github.com/LLNL/lbann/archive/v0.98.1.tar.gz' &&
        cd lbann-* &&
        mkdir -p build && cd -- "$_" &&
        cmake -Wno-dev \
            -DCMAKE_BUILD_TYPE=Release \
            -DCMAKE_INSTALL_PREFIX="$PREFIX"/lbann \
            -DCMAKE_LIBRARY_PATH="$LIBRARY_PATH" \
            -DHydrogen_DIR="$HYDROGEN_DIR" \
            -DLBANN_WITH_ALUMINUM:BOOL=OFF \
            -DLBANN_USE_PROTOBUF_MODULE=$( if test -f "$PROTOBUF_ROOT/lib/cmake/protobuf/protobuf-config.cmake"; then echo OFF; else echo ON; fi ) .. &&
        fixLibZBug &&
        make -j 2 VERBOSE=1 install # only building with -j 2 instead of -j 4 because the VM on Taurus doesn't seem to have enough memory to run for compilations in parallel ...
    }

    setupCub
    setupCereal
    setupCnpy

    ALUMINUM_DIR="$PREFIX"/Aluminum &&
    if ! test -d "$ALUMINUM_DIR"; then buildAluminum; fi &&
    export CMAKE_PREFIX_PATH=$ALUMINUM_DIR:$CMAKE_PREFIX_PATH

    HYDROGEN_DIR="$PREFIX"/Elemental &&
    if ! test -d "$HYDROGEN_DIR"; then buildHydrogen; fi &&
    export CMAKE_PREFIX_PATH=$HYDROGEN_DIR:$CMAKE_PREFIX_PATH

    OPENCV_DIR="$PREFIX"/opencv
    if ! test -d "$OPENCV_DIR"; then buildOpenCV; fi &&
    export CMAKE_PREFIX_PATH=$OPENCV_DIR:$CMAKE_PREFIX_PATH

    buildLBANN

    exit 0
```
and perf called with:
gives these results:
```
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 18K of event 'cycles:ppp'
# Event count (approx.): 14684521769
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ....................................................................................
#
    75.60%  lbann2   libHydrogen.so     [.] El::Matrix
```
Therefore it seems not to be hanging but actually to be very, very slow: 75% in `El::Matrix<float, (El::Device)0>::do_get_device_()` and 25% in `lbann::lbann_image_preprocessor::unit_scale`. (On my notebook only one core is fully used, so that might be another problem. Why aren't all 4 cores used?)
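Regarding the single busy core: one thing that could be ruled out quickly is whether the process is restricted to a single CPU, for instance by the launcher or the container, or whether the OpenMP thread count is set to 1. Neither is confirmed anywhere in this thread, so this is only a guess at a cause. A minimal sketch for Linux, with `allowed_cpus` as a hypothetical helper name:

```shell
# Hypothetical helper: print the list of CPUs a process is allowed to run on,
# as reported by the Linux /proc filesystem (e.g. "0-3" or "2").
# Usage: allowed_cpus <pid>
allowed_cpus() {
    local pid=$1
    sed -n 's/^Cpus_allowed_list:[[:space:]]*//p' "/proc/$pid/status"
}
```

Something like `allowed_cpus "$( pgrep lbann2 | tail -1 )"` together with `nproc` and `echo "${OMP_NUM_THREADS:-unset}"` in the same environment would show whether the run is pinned to one core or simply not using more threads.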