LLNL / lbann

Livermore Big Artificial Neural Network Toolkit
http://software.llnl.gov/lbann/

LBANN seems to "hang" in El::Matrix::do_get_device #914

Open mxmlnkn opened 5 years ago

mxmlnkn commented 5 years ago

When trying out LBANN on a simple MNIST example with one hidden fully-connected layer of 20 neurons, trained for one epoch, LBANN seems to hang. That is, it takes at least several minutes even though TensorFlow would have been much faster. I still have problems running perf with the version I built for a POWER architecture with graphics cards, but I also tried a kind of minimal LBANN, built without Aluminum and without CUDA, on my notebook, and there it also hangs for at least 35 minutes at this message:

STARTING train - model 1

--------------------------------------------------------------------------------
[0] Epoch : stats formated [tr/v/te] iter/epoch = [844/94/157]
            global MB = [  64/  64/  64] global last MB = [  48  /  48  /  16  ]
             local MB = [  64/  64/  64]  local last MB = [  48+0/  48+0/  16+0]
--------------------------------------------------------------------------------

LBANN call:

LBANN_MODEL_ZOO_DIR=$LBANN_DIR/share/model_zoo
MNIST_DIR=/media/f/Beruf/machine-learning/datasets/mnist
cp ${LBANN_MODEL_ZOO_DIR}/models/simple_mnist/model_mnist_simple_1.prototext mnist-1-fc-20.proto
sed -i -E 's|( num_neurons:) 500|\1 20|' mnist-1-fc-20.proto
lbann2 \
    --model=mnist-1-fc-20.proto \
    --optimizer=${LBANN_MODEL_ZOO_DIR}/optimizers/opt_sgd.prototext \
    --reader=${LBANN_MODEL_ZOO_DIR}/data_readers/data_reader_mnist.prototext \
    --data_filedir_train="$MNIST_DIR" \
    --data_filedir_test="$MNIST_DIR" \
    --num_epochs=1

System setup (singularity container on a system with no GPU, linux kernel 4.15.0 and i7-4600 CPU):

Singularity File:

```shell
Bootstrap: docker
From: debian:buster-slim

%environment
    # elevating this to /bin/bash is not possible. Therefore should on ubuntu also be runnable in /bin/dash -.-
    PREFIX=/opt/lbann

    exportPath() {
        if test -d "$2"; then
            export "$1"="$2"
            printf "\e[37mExported existing path '$2' into environment variable '$1'\e[0m\n"
        else
            printf "\e[31m[Warning] '$2' is not a directory. Won't export it\e[0m\n"
        fi
    }

    add2path() {
        local targetVar=PATH
        if test "$#" -gt 1; then
            targetVar=$1
            shift 1
        fi
        local targetContent=$( eval echo \$$targetVar )
        local oldContent=$targetContent
        while test "$#" -gt 0; do
            if test -d "$1"; then
                case ":$targetContent:" in
                    *:"$1":*) printf "\e[37m[Info] Path '$1' already exists in \$$targetVar. Won't add it.\e[0m\n" ;;
                    *) targetContent=$1:$targetContent ;;
                esac
            else
                printf "\e[33m[Warning] '$1' is not a directory. Won't append to \$$targetVar variable.\e[0m\n"
            fi
            shift 1
        done
        if test "${#targetContent}" -gt "${#oldContent}"; then
            export $targetVar=$targetContent
            printf "\e[37mExporting new \$$targetVar: $targetContent\e[0m\n"
        elif test "${#targetContent}" -lt "${#oldContent}"; then
            printf "\e[31m[Error] After adding paths, the variable is erroneously shorter (${#targetContent}) than before (${#oldContent})"'!'"\e[0m\n"
        fi
    }

    findPath() {
        local fileName=$1
        local searchPath=$2
        if test "$( find "$searchPath" -xtype f -name "$fileName" | head -2 | wc -l )" -gt 1; then
            printf "\e[33m[Warning] Found more than one matching sub path in the searchPath '$searchPath'.\e[0m\n" 1>&2
            printf "\e[37mMatches:\n" 1>&2
            find "$searchPath" -xtype f -name "$fileName" 1>&2
            printf "\e[0m\n" 1>&2
        fi
        local matchingPath=$( find "$searchPath" -xtype f -name "$fileName" | head -1 )
        printf '%s' "${matchingPath%$fileName}"
    }

    exportPath ALUMINUM_DIR "$PREFIX/Aluminum"
    exportPath CEREAL_DIR "$PREFIX/cereal"
    exportPath CNPY_DIR "$PREFIX/cnpy"
    exportPath CUB_DIR "$PREFIX"/cub-*/
    exportPath HWLOC_DIR "$PREFIX/hwloc"
    exportPath HYDROGEN_DIR "$PREFIX/Elemental"
    exportPath OPENCV_DIR "$PREFIX/opencv"
    exportPath PROTOBUF_ROOT "$PREFIX/protobuf"
    if test -d "$PROTOBUF_ROOT"; then
        add2path 'PATH' "$PROTOBUF_ROOT/bin"
        add2path 'CMAKE_PREFIX_PATH' "$PROTOBUF_ROOT"
        PROTOBUF_LIB=$( find "$PROTOBUF_ROOT" -mindepth 1 -maxdepth 1 -type d -name 'lib*' | head -1 ) &&
        add2path 'LIBRARY_PATH' "$PROTOBUF_LIB"
        add2path 'LD_LIBRARY_PATH' "$PROTOBUF_LIB"
    fi
    exportPath LBANN_DIR "$PREFIX/lbann"
    if test -d "$LBANN_DIR"; then
        add2path 'PATH' "$LBANN_DIR/bin"
        add2path 'CMAKE_PREFIX_PATH' "$LBANN_DIR"
        add2path 'LIBRARY_PATH' "$LBANN_DIR/lib"
        add2path 'LD_LIBRARY_PATH' "$LBANN_DIR/lib"
    fi
    add2path 'CMAKE_PREFIX_PATH' "$OPENCV_DIR" "$HYDROGEN_DIR" "$ALUMINUM_DIR"
    add2path 'PATH' "$PREFIX/cmake/bin"

%post
    if test "$0" = "/bin/sh"; then
        echo "Elevating script to bash"
        sed -n -z '$p' "/proc/$$/cmdline" | sed 's/\x00/\n/g' | /bin/bash -ve
        exit $?
    fi

    apt-get -y update && apt-get -y install --no-install-recommends \
        findutils sed grep coreutils curl ca-certificates tar dpkg wget cmake \
        gcc g++ gfortran python make zlib*-dev libopenblas-dev libopenmpi-dev libprotobuf-dev protobuf-compiler liblapack-dev

    PREFIX="/opt/lbann"
    mkdir -p -- "$PREFIX/src"

    version-ge() { test "$1" = "$( printf '%s\n%s' "$1" "$2" | sort -V | tail -n 1 )"; }
    commandExists() { command -v "$@" &>/dev/null; }
    unzip(){ python -c "from zipfile import PyZipFile; PyZipFile( '''$1''' ).extractall()"; }

    remoteExtract() {
        local compression=
        local url="${@: -1}"
        local ext="$( printf '%s' "$url" | sed 's/\?.*//; s/.*\.//;' )"
        local iTry=5
        for (( ; iTry > 0; iTry )); do
            case "$ext" in
                tgz|gz) compression=--gzip ;;
                xz) compression=--xz ;;
                tbz2|bz2) compression=--bzip2 ;;
            esac
            (
                if command -v wget &>/dev/null; then
                    wget -O- \
                        --retry-connrefused \
                        --timeout=5 \
                        --tries=5 \
                        --waitretry=5 \
                        --read-timeout=20 \
                        "$@" | tar -x $compression
                fi ||
                if command -v curl &>/dev/null; then
                    curl -L \
                        --connect-timeout 5 \
                        --max-time 20 \
                        --retry 5 \
                        --retry-delay 5 \
                        --retry-max-time 60 \
                        "$@" | tar -x $compression
                fi ||
                false
            ) && break
        done
    }

    setupCub() {
        cd -- "$PREFIX" &&
        if ! test -d cub-*; then
            remoteExtract 'https://github.com/NVlabs/cub/archive/v1.8.0.tar.gz'
        fi &&
        cd cub-* &&
        export CUB_DIR=$( pwd )
    }

    setupCereal() {
        export CEREAL_DIR="$PREFIX"/cereal &&
        if ! test -d "$CEREAL_DIR"; then
            cd -- "$PREFIX/src" &&
            remoteExtract 'https://github.com/USCiLab/cereal/archive/v1.2.2.tar.gz' &&
            cd cereal-* &&
            mkdir -p build &&
            cd -- "$_" &&
            cmake -Wno-dev -DCMAKE_INSTALL_PREFIX="$PREFIX"/cereal -DJUST_INSTALL_CEREAL=ON .. &&
            make -j "$( nproc )" install
        fi
    }

    setupCnpy() {
        # commit 4e8810b1a8637695171ed346ce68f6984e585ef4 to be exact but has no release and only 1 commit in last year
        export CNPY_DIR="$PREFIX"/cnpy &&
        if ! test -d "$CNPY_DIR"; then
            cd -- "$PREFIX/src" &&
            curl -L 'https://github.com/rogersce/cnpy/archive/master.zip' -o master.zip &&
            unzip "$_" &&
            command rm -f "$_" &&
            cd cnpy-* &&
            mkdir -p build &&
            cd -- "$_" &&
            cmake -Wno-dev -DCMAKE_INSTALL_PREFIX="$CNPY_DIR" .. &&
            make -j "$( nproc )" install
        fi
    }

    buildAluminum() {
        # allow Aluminum build to fail (requires at least CUDA 9 because it uses CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS)
        cd -- "$PREFIX/src" &&
        remoteExtract 'https://github.com/LLNL/Aluminum/archive/v0.2.tar.gz' &&
        cd Aluminum-* &&
        mkdir -p build &&
        cd "$_" &&
        cmake -Wno-dev \
            -DCMAKE_BUILD_TYPE=Release \
            -DCMAKE_INSTALL_PREFIX="$ALUMINUM_DIR" \
            -DCMAKE_LIBRARY_PATH="$LIBRARY_PATH" \
            .. &&
        make -j "$( nproc )" VERBOSE=1 install || true
    }

    buildHydrogen() {
        # Needs at least CUDA 7.5 because it uses cuda_fp16.h even though Hydrogen_ENABLE_HALF=OFF Oo
        cd -- "$PREFIX/src" &&
        remoteExtract 'https://github.com/LLNL/Elemental/archive/v1.1.0.tar.gz' &&
        cd Elemental-* &&
        mkdir -p build &&
        cd -- "$_" &&
        cmake -Wno-dev \
            -DCMAKE_BUILD_TYPE=Release \
            -DCMAKE_INSTALL_PREFIX="$HYDROGEN_DIR" \
            -DCMAKE_LIBRARY_PATH="$LIBRARY_PATH" \
            -DHydrogen_USE_64BIT_INTS=ON \
            -DHydrogen_ENABLE_OPENMP=ON \
            -DBUILD_SHARED_LIBS=ON \
            -DHydrogen_ENABLE_ALUMINUM=OFF \
            .. &&
        make -j "$( nproc )" VERBOSE=1 install
    }

    buildOpenCV() {
        cd -- "$PREFIX/src" &&
        remoteExtract 'https://github.com/opencv/opencv/archive/3.4.3.tar.gz' &&
        cd opencv-* &&
        mkdir -p build &&
        cd -- "$_" &&
        cmake -Wno-dev \
            -DCMAKE_BUILD_TYPE=Release \
            -DCMAKE_INSTALL_PREFIX="$OPENCV_DIR" \
            -DWITH_{JPEG,PNG,TIFF}=ON \
            -DWITH_{CUDA,JASPER}=OFF \
            -DBUILD_SHARED_LIBS=ON \
            -DBUILD_JAVA=OFF \
            -DBUILD_opencv_{calib3d,cuda,dnn,features2d,flann,java,{java,python}_bindings_generator,ml,python{2,3},stitching,ts,superres,video{,io,stab}}=OFF \
            .. &&
        make -j "$( nproc )" install
    }

    buildLBANN() {
        fixLibZBug() {
            find . -type f -execdir bash -c '
                if grep "g++.*libcnpy\.so" "$0" | grep -q -v " -lz"; then
                    sed -i -r "/g\+\+ .*libcnpy\.so( |$)/{ s:(libcnpy\.so |$):\1-lz : }" "$0";
                fi' {} \;
        }
        cd -- "$PREFIX/src" &&
        remoteExtract 'https://github.com/LLNL/lbann/archive/v0.98.1.tar.gz' &&
        cd lbann-* &&
        mkdir -p build &&
        cd -- "$_" &&
        cmake -Wno-dev \
            -DCMAKE_BUILD_TYPE=Release \
            -DCMAKE_INSTALL_PREFIX="$PREFIX"/lbann \
            -DCMAKE_LIBRARY_PATH="$LIBRARY_PATH" \
            -DHydrogen_DIR="$HYDROGEN_DIR" \
            -DLBANN_WITH_ALUMINUM:BOOL=OFF \
            -DLBANN_USE_PROTOBUF_MODULE=$( if test -f "$PROTOBUF_ROOT/lib/cmake/protobuf/protobuf-config.cmake"; then echo OFF; else echo ON; fi ) \
            .. &&
        fixLibZBug &&
        # only building with -j 2 instead of -j 4 because the VM on Taurus doesn't seem to have enough memory to run four compilations in parallel ...
        make -j 2 VERBOSE=1 install
    }

    setupCub
    setupCereal
    setupCnpy

    ALUMINUM_DIR="$PREFIX"/Aluminum &&
    if ! test -d "$ALUMINUM_DIR"; then buildAluminum; fi &&
    export CMAKE_PREFIX_PATH=$ALUMINUM_DIR:$CMAKE_PREFIX_PATH

    HYDROGEN_DIR="$PREFIX"/Elemental &&
    if ! test -d "$HYDROGEN_DIR"; then buildHydrogen; fi &&
    export CMAKE_PREFIX_PATH=$HYDROGEN_DIR:$CMAKE_PREFIX_PATH

    OPENCV_DIR="$PREFIX"/opencv
    if ! test -d "$OPENCV_DIR"; then buildOpenCV; fi &&
    export CMAKE_PREFIX_PATH=$OPENCV_DIR:$CMAKE_PREFIX_PATH

    buildLBANN
    exit 0
```

and perf called with:

sudo timeout 5s perf record -a --call-graph dwarf -p $( pgrep lbann2 | tail -1 )
perf report --no-children -T -i perf.data | c++filt > perf-callgraph-demangled-switching-to-io-events-file-view.tx
gives these results:

```
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 18K of event 'cycles:ppp'
# Event count (approx.): 14684521769
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ....................................................................................
#
    75.60%  lbann2   libHydrogen.so     [.] El::Matrix::do_get_device_() const
            |
            ---El::Matrix::do_get_device_() const
               |
                --75.57%--0x563035fc3a3f

    23.81%  lbann2   liblbann.so        [.] lbann::lbann_image_preprocessor::unit_scale(El::Matrix&, unsigned int)
            |
            --23.80%--lbann::lbann_image_preprocessor::unit_scale(El::Matrix&, unsigned int)
              |
              |--20.53%--?? (inlined)
              |          lbann::lbann_image_preprocessor::unit_scale(El::Matrix&, unsigned int)
              |          |
              |           --20.42%--0x563035fc3a3f
              |
               --3.27%--lbann::lbann_image_preprocessor::unit_scale(El::Matrix&, unsigned int)
                        lbann::lbann_image_preprocessor::normalize(El::Matrix&, unsigned int)
                        lbann::mnist_reader::fetch_datum(El::Matrix&, int, int)
                        lbann::generic_data_reader::fetch_data_block(El::Matrix&, long long, long long, El::Matrix&)
                        lbann::generic_data_reader::fetch_data(El::Matrix&, El::Matrix&)
                        lbann::fetch_data_functor::operator()(El::Matrix&, El::Matrix&, El::Matrix&, lbann::generic_data_reader*) const
                        lbann::partitioned_io_buffer::fetch_to_local_matrix(lbann::generic_data_reader*, lbann::execution_mode)
                        lbann::generic_input_layer::fetch_data_in_background(int, lbann::execution_mode)
                        std::_Function_handler (), std::__future_base::_Task_setter, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state, std::allocator, void ()>::_M_run()::{lambda()#1}, void> >::_M_invoke(std::_Any_data const&)
                        ?? (inlined)
                        ?? (inlined)
                        ?? (inlined)
                        ?? (inlined)
                        ?? (inlined)
                        std::__future_base::_Task_state, std::allocator, void ()>::_M_run()::{lambda()#1}::operator()() const (inlined)
                        ?? (inlined)
                        std::_Function_handler (), std::__future_base::_Task_setter, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state, std::allocator, void ()>::_M_run()::{lambda()#1}, void> >::_M_invoke(std::_Any_data const&)
                        std::__future_base::_State_baseV2::_M_do_set(std::function ()>*, bool*)
                        std::__future_base::_State_baseV2::_M_do_set(std::function ()>*, bool*)
                        0xf946

     0.05%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad945c92
     0.04%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7d77e9
     0.04%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad945c4a
     0.03%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad945d02
     0.02%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadf9e392
     0.02%  lbann2   [kernel.kallsyms]  [k] 0xffffffffae0009e7
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7d6594
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad945cd1
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadf80373
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadf82cb1
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadac166a
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7d1bf1
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7e01b4
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadf9e486
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7f5bad
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7d563b
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffada4b070
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffada4dd59
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7d77c4
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffada564c2
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7f4ffd
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffada52f54
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7f5012
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7d779d
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadac167c
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad70e345
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7d560f
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffaddea869
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7d5b3a
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadac1626
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad945ca2
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffae002d6c
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadf9e206
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7f5b85
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad945c3a
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7f5bb6
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7d77ad
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7d5600
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7f5b8b
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad70c368
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad6b860a
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad945c87
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadcbbaad
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7d689c
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadf80379
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadf7b379
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad863a55
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffada5aee2
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffaddc7830
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7e0108
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7cc485
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad70f2eb
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad6d23b5
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad6af485
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadac2647
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadac1630
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7cc498
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadc12a93
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7f731a
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadf830a5
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7f5ecb
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadf9e215
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad84583c
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7dffab
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffada4def0
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadac1620
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7e01ad
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadcc86ef
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffae2031a0
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad70c2e5
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadf9e21a
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad7017bd
     0.01%  lbann2   [kernel.kallsyms]  [k] 0xffffffffadac167e
     0.00%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad945c44
     0.00%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad6fddb6
     0.00%  lbann2   [kernel.kallsyms]  [k] 0xffffffffada933f9
     0.00%  lbann2   [kernel.kallsyms]  [k] 0xffffffffae00099f
     0.00%  lbann2   [kernel.kallsyms]  [k] 0xffffffffae001a4d
     0.00%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad60cb31
     0.00%  lbann2   [kernel.kallsyms]  [k] 0xffffffffad66c1a8

#
# (Tip: Profiling branch (mis)predictions with: perf record -b / perf report)
#
# PID  TID
```

Therefore it seems not to be hanging but actually to be very, very slow: 75% of the time is spent in `El::Matrix<float, (El::Device)0>::do_get_device_()` and 25% in `lbann::lbann_image_preprocessor::unit_scale`.

(On my notebook, only one core is fully used, so that might be a separate problem. Why aren't all 4 cores used?)

timmoon10 commented 5 years ago

This looks like the hang described in #879. I don't understand why El::Matrix::do_get_device_() would have a problem since it is a one-line function that just returns an enum (https://github.com/LLNL/Elemental/blob/3b784761bfb587f52ea1d6f882cb6905083bf795/include/El/core/Matrix/impl_cpu.hpp#L230). This will require some digging.

mxmlnkn commented 5 years ago

Thanks for the answer. I updated LBANN to commit 10b84da933a7b62e63120c6f9067df17cd9ba5f3 and Elemental to 6d4bc32515087ed7c8c1dd2687dd2cc771c139d3, and now it works ... sometimes. Roughly 20% of the time, LBANN hangs after an epoch has ended. It looks to me like a race condition.

E.g. output before it hangs:

```
--------------------------------------------------------------------------------
[0] Epoch : stats formated [tr/v/te] iter/epoch = [844/94/157]
            global MB = [  64/  64/  64] global last MB = [  48  /  48  /  16  ]
             local MB = [  64/  64/  64]  local last MB = [  48+0/  48+0/  16+0]
--------------------------------------------------------------------------------
model0 (instance 0) training epoch 0 objective function : 0.458898
model0 (instance 0) training epoch 0 categorical accuracy : 86.8796%
model0 (instance 0) training epoch 0 run time : 2.53483s
model0 (instance 0) training epoch 0 mini-batch time statistics : 0.00280938s mean, 0.0282861s max, 0.00159096s min, 0.00129718s stdev
model0 (instance 0) validation objective function : 0.297776
model0 (instance 0) validation categorical accuracy : 91.6833%
model0 (instance 0) validation run time : 0.170268s
model0 (instance 0) validation mini-batch time statistics : 0.00179428s mean, 0.0096336s max, 0.000733707s min, 0.00104509s stdev
--------------------------------------------------------------------------------
[1] Epoch : stats formated [tr/v/te] iter/epoch = [844/94/157]
            global MB = [  64/  64/  64] global last MB = [  48  /  48  /  16  ]
             local MB = [  64/  64/  64]  local last MB = [  48+0/  48+0/  16+0]
--------------------------------------------------------------------------------
model0 (instance 0) training epoch 1 objective function : 0.279358
model0 (instance 0) training epoch 1 categorical accuracy : 92.2278%
model0 (instance 0) training epoch 1 run time : 2.60129s
model0 (instance 0) training epoch 1 mini-batch time statistics : 0.00290768s mean, 0.0292201s max, 0.0014437s min, 0.00153953s stdev
```

I tried the same thing with perf, but seeing that the CPU was not busy, I almost knew it wouldn't work. I only get:

sudo timeout 20s perf record -a --call-graph dwarf -p $( pgrep lbann2 | tail -1 )

Warning:
PID/TID switch overriding SYSTEM
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.750 MB perf.data ]

/usr/bin/perf report --no-children -T -i perf.data | c++filt

Error:
The perf.data file has no samples!
# To display the perf.data header info, please use --header/--header-only options.
#

Doing a backtrace from outside the container using `sudo gdb "$lbannBinPath" $( pidof "$lbannBinPath" )`, and then `bt`, I get:

#0  0x00007f31ed745289 in syscall () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f31eda68b48 in std::__atomic_futex_unsigned_base::_M_futex_wait_until(unsigned int*, unsigned int, bool, std::chrono::duration<long, std::ratio<1l, 1l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from target:/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007f31f59cb799 in std::__atomic_futex_unsigned<2147483648u>::_M_load_and_test_until (__ns=..., __s=...,
    __has_timeout=<optimized out>, __mo=<optimized out>, __equal=<optimized out>, __operand=<optimized out>, __assumed=<optimized out>,
    this=<optimized out>) at /usr/include/c++/8/bits/atomic_base.h:542
#3  std::__atomic_futex_unsigned<2147483648u>::_M_load_and_test (__mo=<optimized out>, __equal=<optimized out>, __operand=<optimized out>,
    __assumed=<optimized out>, this=<optimized out>) at /usr/include/c++/8/bits/atomic_futex.h:122
#4  std::__atomic_futex_unsigned<2147483648u>::_M_load_when_equal (__mo=std::memory_order_acquire, __val=1, this=0x55e5fe56e8d0)
    at /usr/include/c++/8/bits/atomic_futex.h:162
#5  std::__future_base::_State_baseV2::wait (this=0x55e5fe56e8c0) at /usr/include/c++/8/future:337
#6  std::__basic_future<void>::_M_get_result (this=0x7ffc834390f0) at /usr/include/c++/8/future:717
#7  std::future<void>::get (this=0x7ffc834390f0) at /usr/include/c++/8/future:882
#8  lbann::generic_input_layer::fp_compute (this=0x55e5fe3a8710)
    at /opt/lbann/src/lbann-10b84da933a7b62e63120c6f9067df17cd9ba5f3/include/lbann/layers/io/input/generic_input_layer.hpp:268
#9  0x00007f31f5954a37 in lbann::Layer::forward_prop (this=0x55e5fe3a8710)
    at /opt/lbann/src/lbann-10b84da933a7b62e63120c6f9067df17cd9ba5f3/src/layers/layer.cpp:263
#10 0x00007f31f596a49e in lbann::model::forward_prop (this=0x55e5fe4bd4b0, mode=lbann::execution_mode::testing)
    at /opt/lbann/src/lbann-10b84da933a7b62e63120c6f9067df17cd9ba5f3/src/models/model.cpp:1114
#11 0x00007f31f596aef6 in lbann::model::evaluate_mini_batch(lbann::execution_mode) ()
    at /opt/lbann/src/lbann-10b84da933a7b62e63120c6f9067df17cd9ba5f3/src/models/model.cpp:1035
#12 0x00007f31f596f77f in lbann::model::evaluate(lbann::execution_mode, int) ()
    at /opt/lbann/src/lbann-10b84da933a7b62e63120c6f9067df17cd9ba5f3/src/models/model.cpp:957
#13 0x000055e5fd9b19be in main () at /usr/include/c++/8/bits/unique_ptr.h:342
#14 0x00007f31ed67509b in __libc_start_main () from target:/lib/x86_64-linux-gnu/libc.so.6
#15 0x000055e5fd9b22ba in _start () at /usr/include/c++/8/ext/atomicity.h:69

Should this be a separate issue, or should I just change the issue title ...?

I also used perf on the POWER machine, still with LBANN 0.98.1, and I get:

#
# Total Lost Samples: 0
#
# Samples: 49K of event 'cycles:uppp'
# Event count (approx.): 46434771680
#
# Overhead  Command         Shared Object       Symbol                                              
# ........  ..............  ..................  ....................................................
#
    49.86%  lbann2          libAl.so            [.] Al::internal::ProgressEngine::engine
    33.03%  lbann2          liblbann.so         [.] lbann::lbann_image_preprocessor::unit_scale
    17.09%  lbann2          libHydrogen.so      [.] El::Matrix<float, (El::Device)0>::do_get_device_
     0.00%  lbann2          libucs.so.0.0.0     [.] _init
     0.00%  cuda-EvtHandlr  libcuda.so.396.37   [.] 0x00000000002324f4
     0.00%  cuda-EvtHandlr  libcuda.so.396.37   [.] 0x00000000000b5c34
     0.00%  cuda-EvtHandlr  libcuda.so.396.37   [.] 0x00000000002195b4
     0.00%  cuda-EvtHandlr  libcuda.so.396.37   [.] 0x0000000000232fd0
     0.00%  lbann2          libmlx5.so.1.0.0    [.] 0x000000000002c194
     0.00%  lbann2          libucs.so.0.0.0     [.] __ucs_twheel_sweep
     0.00%  lbann2          libuct.so.0.0.0     [.] 0x000000000005a5d4
     0.00%  lbann2          libuct.so.0.0.0     [.] 0x00000000000644fc
     0.00%  lbann2          libpthread-2.17.so  [.] pthread_spin_lock
     0.00%  lbann2          libmlx5.so.1.0.0    [.] 0x000000000002c080
     0.00%  cuda-EvtHandlr  libc-2.17.so        [.] __libc_enable_asynccancel
     0.00%  cuda-EvtHandlr  libcuda.so.396.37   [.] 0x000000000010af50
     0.00%  lbann2          libuct.so.0.0.0     [.] _init
     0.00%  lbann2          libucs.so.0.0.0     [.] 0x0000000000037e74
     0.00%  cuda-EvtHandlr  libcuda.so.396.37   [.] 0x00000000000b5c30
     0.00%  lbann2          libc-2.17.so        [.] __libc_disable_asynccancel
     0.00%  cuda-EvtHandlr  libc-2.17.so        [.] __libc_disable_asynccancel
     0.00%  lbann2          libc-2.17.so        [.] epoll_wait
     0.00%  cuda-EvtHandlr  libc-2.17.so        [.] __GI___libc_poll
     0.00%  lbann2          [kernel.kallsyms]   [k] doorbell_exception

#
# (Tip: To count events in every 1000 msec: perf stat -I 1000)
#
# PID  TID

I guess it is indeed the same problem as the one solved by #879, because I also see do_get_device showing up (albeit with much less usage). I will try the same commits I already tried on my notebook to confirm.