Tests SegFaults on pre-build Docker image

BoneGoat commented 4 years ago

I have been able to run W2L before on this machine but now everything seems to go wrong. Any clue as to where I should start looking?

After some digging, I wonder whether W2L isn't CUDA 10 compatible. I tried building my own Docker image with FROM nvidia/cuda:10.2-cudnn7-devel-ubuntu16.04, but the tests still segfault. Is there anything else I need to change in the Dockerfile?

Just noticed the following error when building ArrayFire:

-- Automatic GPU detection failed. Building for common architectures.
-- CUDA_architecture_build_targets: 3.0;3.5;5.0;5.2;6.0;6.1;7.0;7.0+PTX
-- CUDA driver library missing. Looking for libcuda stub.
-- CUDA driver stub FOUND: /usr/local/cuda/lib64/stubs/libcuda.so

tobias@tifa:~$ docker run --runtime=nvidia -v /mnt/data:/root/data --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it --rm wav2letter/wav2letter:cuda-latest
root@b68215068965:/# nvidia-smi
Thu Jan 23 18:50:46 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
| 26%   44C    P0    20W / 260W |      0MiB / 11018MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@b68215068965:/# cd /root/wav2letter/build/
root@b68215068965:~/wav2letter/build# make test
Running tests...
Test project /root/wav2letter/build
      Start  1: W2lCommonTest
 1/30 Test  #1: W2lCommonTest ....................***Exception: SegFault  1.91 sec
      Start  2: DictionaryTest
 2/30 Test  #2: DictionaryTest ...................   Passed    0.02 sec
      Start  3: CriterionTest
 3/30 Test  #3: CriterionTest ....................***Exception: SegFault  1.22 sec
      Start  4: Seq2SeqTest
 4/30 Test  #4: Seq2SeqTest ......................***Exception: SegFault  1.15 sec
      Start  5: AttentionTest
 5/30 Test  #5: AttentionTest ....................***Exception: SegFault  1.35 sec
      Start  6: WindowTest
 6/30 Test  #6: WindowTest .......................***Exception: SegFault  1.20 sec
      Start  7: DataTest
 7/30 Test  #7: DataTest .........................***Failed    1.21 sec
      Start  8: ListFileDatasetTest
 8/30 Test  #8: ListFileDatasetTest ..............   Passed    1.11 sec
      Start  9: SoundTest
 9/30 Test  #9: SoundTest ........................   Passed    0.07 sec
      Start 10: DecoderTest
10/30 Test #10: DecoderTest ......................   Passed    1.21 sec
      Start 11: CeplifterTest
11/30 Test #11: CeplifterTest ....................   Passed    0.02 sec
      Start 12: DctTest
12/30 Test #12: DctTest ..........................   Passed    0.04 sec
      Start 13: DerivativesTest
13/30 Test #13: DerivativesTest ..................   Passed    0.02 sec
      Start 14: DitherTest
14/30 Test #14: DitherTest .......................   Passed    8.03 sec
      Start 15: MfccTest
15/30 Test #15: MfccTest .........................   Passed    0.19 sec
      Start 16: PreEmphasisTest
16/30 Test #16: PreEmphasisTest ..................   Passed    0.02 sec
      Start 17: SpeechUtilsTest
17/30 Test #17: SpeechUtilsTest ..................***Failed    1.24 sec
      Start 18: TriFilterbankTest
18/30 Test #18: TriFilterbankTest ................   Passed    0.04 sec
      Start 19: WindowingTest
19/30 Test #19: WindowingTest ....................   Passed    0.02 sec
      Start 20: W2lModuleTest
20/30 Test #20: W2lModuleTest ....................***Exception: SegFault  1.95 sec
      Start 21: RuntimeTest
21/30 Test #21: RuntimeTest ......................***Failed    8.86 sec
      Start 22: inference_Conv1dTest
22/30 Test #22: inference_Conv1dTest .............   Passed    0.03 sec
      Start 23: inference_IdentityTest
23/30 Test #23: inference_IdentityTest ...........   Passed    0.00 sec
      Start 24: inference_LayerNormTest
24/30 Test #24: inference_LayerNormTest ..........   Passed    0.01 sec
      Start 25: inference_LinearTest
25/30 Test #25: inference_LinearTest .............   Passed    0.00 sec
      Start 26: inference_LogMelFeatureTest
26/30 Test #26: inference_LogMelFeatureTest ......   Passed    0.03 sec
      Start 27: inference_MemoryManagerTest
27/30 Test #27: inference_MemoryManagerTest ......   Passed    0.00 sec
      Start 28: inference_ReluTest
28/30 Test #28: inference_ReluTest ...............   Passed    0.00 sec
      Start 29: inference_ResidualTest
29/30 Test #29: inference_ResidualTest ...........   Passed    0.00 sec
      Start 30: inference_TDSBlockTest
30/30 Test #30: inference_TDSBlockTest ...........   Passed    0.00 sec

70% tests passed, 9 tests failed out of 30

Total Test time (real) =  30.98 sec

The following tests FAILED:
      1 - W2lCommonTest (SEGFAULT)
      3 - CriterionTest (SEGFAULT)
      4 - Seq2SeqTest (SEGFAULT)
      5 - AttentionTest (SEGFAULT)
      6 - WindowTest (SEGFAULT)
      7 - DataTest (Failed)
     17 - SpeechUtilsTest (Failed)
     20 - W2lModuleTest (SEGFAULT)
     21 - RuntimeTest (Failed)
Errors while running CTest
Makefile:104: recipe for target 'test' failed
make: *** [test] Error 8
root@b68215068965:~/wav2letter/build# cd /root/flashlight/build/
root@b68215068965:~/flashlight/build# make test
Running tests...
Test project /root/flashlight/build
      Start  1: AutogradTest
 1/13 Test  #1: AutogradTest .....................***Exception: SegFault  1.21 sec
      Start  2: DevicePtrTest
 2/13 Test  #2: DevicePtrTest ....................   Passed    1.14 sec
      Start  3: SerializationTest
 3/13 Test  #3: SerializationTest ................   Passed    0.02 sec
      Start  4: OptimTest
 4/13 Test  #4: OptimTest ........................***Exception: SegFault  1.20 sec
      Start  5: ModuleTest
 5/13 Test  #5: ModuleTest .......................***Exception: SegFault  2.07 sec
      Start  6: NNSerializationTest
 6/13 Test  #6: NNSerializationTest ..............***Exception: SegFault  1.19 sec
      Start  7: NNUtilsTest
 7/13 Test  #7: NNUtilsTest ......................***Failed    1.11 sec
      Start  8: DatasetTest
 8/13 Test  #8: DatasetTest ......................***Exception: SegFault  1.21 sec
      Start  9: DatasetUtilsTest
 9/13 Test  #9: DatasetUtilsTest .................   Passed    0.02 sec
      Start 10: MeterTest
10/13 Test #10: MeterTest ........................***Failed    1.06 sec
      Start 11: AllReduceTest
11/13 Test #11: AllReduceTest ....................***Exception: SegFault  1.31 sec
      Start 12: ContribModuleTest
12/13 Test #12: ContribModuleTest ................***Exception: SegFault  1.91 sec
      Start 13: ContribSerializationTest
13/13 Test #13: ContribSerializationTest .........***Exception: SegFault  1.22 sec

23% tests passed, 10 tests failed out of 13

Total Test time (real) =  14.67 sec

The following tests FAILED:
      1 - AutogradTest (SEGFAULT)
      4 - OptimTest (SEGFAULT)
      5 - ModuleTest (SEGFAULT)
      6 - NNSerializationTest (SEGFAULT)
      7 - NNUtilsTest (Failed)
      8 - DatasetTest (SEGFAULT)
     10 - MeterTest (Failed)
     11 - AllReduceTest (SEGFAULT)
     12 - ContribModuleTest (SEGFAULT)
     13 - ContribSerializationTest (SEGFAULT)
Errors while running CTest
Makefile:71: recipe for target 'test' failed
make: *** [test] Error 8
root@b68215068965:~/flashlight/build# cd tests/
root@b68215068965:~/flashlight/build/tests# ./AutogradTest
[==========] Running 55 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 55 tests from AutogradTest
[ RUN      ] AutogradTest.AfRefCountBasic
/root/flashlight/tests/autograd/AutogradTest.cpp:73: Failure
Expected equality of these values:
  refCount
    Which is: 0
  1
[  FAILED  ] AutogradTest.AfRefCountBasic (902 ms)
[ RUN      ] AutogradTest.AfRefCountModify
/root/flashlight/tests/autograd/AutogradTest.cpp:95: Failure
Expected equality of these values:
  refCount
    Which is: 0
  1
[  FAILED  ] AutogradTest.AfRefCountModify (1 ms)
[ RUN      ] AutogradTest.AfRefCountGradient
Segmentation fault (core dumped)

root@9eb255f5a953:~/flashlight/build/tests# ./OptimTest
[==========] Running 2 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 1 test from OptimTest
[ RUN      ] OptimTest.GradNorm
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function cuda::Kernel cuda::buildKernel(int, const string&, const string&, const std::vector<std::__cxx11::basic_string<char> >&, bool)
In file src/backend/cuda/nvrtc/cache.cpp:160
NVRTC Error(5): NVRTC_ERROR_INVALID_OPTION

In function double af::norm(const af::array&, af::normType, double, double)
In file src/api/cpp/lapack.cpp:135" thrown in the test body.
[  FAILED  ] OptimTest.GradNorm (921 ms)
[----------] 1 test from OptimTest (921 ms total)

[----------] 1 test from SerializationTest
[ RUN      ] SerializationTest.OptimizerSerialize
Segmentation fault (core dumped)
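
With a log this long, it can help to reduce it to just the failing tests before rerunning each binary by hand, as done above with ./AutogradTest and ./OptimTest. The following is a minimal sketch, assuming the CTest line layout shown in this report; `failed_tests` is a hypothetical helper name, not part of CTest or wav2letter.

```shell
#!/bin/sh
# failed_tests: read CTest output on stdin and print "<test name><TAB><status>"
# for every line that CTest marked with "***" (Failed, Exception: SegFault, ...).
# Hypothetical helper; it assumes the line layout shown in the log above.
failed_tests() {
    awk '/Test +#[0-9]+:/ && /\*\*\*/ {
        status = $0
        sub(/^.*\*\*\*/, "", status)          # keep only the text after "***"
        sub(/ +[0-9.]+ sec.*$/, "", status)   # drop the trailing run time
        print $4 "\t" status                  # $4 is the test name
    }'
}

# Demo on three lines copied from the report.
printf '%s\n' \
    ' 1/30 Test  #1: W2lCommonTest ....................***Exception: SegFault  1.91 sec' \
    ' 2/30 Test  #2: DictionaryTest ...................   Passed    0.02 sec' \
    ' 7/30 Test  #7: DataTest .........................***Failed    1.21 sec' |
    failed_tests
```

In practice the input would come from the build directory, for example `make test 2>&1 | failed_tests`.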

tlikhomanenko commented 4 years ago

Hi @BoneGoat,

W2L is CUDA 10 compatible. You can either use the base image wav2letter/wav2letter:cuda-base-10-latest and install flashlight and w2l inside it, or use wav2letter/wav2letter:cuda-10-latest, in which case you need to run git pull, cmake and make for both flashlight and wav2letter.

Maybe the simplest option is to use wav2letter/wav2letter:cuda-base-10-latest and then follow the installation instructions in https://github.com/facebookresearch/wav2letter/blob/master/Dockerfile-CUDA
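
The "git pull, cmake and make" steps could look roughly like the sketch below. The paths and CMake flags are taken from the Dockerfiles posted in this thread; `run_in`, `rebuild` and the `DRY_RUN` switch are hypothetical helpers added for illustration, and the sketch assumes the checkouts already exist under /root in the image.

```shell
#!/bin/sh
# Sketch of the "git pull, cmake and make" steps for a checkout that already
# exists in the image. run_in and rebuild are hypothetical helper names.
# Set DRY_RUN=1 to print the commands instead of executing them.

# run_in DIR CMD...: run CMD inside DIR (in a subshell, so the caller's
# working directory is unchanged).
run_in() {
    dir=$1
    shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "(cd $dir && $*)"
    else
        ( cd "$dir" && "$@" )
    fi
}

# rebuild SRC_DIR [extra cmake flags...]: update the checkout and rebuild it.
rebuild() {
    src=$1
    shift
    run_in "$src" git pull &&
    run_in "$src" mkdir -p build &&
    run_in "$src/build" cmake .. -DCMAKE_BUILD_TYPE=Release "$@" &&
    run_in "$src/build" make -j8
}

# Intended use inside the container:
#   rebuild /root/flashlight -DFLASHLIGHT_BACKEND=CUDA &&
#       run_in /root/flashlight/build make install
#   export MKLROOT=/opt/intel/mkl KENLM_ROOT_DIR=/root/kenlm
#   rebuild /root/wav2letter -DW2L_CRITERION_BACKEND=CUDA
```

Running with `DRY_RUN=1` first prints the exact command sequence, which is a cheap way to check the paths before starting a long build.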

I'm also adding the Dockerfiles I used previously.

cuda-base-10 flashlight

# ==================================================================
# module list
# ------------------------------------------------------------------
# Ubuntu           16.04
# CUDA             10.0
# CuDNN            7-dev
# arrayfire        3.6.4    (git, CUDA backend)
# OpenMPI          latest   (apt)
# ==================================================================

FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04

RUN APT_INSTALL="apt-get install -y --no-install-recommends" && \
    rm -rf /var/lib/apt/lists/* \
           /etc/apt/sources.list.d/cuda.list \
           /etc/apt/sources.list.d/nvidia-ml.list && \
    apt-get update && \
    DEBIAN_FRONTEND=noninteractive $APT_INSTALL \
        build-essential \
        ca-certificates \
        cmake \
        wget \
        git \
        vim \
        emacs \
        nano \
        htop \
        g++ \
        # ssh for OpenMPI
        openssh-server openssh-client \
        # OpenMPI
        libopenmpi-dev libomp-dev \
        # nccl: for flashlight
        libnccl2 libnccl-dev \
        libglfw3-dev && \
# ==================================================================
# arrayfire https://github.com/arrayfire/arrayfire/wiki/
# ------------------------------------------------------------------
    cd /tmp && git clone --recursive https://github.com/arrayfire/arrayfire.git && \
    cd arrayfire && git checkout v3.6.4 && git submodule update --init --recursive && \
    mkdir build && cd build && \
    CXXFLAGS=-DOS_LNX cmake .. -DCMAKE_BUILD_TYPE=Release -DAF_BUILD_CPU=OFF -DAF_BUILD_OPENCL=OFF -DAF_BUILD_EXAMPLES=OFF && \
    make -j8 && \
    make install && \
# ==================================================================
# config & cleanup
# ------------------------------------------------------------------
    ldconfig && \
    apt-get clean && \
    apt-get autoremove && \
    rm -rf /var/lib/apt/lists/* /tmp/* ~/* && \
    # If the driver is not found (during docker build), the CUDA driver API needs to be linked against the
    # libcuda.so stub located in the lib[64]/stubs directory
    ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1

cuda-base-10

# ==================================================================
# module list
# ------------------------------------------------------------------
# inherit from flml/flashlight:cuda-base-10-latest
# python           3.6          (apt)
# libsndfile       bef2abc      (git)
# MKL              2018.4.057   (apt)
# FFTW             latest       (apt)
# KenLM            e47088d      (git)
# GLOG             latest       (apt)
# gflags           latest       (apt)
# ==================================================================

FROM flml/flashlight:cuda-base-10-latest

RUN APT_INSTALL="apt-get install -y --no-install-recommends" && \
    apt-get update && \
    DEBIAN_FRONTEND=noninteractive $APT_INSTALL \
        # for libsndfile
        autoconf automake autogen build-essential libasound2-dev \
        libflac-dev libogg-dev libtool libvorbis-dev pkg-config python \
        # for Intel's Math Kernel Library (MKL)
        cpio \
        # FFTW
        libfftw3-dev \
        # for kenlm
        zlib1g-dev libbz2-dev liblzma-dev libboost-all-dev \
        # gflags
        libgflags-dev libgflags2v5 \
        # for glog
        libgoogle-glog-dev libgoogle-glog0v5 \
        # for recipes data processing
        sox && \
# ==================================================================
# python (for recipes data processing)
# ------------------------------------------------------------------
    PIP_INSTALL="python3 -m pip --no-cache-dir install --upgrade" && \
    DEBIAN_FRONTEND=noninteractive $APT_INSTALL \
        software-properties-common \
        && \
    add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update && \
    DEBIAN_FRONTEND=noninteractive $APT_INSTALL \
        python3.6 \
        python3.6-dev \
        && \
    wget -O ~/get-pip.py \
        https://bootstrap.pypa.io/get-pip.py && \
    python3.6 ~/get-pip.py && \
    ln -s /usr/bin/python3.6 /usr/local/bin/python3 && \
    ln -s /usr/bin/python3.6 /usr/local/bin/python && \
    $PIP_INSTALL \
        setuptools \
        && \
    $PIP_INSTALL \
        sox \
        tqdm && \
# ==================================================================
# libsndfile https://github.com/erikd/libsndfile.git
# ------------------------------------------------------------------
    cd /tmp && git clone https://github.com/erikd/libsndfile.git && \
    cd libsndfile && git checkout bef2abc9e888142203953addc31c50a192e496e5 && \
    ./autogen.sh && ./configure --enable-werror && \
    make && make check && make install && \
# ==================================================================
# MKL https://software.intel.com/en-us/mkl
# ------------------------------------------------------------------
    cd /tmp && wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB && \
    apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB && \
    wget https://apt.repos.intel.com/setup/intelproducts.list -O /etc/apt/sources.list.d/intelproducts.list && \
    sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list' && \
    apt-get update && DEBIAN_FRONTEND=noninteractive $APT_INSTALL intel-mkl-64bit-2018.4-057 && \
# ==================================================================
# KenLM https://github.com/kpu/kenlm
# ------------------------------------------------------------------
    cd /root && git clone https://github.com/kpu/kenlm.git && \
    cd kenlm && git checkout e47088ddfae810a5ee4c8a9923b5f8071bed1ae8 && \
    mkdir build && cd build && \
    cmake .. && \
    make -j8 && make install && \
# ==================================================================
# config & cleanup
# ------------------------------------------------------------------
    ldconfig && \
    apt-get clean && \
    apt-get autoremove && \
    rm -rf /var/lib/apt/lists/* /tmp/*

cuda-10

# ==================================================================
# module list
# ------------------------------------------------------------------
# flashlight       master       (git, CUDA backend)
# ==================================================================

FROM wav2letter/wav2letter:cuda-base-10-latest

RUN mkdir /root/wav2letter
COPY . /root/wav2letter

# ==================================================================
# flashlight https://github.com/facebookresearch/flashlight.git
# ------------------------------------------------------------------
RUN cd /root && git clone --recursive https://github.com/facebookresearch/flashlight.git && \
    cd /root/flashlight && mkdir -p build && \
    cd build && cmake .. -DCMAKE_BUILD_TYPE=Release -DFLASHLIGHT_BACKEND=CUDA && \
    make -j8 && make install && \
# ==================================================================
# wav2letter with GPU backend
# ------------------------------------------------------------------
    export MKLROOT=/opt/intel/mkl && export KENLM_ROOT_DIR=/root/kenlm && \
    cd /root/wav2letter && mkdir -p build && \
    cd build && cmake .. -DCMAKE_BUILD_TYPE=Release -DW2L_CRITERION_BACKEND=CUDA && \
    make -j8

BoneGoat commented 4 years ago

@tlikhomanenko Thanks for the Dockerfiles! I built my own Dockerfile, which looks very similar to yours. The only differences I can spot are that I use 10.2-cudnn7-devel-ubuntu16.04 and build ArrayFire from master.

With the image built from my Dockerfile, all tests pass except one for Flashlight:

[ RUN      ] AutogradTest.Variance
/root/flashlight/tests/autograd/AutogradTest.cpp:485: Failure
Value of: allClose(calculated_var.array(), expected_var)
  Actual: false
Expected: true
[  FAILED  ] AutogradTest.Variance (334 ms)

I'm going to build your Dockerfile and see if that will work.

tlikhomanenko commented 4 years ago

@BoneGoat,

If you are building ArrayFire from master, make sure its tests pass; maybe this was the issue in your original bug report.

For flashlight, there could be some discrepancy in precision, because the kernels differ slightly. Is AutogradTest.Variance the only flashlight test that fails?
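
To make the precision point concrete: tests of this kind usually compare results within a tolerance rather than for exact equality, so rounding-level differences between kernels pass while a genuinely different result does not. A minimal sketch, assuming an absolute tolerance; `is_close` and its default of `1e-5` are illustrative choices, not flashlight's actual `allClose` implementation.

```shell
#!/bin/sh
# is_close A B [TOL]: succeed (exit status 0) if |A - B| <= TOL.
# Hypothetical helper; the default tolerance of 1e-5 is an assumption.
is_close() {
    awk -v a="$1" -v b="$2" -v tol="${3:-1e-5}" 'BEGIN {
        d = a - b
        if (d < 0) d = -d          # absolute difference
        exit !(d <= tol + 0)       # awk exit status 0 means "close"
    }'
}

is_close 0.3333333 0.3333334 && echo "rounding-level difference: close"
is_close 1.25 1.6667 || echo "different result: not close"
```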

BoneGoat commented 4 years ago

@tlikhomanenko In my original issue I'm running the pre-built Docker image. I don't understand why that wouldn't work, as CUDA should be backwards compatible. To get things moving I wrote my own Dockerfile, but then AutogradTest.Variance fails; it is the only test that fails for Flashlight. All tests are OK for W2L.

I have now tested your Dockerfile and all tests are OK. So I will move forward with that one.

Thanks for your help!
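
On the backward-compatibility point: a newer driver can run binaries built against an older CUDA toolkit, but an older toolkit cannot generate code for a GPU architecture released after it. The RTX 2080 family is Turing (compute capability 7.5), which is first supported by CUDA 10.0, and the `CUDA Version: 10.2` printed by `nvidia-smi` is the highest version the driver supports, not the toolkit inside the image. If the pre-built image ships a CUDA 9.x toolkit (an assumption; the thread does not state it), that would be consistent with the `NVRTC_ERROR_INVALID_OPTION` above. A rough sketch of the check, with hypothetical helper names and a deliberately partial table:

```shell
#!/bin/sh
# min_cuda_for_cc CC: print the first CUDA toolkit release that supports
# compute capability CC. Partial table; prints "unknown" for anything else.
min_cuda_for_cc() {
    case $1 in
        6.0|6.1) echo 8.0 ;;    # Pascal
        7.0)     echo 9.0 ;;    # Volta
        7.5)     echo 10.0 ;;   # Turing (e.g. RTX 2080 family)
        8.0)     echo 11.0 ;;   # Ampere
        *)       echo unknown ;;
    esac
}

# version_ge HAVE NEED: succeed if major.minor version HAVE >= NEED.
version_ge() {
    awk -v have="$1" -v need="$2" 'BEGIN {
        split(have, h, "."); split(need, n, ".")
        exit !(h[1] + 0 > n[1] + 0 || (h[1] + 0 == n[1] + 0 && h[2] + 0 >= n[2] + 0))
    }'
}

# toolkit_supports TOOLKIT CC: succeed if the toolkit can target the GPU.
toolkit_supports() {
    need=$(min_cuda_for_cc "$2")
    [ "$need" != unknown ] && version_ge "$1" "$need"
}

for toolkit in 9.2 10.0 10.2; do
    if toolkit_supports "$toolkit" 7.5; then
        echo "CUDA $toolkit: can target compute capability 7.5"
    else
        echo "CUDA $toolkit: too old for compute capability 7.5"
    fi
done
```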

kriswuollett commented 4 years ago

I wanted to try out w2l with the tutorials / recipes, but I just encountered the same AutogradTest.Variance issue. I don't know enough to judge which variance convention is correct, but I think I tracked it down to ArrayFire fixing the specification of their isbiased parameter in https://github.com/arrayfire/arrayfire/pull/2710.
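
For context, the two variance conventions differ only in the divisor: n for the biased (population) estimate and n - 1 for the unbiased (sample) estimate. If a library flag that selects between them changes meaning, a test with a hard-coded expectation sees a different number, well outside any floating-point tolerance. A small sketch of the arithmetic, independent of ArrayFire (the helper name `variances` is made up for illustration):

```shell
#!/bin/sh
# variances: read one number per line on stdin and print the variance
# computed with both divisors. Needs at least two values.
variances() {
    awk '{ x[NR] = $1; sum += $1 }
         END {
             n = NR
             mean = sum / n
             for (i = 1; i <= n; i++) ss += (x[i] - mean) ^ 2   # sum of squared deviations
             printf "divide_by_n=%.4f divide_by_n_minus_1=%.4f\n", ss / n, ss / (n - 1)
         }'
}

printf '1\n2\n3\n4\n' | variances
# prints: divide_by_n=1.2500 divide_by_n_minus_1=1.6667
```

For these four samples the two results differ by a factor of n / (n - 1) = 4/3, which no reasonable tolerance would absorb.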

jacobkahn commented 4 years ago

@kriswuollett, we don't directly use af::var in flashlight, but we're changing the behavior of that test so things work. Thanks for flagging.

Tests SegFaults on pre-build Docker image #494