flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

wav2letter-cuda:latest test fails in docker #772

Closed baruchih closed 4 years ago

baruchih commented 4 years ago

Bug Description

Hello,

I downloaded the latest wav2letter-cuda image (Docker Hub digest: sha256:228eb912d5a61a151de4abf9cd9c25a4aa3a9a912ed2d538c7a3a574d948b11a).

When I run make test inside the Docker container, the tests fail. I reproduced this on 2 separate computers, with CUDA 10.1 and CUDA 10.2. The same tests fail.

Test result:

Running tests...
Test project /root/wav2letter/build
      Start  1: W2lCommonTest
 1/31 Test  #1: W2lCommonTest ....................***Failed    1.00 sec
      Start  2: DictionaryTest
 2/31 Test  #2: DictionaryTest ...................   Passed    0.03 sec
      Start  3: ProducerConsumerQueueTest
 3/31 Test  #3: ProducerConsumerQueueTest ........   Passed    0.03 sec
      Start  4: CriterionTest
 4/31 Test  #4: CriterionTest ....................***Failed    0.07 sec
      Start  5: Seq2SeqTest
 5/31 Test  #5: Seq2SeqTest ......................***Failed    0.04 sec
      Start  6: AttentionTest
 6/31 Test  #6: AttentionTest ....................***Failed    0.07 sec
      Start  7: WindowTest
 7/31 Test  #7: WindowTest .......................***Failed    0.04 sec
      Start  8: DataTest
 8/31 Test  #8: DataTest .........................***Failed    0.11 sec
      Start  9: ListFileDatasetTest
 9/31 Test  #9: ListFileDatasetTest ..............***Failed    0.03 sec
      Start 10: SoundTest
10/31 Test #10: SoundTest ........................   Passed    0.09 sec
      Start 11: DecoderTest
11/31 Test #11: DecoderTest ......................   Passed    1.56 sec
      Start 12: CeplifterTest
12/31 Test #12: CeplifterTest ....................   Passed    0.03 sec
      Start 13: DctTest
13/31 Test #13: DctTest ..........................   Passed    0.08 sec
      Start 14: DerivativesTest
14/31 Test #14: DerivativesTest ..................   Passed    0.03 sec
      Start 15: DitherTest
15/31 Test #15: DitherTest .......................   Passed    8.03 sec
      Start 16: MfccTest
16/31 Test #16: MfccTest .........................   Passed    0.54 sec
      Start 17: PreEmphasisTest
17/31 Test #17: PreEmphasisTest ..................   Passed    0.03 sec
      Start 18: SpeechUtilsTest
18/31 Test #18: SpeechUtilsTest ..................***Failed    0.03 sec
      Start 19: TriFilterbankTest
19/31 Test #19: TriFilterbankTest ................   Passed    0.03 sec
      Start 20: WindowingTest
20/31 Test #20: WindowingTest ....................   Passed    0.03 sec
      Start 21: W2lModuleTest
21/31 Test #21: W2lModuleTest ....................***Failed    0.04 sec
      Start 22: RuntimeTest
22/31 Test #22: RuntimeTest ......................***Failed    0.03 sec
      Start 23: inference_Conv1dTest
23/31 Test #23: inference_Conv1dTest .............   Passed    0.01 sec
      Start 24: inference_IdentityTest
24/31 Test #24: inference_IdentityTest ...........   Passed    0.01 sec
      Start 25: inference_LayerNormTest
25/31 Test #25: inference_LayerNormTest ..........   Passed    0.02 sec
      Start 26: inference_LinearTest
26/31 Test #26: inference_LinearTest .............   Passed    0.01 sec
      Start 27: inference_LogMelFeatureTest
27/31 Test #27: inference_LogMelFeatureTest ......   Passed    0.06 sec
      Start 28: inference_MemoryManagerTest
28/31 Test #28: inference_MemoryManagerTest ......   Passed    0.01 sec
      Start 29: inference_ReluTest
29/31 Test #29: inference_ReluTest ...............   Passed    0.01 sec
      Start 30: inference_ResidualTest
30/31 Test #30: inference_ResidualTest ...........   Passed    0.01 sec
      Start 31: inference_TDSBlockTest
31/31 Test #31: inference_TDSBlockTest ...........   Passed    0.01 sec

68% tests passed, 10 tests failed out of 31

Total Test time (real) =  12.14 sec

The following tests FAILED:
      1 - W2lCommonTest (Failed)
      4 - CriterionTest (Failed)
      5 - Seq2SeqTest (Failed)
      6 - AttentionTest (Failed)
      7 - WindowTest (Failed)
      8 - DataTest (Failed)
      9 - ListFileDatasetTest (Failed)
     18 - SpeechUtilsTest (Failed)
     21 - W2lModuleTest (Failed)
     22 - RuntimeTest (Failed)
Errors while running CTest
Makefile:118: recipe for target 'test' failed
make: *** [test] Error 8

Running W2lCommonTest on its own, for example, produces this output:

[==========] Running 19 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 19 tests from W2lCommonTest
[ RUN      ] W2lCommonTest.StringTrim
[       OK ] W2lCommonTest.StringTrim (0 ms)
[ RUN      ] W2lCommonTest.ReplaceAll
[       OK ] W2lCommonTest.ReplaceAll (0 ms)
[ RUN      ] W2lCommonTest.StringSplit
[       OK ] W2lCommonTest.StringSplit (0 ms)
[ RUN      ] W2lCommonTest.StringJoin
[       OK ] W2lCommonTest.StringJoin (0 ms)
[ RUN      ] W2lCommonTest.StringFormat
[       OK ] W2lCommonTest.StringFormat (0 ms)
[ RUN      ] W2lCommonTest.PathsConcat
[       OK ] W2lCommonTest.PathsConcat (0 ms)
[ RUN      ] W2lCommonTest.RetryWithBackoff
[       OK ] W2lCommonTest.RetryWithBackoff (753 ms)
[ RUN      ] W2lCommonTest.PackReplabels
[       OK ] W2lCommonTest.PackReplabels (0 ms)
[ RUN      ] W2lCommonTest.Dictionary
[       OK ] W2lCommonTest.Dictionary (0 ms)
[ RUN      ] W2lCommonTest.UnpackReplabels
[       OK ] W2lCommonTest.UnpackReplabels (0 ms)
[ RUN      ] W2lCommonTest.UnpackReplabelsIgnoresInvalid
[       OK ] W2lCommonTest.UnpackReplabelsIgnoresInvalid (0 ms)
[ RUN      ] W2lCommonTest.Uniq
[       OK ] W2lCommonTest.Uniq (0 ms)
[ RUN      ] W2lCommonTest.Normalize
unknown file: Failure
C++ exception with description "ArrayFire Exception (Runtime error :103):
In function cuda::DeviceManager::DeviceManager()
In file src/backend/cuda/device_manager.cpp:531
Error initializing CUDA runtime. Check your CUDA device is visible to the OS and you have installed the correct driver. Try running the nvidia-smi utility to debug any driver issues.
 0# 0x00007F04A992D810 in /usr/local/lib/libafcuda.so.3
 1# 0x00007F04A992DE7D in /usr/local/lib/libafcuda.so.3
 2# 0x00007F04A9CC73FA in /usr/local/lib/libafcuda.so.3
 3# 0x00007F04A9CC3CEC in /usr/local/lib/libafcuda.so.3
 4# 0x00007F04A973762D in /usr/local/lib/libafcuda.so.3
 5# 0x00007F04A973771F in /usr/local/lib/libafcuda.so.3
 6# 0x00007F04AB759FAD in /usr/local/lib/libafcuda.so.3
 7# 0x00007F04AA50BB28 in /usr/local/lib/libafcuda.so.3
 8# af_randu in /usr/local/lib/libafcuda.so.3
 9# af::randu(af::dim4 const&, af_dtype) in /usr/local/lib/libafcuda.so.3
10# af::randu(long long, long long, long long, af_dtype) in /usr/local/lib/libafcuda.so.3
11# 0x000055C0B07B0417 in ./src/tests/W2lC" thrown in the test body.
[  FAILED  ] W2lCommonTest.Normalize (7 ms)
[ RUN      ] W2lCommonTest.Transpose
unknown file: Failure
C++ exception with description "ArrayFire Exception (Runtime error :103):
In function cuda::DeviceManager::DeviceManager()
In file src/backend/cuda/device_manager.cpp:531
Error initializing CUDA runtime. Check your CUDA device is visible to the OS and you have installed the correct driver. Try running the nvidia-smi utility to debug any driver issues.
 0# 0x00007F04A992D810 in /usr/local/lib/libafcuda.so.3
 1# 0x00007F04A992DE7D in /usr/local/lib/libafcuda.so.3
 2# 0x00007F04A9CC73FA in /usr/local/lib/libafcuda.so.3
 3# 0x00007F04A9CC3CEC in /usr/local/lib/libafcuda.so.3
 4# 0x00007F04A973762D in /usr/local/lib/libafcuda.so.3
 5# 0x00007F04A973771F in /usr/local/lib/libafcuda.so.3
 6# 0x00007F04AB759FAD in /usr/local/lib/libafcuda.so.3
 7# 0x00007F04AA50BB28 in /usr/local/lib/libafcuda.so.3
 8# af_randu in /usr/local/lib/libafcuda.so.3
 9# af::randu(af::dim4 const&, af_dtype) in /usr/local/lib/libafcuda.so.3
10# af::randu(long long, long long, long long, long long, af_dtype) in /usr/local/lib/libafcuda.so.3
11# 0x000055C0B07AFEDA in ./src" thrown in the test body.
[  FAILED  ] W2lCommonTest.Transpose (2 ms)
[ RUN      ] W2lCommonTest.localNormalize
unknown file: Failure
C++ exception with description "ArrayFire Exception (Runtime error :103):
In function cuda::DeviceManager::DeviceManager()
In file src/backend/cuda/device_manager.cpp:531
Error initializing CUDA runtime. Check your CUDA device is visible to the OS and you have installed the correct driver. Try running the nvidia-smi utility to debug any driver issues.
 0# 0x00007F04A992D810 in /usr/local/lib/libafcuda.so.3
 1# 0x00007F04A992DE7D in /usr/local/lib/libafcuda.so.3
 2# 0x00007F04A9CC73FA in /usr/local/lib/libafcuda.so.3
 3# 0x00007F04A9CC3CEC in /usr/local/lib/libafcuda.so.3
 4# 0x00007F04A973762D in /usr/local/lib/libafcuda.so.3
 5# 0x00007F04A973771F in /usr/local/lib/libafcuda.so.3
 6# 0x00007F04AB759FAD in /usr/local/lib/libafcuda.so.3
 7# 0x00007F04AA50BB28 in /usr/local/lib/libafcuda.so.3
 8# af_randu in /usr/local/lib/libafcuda.so.3
 9# af::randu(af::dim4 const&, af_dtype) in /usr/local/lib/libafcuda.so.3
10# af::randu(long long, long long, long long, long long, af_dtype) in /usr/local/lib/libafcuda.so.3
11# 0x000055C0B07B2694 in ./src" thrown in the test body.
[  FAILED  ] W2lCommonTest.localNormalize (2 ms)
[ RUN      ] W2lCommonTest.AfMatrixToStrings
unknown file: Failure
C++ exception with description "ArrayFire Exception (Runtime error :103):
In function cuda::DeviceManager::DeviceManager()
In file src/backend/cuda/device_manager.cpp:531
Error initializing CUDA runtime. Check your CUDA device is visible to the OS and you have installed the correct driver. Try running the nvidia-smi utility to debug any driver issues.
 0# 0x00007F04A992D810 in /usr/local/lib/libafcuda.so.3
 1# 0x00007F04A992DE7D in /usr/local/lib/libafcuda.so.3
 2# 0x00007F04A9CC73FA in /usr/local/lib/libafcuda.so.3
 3# 0x00007F04A9CC41EC in /usr/local/lib/libafcuda.so.3
 4# 0x00007F04A976196A in /usr/local/lib/libafcuda.so.3
 5# 0x00007F04A976263C in /usr/local/lib/libafcuda.so.3
 6# af_create_array in /usr/local/lib/libafcuda.so.3
 7# 0x00007F04AA63942A in /usr/local/lib/libafcuda.so.3
 8# af::array::array<int>(long long, long long, int const*, af_source) in /usr/local/lib/libafcuda.so.3
 9# 0x000055C0B07B6052 in ./src/tests/W2lCommonTest
10# 0x000055C0B09B567A in ./src/tests/W2lCommonTest
11# 0x000055C0B09AB023 in ./src/tests/W2lCommon" thrown in the test body.
[  FAILED  ] W2lCommonTest.AfMatrixToStrings (2 ms)
[ RUN      ] W2lCommonTest.WrdToTarget
Falling back to using letters as targets for the unknown word: 111
Falling back to using letters as targets for the unknown word: 199
Skipping unknown token '9' when falling back to letter target for the unknown word: 199
Skipping unknown token '9' when falling back to letter target for the unknown word: 199
Skipping unknown word '111' when generating target
Skipping unknown word '199' when generating target
[       OK ] W2lCommonTest.WrdToTarget (0 ms)
[ RUN      ] W2lCommonTest.TargetToSingleLtr
[       OK ] W2lCommonTest.TargetToSingleLtr (0 ms)
[ RUN      ] W2lCommonTest.UT8Split
[       OK ] W2lCommonTest.UT8Split (0 ms)
[----------] 19 tests from W2lCommonTest (766 ms total)

[----------] Global test environment tear-down
[==========] 19 tests from 1 test suite ran. (767 ms total)
[  PASSED  ] 15 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] W2lCommonTest.Normalize
[  FAILED  ] W2lCommonTest.Transpose
[  FAILED  ] W2lCommonTest.localNormalize
[  FAILED  ] W2lCommonTest.AfMatrixToStrings

 4 FAILED TESTS

Any idea what the issue is here?

Reproduction Steps

  1. run docker pull wav2letter/wav2letter:cuda-latest
  2. run docker run --gpus all --rm -itd --ipc=host --name w2l wav2letter/wav2letter:cuda-latest
  3. open a bash shell in the container
  4. go to /root/wav2letter/build and run make test
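
The steps above can be written down as a short script. This is only a sketch: the image name, container name, and build path are taken from the steps, while the docker exec calls (including the extra nvidia-smi check, which assumes nvidia-smi is available inside the container) are an assumed way of doing steps 3 and 4 non-interactively. The script prints the commands and does not run them.

```python
import shlex

IMAGE = "wav2letter/wav2letter:cuda-latest"
CONTAINER = "w2l"


def reproduction_commands(image=IMAGE, name=CONTAINER):
    """Return the reproduction steps as argv lists, in order."""
    return [
        # Steps 1 and 2, exactly as listed above.
        ["docker", "pull", image],
        ["docker", "run", "--gpus", "all", "--rm", "-itd",
         "--ipc=host", "--name", name, image],
        # Extra diagnostic (not one of the original steps): check that
        # the GPU is visible from inside the container.
        ["docker", "exec", name, "nvidia-smi"],
        # Steps 3 and 4 combined into one non-interactive call.
        ["docker", "exec", name, "bash", "-c",
         "cd /root/wav2letter/build && make test"],
    ]


if __name__ == "__main__":
    for argv in reproduction_commands():
        print(shlex.join(argv))
```

If the nvidia-smi call fails inside the container, the ArrayFire "Error initializing CUDA runtime" failures are to be expected, since the CUDA-dependent tests cannot see a device either.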

Platform and Hardware

Ubuntu 18.04

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1060    On   | 00000000:01:00.0  On |                  N/A |
| N/A   58C    P0    27W /  N/A |   1406MiB /  6078MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

tlikhomanenko commented 4 years ago

I ran the same commands you listed, and everything works for me.

The problem could be that your driver is not suitable for the CUDA version we use inside the Docker image. It doesn't matter which CUDA version you installed on your machine; only the CUDA driver matters. The error you get points exactly to this: "Error initializing CUDA runtime. Check your CUDA device is visible to the OS and you have installed the correct driver. Try running the nvidia-smi utility to debug any driver issues."

Could you confirm that other tools/frameworks don't have CUDA driver problems with your CUDA 10.1 and 10.2 installations?

My machine shows Driver Version: 418.116.00, CUDA Version: 10.1. Your driver is probably too new for the version we build against. One possible solution (if you don't want to downgrade your driver) is to rebuild the base Docker image for flashlight and then rebuild the Docker image for wav2letter.
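
A rough way to compare a driver against a CUDA toolkit version is to read the nvidia-smi banner and look up the minimum Linux driver that NVIDIA's release notes list for that toolkit. The sketch below does this for the versions mentioned in this thread. The function names are made up for illustration, and the toolkit version inside the Docker image is not stated here, so it is passed in as a parameter. The "CUDA Version" in the banner is the highest CUDA version the driver supports, not the installed toolkit.

```python
import re

# Minimum Linux driver per CUDA toolkit, as listed in NVIDIA's CUDA
# release notes. Only the releases relevant to this thread are included.
MIN_LINUX_DRIVER = {
    "10.0": (410, 48),
    "10.1": (418, 39),
    "10.2": (440, 33),
}


def parse_smi_header(text):
    """Return (driver_version, max_cuda_version) from the nvidia-smi banner."""
    m = re.search(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text)
    if m is None:
        raise ValueError("no nvidia-smi banner found")
    return m.group(1), m.group(2)


def version_tuple(version):
    """Turn '440.33.01' into (440, 33, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(driver, toolkit):
    """True if the driver is at least the listed minimum for the toolkit."""
    return version_tuple(driver) >= MIN_LINUX_DRIVER[toolkit]


banner = "| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |"
driver, max_cuda = parse_smi_header(banner)
print(driver, max_cuda, driver_supports(driver, "10.1"))
```

This check only catches a driver that is too old for the toolkit. A driver that passes it can still fail to initialize CUDA for other reasons, such as the device not being visible to the process.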

baruchih commented 4 years ago

Thanks for your reply, @tlikhomanenko. I tried to rebuild the Docker image, and the issue persisted. Then I tried a different environment, and the tests passed with no issues. I just had to expose the GPU using export CUDA_VISIBLE_DEVICES=1.
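
For context on what CUDA_VISIBLE_DEVICES does, the sketch below approximates how the CUDA runtime interprets the variable: a comma-separated list of device indices, where enumeration stops at the first invalid entry. The function name is hypothetical, and the sketch handles integer indices only (the real variable also accepts GPU UUIDs).

```python
def visible_devices(env_value, physical_count):
    """Approximate which physical GPU indices a CUDA process would see.

    env_value is the value of CUDA_VISIBLE_DEVICES (None if unset);
    physical_count is the number of GPUs installed in the machine.
    """
    if env_value is None:
        # Variable not set: every GPU is visible.
        return list(range(physical_count))
    visible = []
    for token in env_value.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= physical_count:
            # Only devices listed before the first invalid entry are visible.
            break
        visible.append(int(token))
    return visible


print(visible_devices("1", 2))  # second GPU on a two-GPU machine
print(visible_devices("1", 1))  # index 1 does not exist on a one-GPU machine
```

Under this reading, CUDA_VISIBLE_DEVICES=1 selects the second GPU on a multi-GPU machine, while on a single-GPU machine it would hide the only device.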

I will reinstall the driver on my first environment and try again.

Thanks for the help!

iggygeek commented 4 years ago

Hello, I have the same problem with the Docker image when following the wiki: sudo docker run --gpus all --rm -itd --ipc=host --name w2l wav2letter/wav2letter:cuda-latest (By the way, runtime=nvidia does not work with the new NVIDIA Docker, so the wiki needs a correction.)

I installed Docker, NVIDIA Docker, and the driver as required.

Thanks for any help!

tlikhomanenko commented 4 years ago

@iggygeek

What CUDA driver version do you have? Also, could you try building the image on your machine and test whether it works?

iggygeek commented 4 years ago

My apologies, it was just that all GPUs were busy on this machine...
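
A quick way to spot busy GPUs before running the tests is to query nvidia-smi in CSV mode and look at memory use and utilization. The sketch below assumes the fields come back as plain integers (some GPUs report "[N/A]" for certain fields, which this parser does not handle). The thresholds are arbitrary illustrative values, not anything recommended by wav2letter.

```python
import subprocess

# nvidia-smi query that prints one CSV row per GPU, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text):
    """Parse rows like '0, 1406, 6078, 2' into a list of dictionaries."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total, util = (field.strip() for field in line.split(","))
        rows.append({
            "index": int(index),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
            "util_pct": int(util),
        })
    return rows


def idle_gpus(rows, max_mem_fraction=0.5, max_util_pct=50):
    """Return indices of GPUs below both (arbitrary) busy thresholds."""
    return [
        row["index"]
        for row in rows
        if row["mem_used_mib"] / row["mem_total_mib"] <= max_mem_fraction
        and row["util_pct"] <= max_util_pct
    ]


def query_gpus():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)
```

On a machine with a driver installed, idle_gpus(query_gpus()) gives the candidate device indices; an empty list suggests that every GPU is already in use.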

tlikhomanenko commented 4 years ago

@baruchih feel free to reopen if the problem still persists.