flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

cannot run Decoding on Tesla T4 #335

Closed nestyme closed 5 years ago

nestyme commented 5 years ago

Hello!

First, I want to thank you for the great work you have done!

I have already successfully trained my model and had no problems running the Decoder (CUDA) in the CUDA Docker image on a Titan GTX. But doing the same on a Tesla T4 produces an error (with the CPU Docker image it works correctly):

After running ./Decoder --flagsfile decode_flags.cfg:

what(): ArrayFire Exception (Internal error:998):
In function cuda::Kernel cuda::buildKernel(int, const string&, const string&, const std::vector<std::__cxx11::basic_string >&, bool)
In file src/backend/cuda/nvrtc/cache.cpp:160
NVRTC Error(5): NVRTC_ERROR_INVALID_OPTION

In function T* af::array::device() const [with T = void]
In file src/api/cpp/array.cpp:941
Aborted at 1560852631 (unix time) try "date -d @1560852631" if you are using GNU date
PC: @ 0x7f8d46aa0428 gsignal
SIGABRT (@0x6e) received by PID 110 (TID 0x7f8d9093d600) from PID 110; stack trace:
    @ 0x7f8d4e54f390 (unknown)
    @ 0x7f8d46aa0428 gsignal
    @ 0x7f8d46aa202a abort
    @ 0x7f8d473e384d gnu_cxx::verbose_terminate_handler()
    @ 0x7f8d473e16b6 (unknown)
    @ 0x7f8d473e1701 std::terminate()
    @ 0x7f8d473e1919 cxa_throw
    @ 0x7f8d69a05588 af::array::device<>()
    @ 0x6165a5 fl::DevicePtr::DevicePtr()
    @ 0x64e41e fl::conv2d()
    @ 0x629116 fl::Conv2D::forward()
    @ 0x636c7f fl::UnaryModule::forward()
    @ 0x6280e2 fl::Sequential::forward()
    @ 0x41b0f4 main
    @ 0x7f8d46a8b830 libc_start_main
    @ 0x475d19 _start
    @ 0x0 (unknown)

NVRTC Error(5): NVRTC_ERROR_INVALID_OPTION

Info about my system:
NVIDIA-SMI 430.14, Driver Version: 430.14, CUDA Version: 10.2
NVRM version: NVIDIA UNIX x86_64 Kernel Module 430.14 Wed May 8 01:10:53 UTC 2019
GCC version: gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)

While running the tests in wav2letter:

The following tests FAILED:
1 - W2lCommonTest (SEGFAULT)
2 - CriterionTest (SEGFAULT)
3 - Seq2SeqTest (SEGFAULT)
4 - AttentionTest (SEGFAULT)
5 - WindowTest (SEGFAULT)
6 - DataTest (Failed)
18 - W2lModuleTest (SEGFAULT)
19 - RuntimeTest (Failed)
Errors while running CTest
Makefile:104: recipe for target 'test' failed
make: *** [test] Error 8

While running the tests in flashlight:

The following tests FAILED:
1 - AutogradTest (SEGFAULT)
2 - OptimTest (SEGFAULT)
3 - ModuleTest (SEGFAULT)
4 - SerializationTest (SEGFAULT)
5 - UtilsTest (Failed)
6 - DatasetTest (SEGFAULT)
7 - MeterTest (Failed)
8 - AllReduceTest (SEGFAULT)
9 - ContribModuleTest (SEGFAULT)
10 - ContribSerializationTest (Failed)
Errors while running CTest
Makefile:71: recipe for target 'test' failed

As mentioned in #314, I tried switching to cuda-0b16293, but it did not work for me.

I will be grateful for any help. Thank you!

tlikhomanenko commented 5 years ago

Hi @nestyme,

This error could be caused by the GPU driver. Please check https://github.com/facebookresearch/wav2letter/issues/229, which looks like the same issue. Could you repeat the steps suggested there?

nestyme commented 5 years ago

Hi @tlikhomanenko! Thanks a lot for your help. I solved this problem by building ArrayFire 3.6.4 and the other libraries from source. The problem appeared because the Tesla T4 supports only CUDA 10.0 and 10.1, but the ArrayFire used in wav2letter++ is a version <3.6.2.

tlikhomanenko commented 5 years ago

@nestyme

The latest Docker images are built with ArrayFire 3.6.4, so you can use them now too.

C5YS commented 5 years ago

Hello, I have the same problem. I use nvidia-docker (sudo docker run --runtime=nvidia --rm -itd --ipc=host --name w2l wav2letter/wav2letter:cuda-latest). The system specifications are:
- Ubuntu 18.04
- Nvidia 2080 Ti
- Driver Version: 418.67
- CUDA 10

When testing ...

Running tests ...

Test project /root/wav2letter/build
      Start  1: W2lCommonTest
 1/19 Test  #1: W2lCommonTest ....................***Exception: SegFault  2.19 sec
      Start  2: CriterionTest
 2/19 Test  #2: CriterionTest ....................***Exception: SegFault  1.29 sec
      Start  3: Seq2SeqTest
 3/19 Test  #3: Seq2SeqTest ......................***Exception: SegFault  1.25 sec
      Start  4: AttentionTest
 4/19 Test  #4: AttentionTest ....................***Failed    1.65 sec
      Start  5: WindowTest
 5/19 Test  #5: WindowTest .......................***Exception: SegFault  1.19 sec
      Start  6: DataTest
 6/19 Test  #6: DataTest .........................***Exception: Other  1.34 sec
      Start  7: DecoderTest
 7/19 Test  #7: DecoderTest ......................   Passed    1.02 sec
      Start  8: CeplifterTest
 8/19 Test  #8: CeplifterTest ....................   Passed    0.09 sec
      Start  9: DctTest
 9/19 Test  #9: DctTest ..........................   Passed    0.18 sec
      Start 10: DerivativesTest
10/19 Test #10: DerivativesTest ..................   Passed    0.11 sec
      Start 11: DitherTest
11/19 Test #11: DitherTest .......................   Passed    8.10 sec
      Start 12: MfccTest
12/19 Test #12: MfccTest .........................   Passed    0.21 sec
      Start 13: PreEmphasisTest
13/19 Test #13: PreEmphasisTest ..................   Passed    0.10 sec
      Start 14: SoundTest
14/19 Test #14: SoundTest ........................   Passed    0.14 sec
      Start 15: SpeechUtilsTest
15/19 Test #15: SpeechUtilsTest ..................   Passed    2.43 sec
      Start 16: TriFilterbankTest
16/19 Test #16: TriFilterbankTest ................   Passed    0.13 sec
      Start 17: WindowingTest
17/19 Test #17: WindowingTest ....................   Passed    0.05 sec
      Start 18: W2lModuleTest
18/19 Test #18: W2lModuleTest ....................***Exception: SegFault  3.54 sec
      Start 19: RuntimeTest
19/19 Test #19: RuntimeTest ......................***Failed    9.23 sec

58% tests passed, 8 tests failed out of 19

Total Test time (real) = 34.31 sec

The following tests FAILED:
1 - W2lCommonTest (SEGFAULT)
2 - CriterionTest (SEGFAULT)
3 - Seq2SeqTest (SEGFAULT)
4 - AttentionTest (Failed)
5 - WindowTest (SEGFAULT)
6 - DataTest (OTHER_FAULT)
18 - W2lModuleTest (SEGFAULT)
19 - RuntimeTest (Failed)
Errors while running CTest
Makefile:104: recipe for target 'test' failed
make: *** [test] Error 8

nvcc -V inside docker: Cuda compilation tools, release 9.2, V9.2.148
nvcc -V without docker: Cuda compilation tools, release 10.0, V10.0.130

I do not know whether this affects anything ...
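To compare the two toolkits without reading the banners by eye, the release number can be pulled out of the `nvcc -V` output. This is only a minimal sketch: the helper name `cuda_release` is made up for illustration, and it assumes the banner keeps the "release X.Y," form shown above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `nvcc -V` output on stdin and print only the
# toolkit release (for example "9.2"). Assumes the "release X.Y," banner form.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Usage (run once on the host and once inside the container):
#   nvcc -V | cuda_release
```

The binaries in the image are built against the toolkit inside the container, so it is presumably the container's value that matters for the failing tests; the host mainly contributes the driver.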

Thanks so much for reading.

tlikhomanenko commented 5 years ago

Hi @C5YS,

Could you run each test separately to make sure that the error looks like this NVRTC Error(5): NVRTC_ERROR_INVALID_OPTION?
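One way to run the failing tests one at a time is to turn the "The following tests FAILED" list into individual `ctest` invocations. The sketch below assumes the `make test` output was saved to a file; the function name `rerun_commands` and the log file name are hypothetical, while `-R` (regex filter) and `--output-on-failure` are standard `ctest` options.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read a saved `make test` log on stdin and print one
# ctest command per entry of the "The following tests FAILED" list.
rerun_commands() {
    sed -n 's/^[[:space:]]*[0-9][0-9]* - \([A-Za-z0-9_]*\) (.*)$/ctest -R "^\1$" --output-on-failure/p'
}

# Usage, from the build directory (for example /root/wav2letter/build):
#   make test 2>&1 | tee test.log
#   rerun_commands < test.log
```

Each printed command can then be run on its own, so the full error message of a single test (such as the NVRTC error) is not buried in the summary.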

Here is the compatibility table: https://github.com/NVIDIA/nvidia-docker/wiki/CUDA. You need a driver version that supports CUDA 9.2 in Docker. But the problem comes from what @nestyme described above:

I solved this problem with building from source ArrayFire 3.6.4 version and another libraries. This problem appeared because Tesla T 4 supports only CUDA 10.0 and 10.1 versions, but ArrayFire in wav2letter++ <3.6.2 version

So your GPU supports only CUDA 10. I think the simplest way is to rebuild all images from the Dockerfiles (for flashlight: base and gpu; then for wav2letter: base and gpu), changing the version of the NVIDIA Docker base image to CUDA 10. Could you do this? Do you need more detailed instructions on how to do this?
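As a rough sketch of the "change the base image" step, the first FROM line of a Dockerfile can be rewritten before rebuilding. The function name `retarget_base_image` and the Dockerfile path in the usage comment are hypothetical; the image tag is the one mentioned later in this thread, and whether it suits every Dockerfile in the two repositories is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: replace the first FROM line of a Dockerfile with a
# new base image, leaving every other line untouched.
retarget_base_image() {
    local dockerfile="$1" new_image="$2" tmp
    tmp="$(mktemp)"
    awk -v img="$new_image" '
        !done && /^FROM / { print "FROM " img; done = 1; next }
        { print }
    ' "$dockerfile" > "$tmp" && mv "$tmp" "$dockerfile"
}

# Usage (the path is a placeholder, not the repository's actual layout):
#   retarget_base_image path/to/Dockerfile nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
#   sudo docker build -f path/to/Dockerfile -t my-flashlight-base .
```

The images depend on each other, so they would need to be rebuilt in the order given above, with each later Dockerfile pointing at the freshly built image it derives from.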

C5YS commented 5 years ago

Thank you very much for answering, @tlikhomanenko. Could you give me more detailed instructions on rebuilding all the Docker images from the Dockerfiles for CUDA 10, please? I am new to this, and I really appreciate your help.

nestyme commented 5 years ago

Hi @C5YS. Maybe it will be easier to build everything from source. I found a detailed tutorial on how to do that: https://medium.com/@shaheenkader/how-to-install-wav2letter-dc94c3b74e97

tlikhomanenko commented 5 years ago

hi @C5YS,

I have built Docker images with CUDA 10.0 for you. Please give them a try:

sudo docker run --runtime=nvidia --rm -itd --ipc=host --name w2l wav2letter/wav2letter:cuda-10-latest

C5YS commented 5 years ago

Hello, thank you very much everyone for the help.

I installed the Docker image with CUDA 10, and during training it produces the following error:

*** Aborted at 1563327542 (unix time) try "date -d @1563327542" if you are using GNU date ***
PC: @ 0x7f400e5b5740 GpuCTC<>::setup_gpu_metadata()
*** SIGSEGV (@0xffffffff5e79461c) received by PID 1238 (TID 0x7f400ee72600) from PID 1585006108; stack trace: ***
    @ 0x7f3fc953c390 (unknown)
    @ 0x7f400e5b5740 GpuCTC<>::setup_gpu_metadata()
    @ 0x7f400e5b59f2 GpuCTC<>::compute_cost_and_score()
    @ 0x7f400e5b2d5d compute_ctc_loss
    @ 0x56bd8a w2l::ConnectionistTemporalClassificationCriterion::forward()
    @ 0x47df7c _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddbiE3_clES2_S5_S7_S9_S9_S9_ddbi.constprop.11262
    @ 0x41b752 main
    @ 0x7f3fc167a830 __libc_start_main
    @ 0x479279 _start
    @ 0x0 (unknown)
Segmentation fault (core dumped)

This is the same error as in #223.

And when changing "--criterion" from ctc to asg, I get the following error:

terminate called after throwing an instance of 'std::invalid_argument'
  what():  Unknown index in dictionary '1024674700'
*** Aborted at 1563327648 (unix time) try "date -d @1563327648" if you are using GNU date ***
PC: @     0x7fd0db681428 gsignal
*** SIGABRT (@0x5c2) received by PID 1474 (TID 0x7fd128e64600) from PID 1474; stack trace: ***
    @     0x7fd0e352e390 (unknown)
    @     0x7fd0db681428 gsignal
    @     0x7fd0db68302a abort
    @     0x7fd0dbfc484d __gnu_cxx::__verbose_terminate_handler()
    @     0x7fd0dbfc26b6 (unknown)
    @     0x7fd0dbfc2701 std::terminate()
    @     0x7fd0dbfc2919 __cxa_throw
    @           0x5580db _ZNK3w2l10Dictionary8getEntryB5cxx11Ei
    @           0x564171 _ZN3w2l10tknIdx2LtrB5cxx11ERKSt6vectorIiSaIiEERKNS_10DictionaryE
    @           0x56619d _ZN3w2l17tknPrediction2LtrB5cxx11ESt6vectorIiSaIiEERKNS_10DictionaryE
    @           0x47b087 _ZZ4mainENKUlRKN2af5arrayES2_RN3w2l13DatasetMetersEE1_clES2_S2_S5_
    @           0x47e188 _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddbiE3_clES2_S5_S7_S9_S9_ddbi.constprop.11262
    @           0x41b752 main
    @     0x7fd0db66c830 __libc_start_main
    @           0x479279 _start
    @                0x0 (unknown)
Aborted (core dumped)

The same as: #349

I do not think the problem is the data set with which I train (it has worked on other occasions with cpu and gpu-cuda 9.2).

tlikhomanenko commented 5 years ago

Hi @C5YS,

Could you first run all the tests from flashlight and wav2letter to make sure that the previous problem with the CUDA version is resolved?

If all tests now pass, please open a new issue with the above comments on the errors (and specify which Docker image and which GPU type you are using).

C5YS commented 5 years ago

Hi @tlikhomanenko. When I run "cd /root/wav2letter/build && make test", it shows me the following:

Test project /root/wav2letter/build
      Start  1: W2lCommonTest
 1/20 Test  #1: W2lCommonTest ....................   Passed   10.60 sec
      Start  2: DictionaryTest
 2/20 Test  #2: DictionaryTest ...................   Passed    0.14 sec
      Start  3: CriterionTest
 3/20 Test  #3: CriterionTest ....................***Exception: SegFault  1.69 sec
      Start  4: Seq2SeqTest
 4/20 Test  #4: Seq2SeqTest ......................   Passed   13.38 sec
      Start  5: AttentionTest
 5/20 Test  #5: AttentionTest ....................   Passed    3.73 sec
      Start  6: WindowTest
 6/20 Test  #6: WindowTest .......................   Passed    2.78 sec
      Start  7: DataTest
 7/20 Test  #7: DataTest .........................   Passed    1.38 sec
      Start  8: SoundTest
 8/20 Test  #8: SoundTest ........................   Passed    0.30 sec
      Start  9: DecoderTest
 9/20 Test  #9: DecoderTest ......................   Passed    1.08 sec
      Start 10: CeplifterTest
10/20 Test #10: CeplifterTest ....................   Passed    0.11 sec
      Start 11: DctTest
11/20 Test #11: DctTest ..........................   Passed    0.26 sec
      Start 12: DerivativesTest
12/20 Test #12: DerivativesTest ..................   Passed    0.11 sec
      Start 13: DitherTest
13/20 Test #13: DitherTest .......................   Passed    8.11 sec
      Start 14: MfccTest
14/20 Test #14: MfccTest .........................   Passed    0.38 sec
      Start 15: PreEmphasisTest
15/20 Test #15: PreEmphasisTest ..................   Passed    0.22 sec
      Start 16: SpeechUtilsTest
16/20 Test #16: SpeechUtilsTest ..................   Passed    1.34 sec
      Start 17: TriFilterbankTest
17/20 Test #17: TriFilterbankTest ................   Passed    0.14 sec
      Start 18: WindowingTest
18/20 Test #18: WindowingTest ....................   Passed    0.10 sec
      Start 19: W2lModuleTest
19/20 Test #19: W2lModuleTest ....................   Passed    3.62 sec
      Start 20: RuntimeTest
20/20 Test #20: RuntimeTest ......................   Passed    2.06 sec

95% tests passed, 1 tests failed out of 20

Total Test time (real) =  51.63 sec

The following tests FAILED:
      3 - CriterionTest (SEGFAULT)
Errors while running CTest
Makefile:104: recipe for target 'test' failed
make: *** [test] Error 8

Test from flashlight:

~/flashlight/build# make test    
Running tests...
Test project /root/flashlight/build
      Start  1: AutogradTest
 1/10 Test  #1: AutogradTest .....................   Passed   40.54 sec
      Start  2: OptimTest
 2/10 Test  #2: OptimTest ........................   Passed    1.53 sec
      Start  3: ModuleTest
 3/10 Test  #3: ModuleTest .......................   Passed    4.35 sec
      Start  4: SerializationTest
 4/10 Test  #4: SerializationTest ................   Passed    8.11 sec
      Start  5: UtilsTest
 5/10 Test  #5: UtilsTest ........................   Passed    0.94 sec
      Start  6: DatasetTest
 6/10 Test  #6: DatasetTest ......................   Passed    2.88 sec
      Start  7: MeterTest
 7/10 Test  #7: MeterTest ........................   Passed    0.99 sec
      Start  8: AllReduceTest
 8/10 Test  #8: AllReduceTest ....................   Passed    1.98 sec
      Start  9: ContribModuleTest
 9/10 Test  #9: ContribModuleTest ................   Passed    3.58 sec
      Start 10: ContribSerializationTest
10/10 Test #10: ContribSerializationTest .........   Passed    3.05 sec

100% tests passed, 0 tests failed out of 10

Total Test time (real) =  67.96 sec

Thanks for the help, I really appreciate it.

tlikhomanenko commented 5 years ago

@C5YS, the issue with what(): Unknown index in dictionary '1024674700' is resolved in https://github.com/facebookresearch/wav2letter/issues/349.

I haven't updated the Docker images yet, but you can go into the container, update the wav2letter folder, and rerun cmake and make inside it.

tlikhomanenko commented 5 years ago

For the issue with CTC, please look at https://github.com/facebookresearch/wav2letter/issues/370 (still in progress).

khu834 commented 4 years ago

@C5YS, the issue with what(): Unknown index in dictionary '1024674700' is resolved #349.

Didn't update docker images yet, but you can go into container and update the wav2letter folder, rerun cmake and make inside it.

For anyone rebuilding from inside the CUDA 10 Docker container: apart from pulling the latest wav2letter, you'll also need to pull, build, and install the latest flashlight.
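The rebuild inside the container could look roughly like the sketch below. It assumes that /root/flashlight and /root/wav2letter (the paths seen in the test logs above) are git checkouts with already configured build directories; the helper names `run` and `rebuild` and the `DRY_RUN` and `JOBS` variables are made up for illustration.

```bash
#!/usr/bin/env bash
# Print each command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: pull one checkout and rebuild it in its existing
# build directory. $1: source directory, $2: "install" to also install.
rebuild() {
    local src="$1" install="${2:-}"
    run git -C "$src" pull
    (
        run cd "$src/build"
        run cmake ..                      # reuses the cached configuration
        run make -j "${JOBS:-4}"
        if [ "$install" = "install" ]; then
            run make install
        fi
    )
}

# Usage inside the container (for example after `sudo docker exec -it w2l bash`):
#   rebuild /root/flashlight install   # flashlight first, since wav2letter links against it
#   rebuild /root/wav2letter
```

Setting DRY_RUN=1 prints the commands without running them, which is a cheap way to check the paths before starting a long build.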

When building the flashlight base image using "nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04", you may need to remove the very last line, "ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1".

At least when I pulled the cuda:10.0 image, the file already existed, so you'll get a "File exists" error when linking.
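Instead of deleting the line, the link step can be made tolerant of an existing file. This is a minimal sketch of that idea; the function name `link_if_missing` is hypothetical, and it deliberately leaves any existing file or link untouched rather than overwriting it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create the symlink only if nothing exists at the
# link path yet (regular file, directory, or possibly dangling symlink).
link_if_missing() {
    local target="$1" link="$2"
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "skip: $link already exists"
    else
        ln -s "$target" "$link"
    fi
}

# Usage with the paths from the line quoted above:
#   link_if_missing /usr/local/cuda/lib64/stubs/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1
```

In a Dockerfile RUN step the same guard can be written inline as `[ -e LINK ] || ln -s TARGET LINK`, which should build on both the older and the newer base images.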