flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Train error :(CriterionTest....................***Exception: SegFault 1.10 sec) #370

Closed OUC-lan closed 3 years ago

OUC-lan commented 5 years ago

I built wav2letter using Docker:

sudo docker run --runtime=nvidia --rm -itd --ipc=host --name w2l wav2letter/wav2letter:cuda-10-latest 
sudo docker exec -it w2l bash

When trying to run Train, I get this error:

root@0c7246f8bb8b:~/wav2letter/build# ./Train train --flagsfile ../tutorials/1-librispeech_clean/train.cfg 
*** Aborted at 1564390732 (unix time) try "date -d @1564390732" if you are using GNU date ***
PC: @     0x7fbb9390b740 GpuCTC<>::setup_gpu_metadata()
*** SIGSEGV (@0xffffffff5c7e0eb0) received by PID 243 (TID 0x7fbb941c8600) from PID 1551765168; stack trace: ***
    @     0x7fbb4e892390 (unknown)
    @     0x7fbb9390b740 GpuCTC<>::setup_gpu_metadata()
    @     0x7fbb9390b9f2 GpuCTC<>::compute_cost_and_score()
    @     0x7fbb93908d5d compute_ctc_loss
    @           0x56bd8a w2l::ConnectionistTemporalClassificationCriterion::forward()
    @           0x47df7c _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddbiE3_clES2_S5_S7_S9_S9_ddbi.constprop.11262
    @           0x41b752 main
    @     0x7fbb469d0830 __libc_start_main
    @           0x479279 _start
    @                0x0 (unknown)
Segmentation fault (core dumped)

Then I ran make test for both wav2letter++ and flashlight.

Results for wav2letter++:

root@0c7246f8bb8b:~/wav2letter/build# make test
Running tests...
Test project /root/wav2letter/build
      Start  1: W2lCommonTest
 1/20 Test  #1: W2lCommonTest ....................   Passed    2.84 sec
      Start  2: DictionaryTest
 2/20 Test  #2: DictionaryTest ...................   Passed    0.02 sec
      Start  3: CriterionTest
 3/20 Test  #3: CriterionTest ....................***Exception: SegFault  1.10 sec
      Start  4: Seq2SeqTest
 4/20 Test  #4: Seq2SeqTest ......................   Passed    6.94 sec
      Start  5: AttentionTest
 5/20 Test  #5: AttentionTest ....................   Passed    3.30 sec
      Start  6: WindowTest
 6/20 Test  #6: WindowTest .......................   Passed    2.33 sec
      Start  7: DataTest
 7/20 Test  #7: DataTest .........................   Passed    0.89 sec
      Start  8: SoundTest
 8/20 Test  #8: SoundTest ........................   Passed    0.04 sec
      Start  9: DecoderTest
 9/20 Test  #9: DecoderTest ......................   Passed    0.82 sec
      Start 10: CeplifterTest
10/20 Test #10: CeplifterTest ....................   Passed    0.02 sec
      Start 11: DctTest
11/20 Test #11: DctTest ..........................   Passed    0.05 sec
      Start 12: DerivativesTest
12/20 Test #12: DerivativesTest ..................   Passed    0.02 sec
      Start 13: DitherTest
13/20 Test #13: DitherTest .......................   Passed    8.03 sec
      Start 14: MfccTest
14/20 Test #14: MfccTest .........................   Passed    0.19 sec
      Start 15: PreEmphasisTest
15/20 Test #15: PreEmphasisTest ..................   Passed    0.02 sec
      Start 16: SpeechUtilsTest
16/20 Test #16: SpeechUtilsTest ..................   Passed    0.96 sec
      Start 17: TriFilterbankTest
17/20 Test #17: TriFilterbankTest ................   Passed    0.02 sec
      Start 18: WindowingTest
18/20 Test #18: WindowingTest ....................   Passed    0.02 sec
      Start 19: W2lModuleTest
19/20 Test #19: W2lModuleTest ....................   Passed    3.19 sec
      Start 20: RuntimeTest
20/20 Test #20: RuntimeTest ......................   Passed    1.92 sec

95% tests passed, 1 tests failed out of 20

Total Test time (real) =  32.72 sec

The following tests FAILED:
      3 - CriterionTest (SEGFAULT)
Errors while running CTest
Makefile:104: recipe for target 'test' failed
make: *** [test] Error 8

root@0c7246f8bb8b:~/wav2letter/build# src/tests/CriterionTest 
[==========] Running 17 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 17 tests from CriterionTest
[ RUN      ] CriterionTest.CTCEmptyTarget
Segmentation fault (core dumped)

Results for flashlight:

root@0c7246f8bb8b:~/flashlight/build# make test
Running tests...
Test project /root/flashlight/build
      Start  1: AutogradTest
 1/10 Test  #1: AutogradTest .....................   Passed   32.47 sec
      Start  2: OptimTest
 2/10 Test  #2: OptimTest ........................   Passed    1.39 sec
      Start  3: ModuleTest
 3/10 Test  #3: ModuleTest .......................   Passed    3.99 sec
      Start  4: SerializationTest
 4/10 Test  #4: SerializationTest ................   Passed    6.51 sec
      Start  5: UtilsTest
 5/10 Test  #5: UtilsTest ........................   Passed    0.88 sec
      Start  6: DatasetTest
 6/10 Test  #6: DatasetTest ......................   Passed    2.46 sec
      Start  7: MeterTest
 7/10 Test  #7: MeterTest ........................   Passed    0.84 sec
      Start  8: AllReduceTest
 8/10 Test  #8: AllReduceTest ....................   Passed    1.47 sec
      Start  9: ContribModuleTest
 9/10 Test  #9: ContribModuleTest ................   Passed    3.00 sec
      Start 10: ContribSerializationTest
10/10 Test #10: ContribSerializationTest .........   Passed    2.55 sec

100% tests passed, 0 tests failed out of 10

Total Test time (real) =  55.57 sec

Any help would be appreciated.

jacobkahn commented 5 years ago

@OUC-lan, can you run CriterionTest without the CTCEmptyTarget test to see whether everything else passes? That test has given us problems in the past.
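
One way to skip a single test without editing the test source is GoogleTest's --gtest_filter flag, where a leading '-' excludes the listed tests. The sketch below is illustrative only: gtest_exclude_filter is a hypothetical helper, and the binary path is assumed from the log earlier in this thread.

```shell
#!/bin/sh
# Join test names into one GoogleTest negative filter: "-Name1:Name2".
# gtest_exclude_filter is a hypothetical helper, not part of wav2letter.
gtest_exclude_filter() {
  filter=""
  for name in "$@"; do
    # Separate names with ':' (no separator before the first one).
    filter="${filter:+$filter:}$name"
  done
  printf '%s\n' "-$filter"
}

# Binary path as it appears in the log above (relative to the build directory).
TEST_BIN="${TEST_BIN:-src/tests/CriterionTest}"
FILTER="$(gtest_exclude_filter CriterionTest.CTCEmptyTarget)"

if [ -x "$TEST_BIN" ]; then
  # A leading '-' in --gtest_filter excludes the listed tests.
  "$TEST_BIN" --gtest_filter="$FILTER"
else
  # Binary not found: show the command instead of running it.
  echo "would run: $TEST_BIN --gtest_filter=$FILTER"
fi
```

Run from the wav2letter build directory, this should execute every CriterionTest case except CTCEmptyTarget. More names can be passed to the helper to exclude several tests at once.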

OUC-lan commented 5 years ago

I rebuilt wav2letter and ran CriterionTest without the CTCEmptyTarget test. Results for wav2letter++:

root@0c7246f8bb8b:~/wav2letter/build# make test
Running tests...
Test project /root/wav2letter/build
      Start  1: W2lCommonTest
 1/20 Test  #1: W2lCommonTest ....................   Passed    3.37 sec
      Start  2: DictionaryTest
 2/20 Test  #2: DictionaryTest ...................   Passed    0.03 sec
      Start  3: CriterionTest
 3/20 Test  #3: CriterionTest ....................***Failed    0.98 sec
      Start  4: Seq2SeqTest
 4/20 Test  #4: Seq2SeqTest ......................   Passed    6.66 sec
      Start  5: AttentionTest
 5/20 Test  #5: AttentionTest ....................   Passed    3.05 sec
      Start  6: WindowTest
 6/20 Test  #6: WindowTest .......................   Passed    2.10 sec
      Start  7: DataTest
 7/20 Test  #7: DataTest .........................   Passed    0.93 sec
      Start  8: SoundTest
 8/20 Test  #8: SoundTest ........................   Passed    0.05 sec
      Start  9: DecoderTest
 9/20 Test  #9: DecoderTest ......................   Passed    0.81 sec
      Start 10: CeplifterTest
10/20 Test #10: CeplifterTest ....................   Passed    0.03 sec
      Start 11: DctTest
11/20 Test #11: DctTest ..........................   Passed    0.05 sec
      Start 12: DerivativesTest
12/20 Test #12: DerivativesTest ..................   Passed    0.02 sec
      Start 13: DitherTest
13/20 Test #13: DitherTest .......................   Passed    8.04 sec
      Start 14: MfccTest
14/20 Test #14: MfccTest .........................   Passed    0.20 sec
      Start 15: PreEmphasisTest
15/20 Test #15: PreEmphasisTest ..................   Passed    0.03 sec
      Start 16: SpeechUtilsTest
16/20 Test #16: SpeechUtilsTest ..................   Passed    0.94 sec
      Start 17: TriFilterbankTest
17/20 Test #17: TriFilterbankTest ................   Passed    0.02 sec
      Start 18: WindowingTest
18/20 Test #18: WindowingTest ....................   Passed    0.02 sec
      Start 19: W2lModuleTest
19/20 Test #19: W2lModuleTest ....................   Passed    2.89 sec
      Start 20: RuntimeTest
20/20 Test #20: RuntimeTest ......................   Passed    1.90 sec

95% tests passed, 1 tests failed out of 20

Total Test time (real) =  32.12 sec

The following tests FAILED:
      3 - CriterionTest (Failed)
Errors while running CTest
Makefile:104: recipe for target 'test' failed
make: *** [test] Error 8
root@0c7246f8bb8b:~/wav2letter/build# src/tests/CriterionTest 
[==========] Running 16 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 16 tests from CriterionTest
[ RUN      ] CriterionTest.CTCCost
/root/wav2letter/src/criterion/test/CriterionTest.cpp:93: Failure
The difference between loss1.scalar<float>() and 0.0 is nan, which exceeds kEpsilon, where
loss1.scalar<float>() evaluates to nan,
0.0 evaluates to 0, and
kEpsilon evaluates to 9.9999997473787516e-06.
[  FAILED  ] CriterionTest.CTCCost (619 ms)
[ RUN      ] CriterionTest.CTCJacobian
unknown file: Failure
C++ exception with description "Error: compute_ctc_loss, stat = execution failed" thrown in the test body.
[  FAILED  ] CriterionTest.CTCJacobian (194 ms)
[ RUN      ] CriterionTest.Batching
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void* cuda::MemoryManager::nativeAlloc(size_t)
In file src/backend/cuda/memory.cpp:149
CUDA Error (77): an illegal memory access was encountered

In function af::array af::randu(const af::dim4&, af::dtype)
In file src/api/cpp/random.cpp:78" thrown in the test body.
[  FAILED  ] CriterionTest.Batching (0 ms)
[ RUN      ] CriterionTest.CTCCompareTensorflow
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function cuda::Array<T>::Array(af::dim4, const T*, bool, bool) [with T = float]
In file src/backend/cuda/Array.cpp:74
CUDA Error (77): an illegal memory access was encountered

In function void {anonymous}::initDataArray(void**, const void*, af::dtype, af::source, dim_t, dim_t, dim_t, dim_t)
In file src/api/cpp/array.cpp:103" thrown in the test body.
[  FAILED  ] CriterionTest.CTCCompareTensorflow (0 ms)
[ RUN      ] CriterionTest.ViterbiPath
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void cuda::evalNodes(std::vector<cuda::Param<T> >&, std::vector<common::Node*>) [with T = float]
In file src/backend/cuda/jit.cpp:329
CU Error CUDA_ERROR_ILLEGAL_ADDRESS(700): an illegal memory access was encountered

In function af::array::array_proxy& af::array::array_proxy::operator=(const af::array&)
In file src/api/cpp/array.cpp:470" thrown in the test body.
[  FAILED  ] CriterionTest.ViterbiPath (0 ms)
[ RUN      ] CriterionTest.FCCCost
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function cuda::Array<T>::Array(af::dim4, const T*, bool, bool) [with T = float]
In file src/backend/cuda/Array.cpp:74
CUDA Error (77): an illegal memory access was encountered

In function void {anonymous}::initDataArray(void**, const void*, af::dtype, af::source, dim_t, dim_t, dim_t, dim_t)
In file src/api/cpp/array.cpp:103" thrown in the test body.
[  FAILED  ] CriterionTest.FCCCost (0 ms)
[ RUN      ] CriterionTest.FCCJacobian
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void cuda::evalNodes(std::vector<cuda::Param<T> >&, std::vector<common::Node*>) [with T = int]
In file src/backend/cuda/jit.cpp:329
CU Error CUDA_ERROR_ILLEGAL_ADDRESS(700): an illegal memory access was encountered

In function T* af::array::device() const [with T = void]
In file src/api/cpp/array.cpp:941" thrown in the test body.
[  FAILED  ] CriterionTest.FCCJacobian (0 ms)
[ RUN      ] CriterionTest.FACCost
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function cuda::Array<T>::Array(af::dim4, const T*, bool, bool) [with T = float]
In file src/backend/cuda/Array.cpp:74
CUDA Error (77): an illegal memory access was encountered

In function void {anonymous}::initDataArray(void**, const void*, af::dtype, af::source, dim_t, dim_t, dim_t, dim_t)
In file src/api/cpp/array.cpp:103" thrown in the test body.
[  FAILED  ] CriterionTest.FACCost (0 ms)
[ RUN      ] CriterionTest.FACJacobian
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function cuda::Array<T>::Array(af::dim4, const T*, bool, bool) [with T = int]
In file src/backend/cuda/Array.cpp:74
CUDA Error (77): an illegal memory access was encountered

In function void {anonymous}::initDataArray(void**, const void*, af::dtype, af::source, dim_t, dim_t, dim_t, dim_t)
In file src/api/cpp/array.cpp:103" thrown in the test body.
[  FAILED  ] CriterionTest.FACJacobian (1 ms)
[ RUN      ] CriterionTest.ASGCost
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void cuda::kernel::identity(cuda::Param<T>) [with T = float]
In file src/backend/cuda/kernel/identity.hpp:58
CUDA Error (77): an illegal memory access was encountered

In function af::array af::identity(const af::dim4&, af::dtype)
In file src/api/cpp/data.cpp:152" thrown in the test body.
[  FAILED  ] CriterionTest.ASGCost (0 ms)
[ RUN      ] CriterionTest.ASGJacobian
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function cuda::Array<T>::Array(af::dim4, const T*, bool, bool) [with T = int]
In file src/backend/cuda/Array.cpp:74
CUDA Error (77): an illegal memory access was encountered

In function void {anonymous}::initDataArray(void**, const void*, af::dtype, af::source, dim_t, dim_t, dim_t, dim_t)
In file src/api/cpp/array.cpp:103" thrown in the test body.
[  FAILED  ] CriterionTest.ASGJacobian (0 ms)
[ RUN      ] CriterionTest.LinSegJacobian
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function cuda::Array<T>::Array(af::dim4, const T*, bool, bool) [with T = int]
In file src/backend/cuda/Array.cpp:74
CUDA Error (77): an illegal memory access was encountered

In function void {anonymous}::initDataArray(void**, const void*, af::dtype, af::source, dim_t, dim_t, dim_t, dim_t)
In file src/api/cpp/array.cpp:103" thrown in the test body.
[  FAILED  ] CriterionTest.LinSegJacobian (0 ms)
[ RUN      ] CriterionTest.ASGBatching
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void* cuda::MemoryManager::nativeAlloc(size_t)
In file src/backend/cuda/memory.cpp:149
CUDA Error (77): an illegal memory access was encountered

In function af::array af::randu(const af::dim4&, af::dtype)
In file src/api/cpp/random.cpp:78" thrown in the test body.
[  FAILED  ] CriterionTest.ASGBatching (0 ms)
[ RUN      ] CriterionTest.ASGCompareLua
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void cuda::kernel::identity(cuda::Param<T>) [with T = float]
In file src/backend/cuda/kernel/identity.hpp:58
CUDA Error (77): an illegal memory access was encountered

In function af::array af::identity(const af::dim4&, af::dtype)
In file src/api/cpp/data.cpp:152" thrown in the test body.
[  FAILED  ] CriterionTest.ASGCompareLua (0 ms)
[ RUN      ] CriterionTest.LinSegCompareLua
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void cuda::kernel::identity(cuda::Param<T>) [with T = float]
In file src/backend/cuda/kernel/identity.hpp:58
CUDA Error (77): an illegal memory access was encountered

In function af::array af::identity(const af::dim4&, af::dtype)
In file src/api/cpp/data.cpp:152" thrown in the test body.
[  FAILED  ] CriterionTest.LinSegCompareLua (0 ms)
[ RUN      ] CriterionTest.AsgSerialization
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void* cuda::MemoryManager::nativeAlloc(size_t)
In file src/backend/cuda/memory.cpp:149
CUDA Error (77): an illegal memory access was encountered

In function af::array af::identity(const af::dim4&, af::dtype)
In file src/api/cpp/data.cpp:152" thrown in the test body.
[  FAILED  ] CriterionTest.AsgSerialization (0 ms)
[----------] 16 tests from CriterionTest (814 ms total)

[----------] Global test environment tear-down
[==========] 16 tests from 1 test case ran. (814 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 16 tests, listed below:
[  FAILED  ] CriterionTest.CTCCost
[  FAILED  ] CriterionTest.CTCJacobian
[  FAILED  ] CriterionTest.Batching
[  FAILED  ] CriterionTest.CTCCompareTensorflow
[  FAILED  ] CriterionTest.ViterbiPath
[  FAILED  ] CriterionTest.FCCCost
[  FAILED  ] CriterionTest.FCCJacobian
[  FAILED  ] CriterionTest.FACCost
[  FAILED  ] CriterionTest.FACJacobian
[  FAILED  ] CriterionTest.ASGCost
[  FAILED  ] CriterionTest.ASGJacobian
[  FAILED  ] CriterionTest.LinSegJacobian
[  FAILED  ] CriterionTest.ASGBatching
[  FAILED  ] CriterionTest.ASGCompareLua
[  FAILED  ] CriterionTest.LinSegCompareLua
[  FAILED  ] CriterionTest.AsgSerialization

16 FAILED TESTS

jacobkahn commented 5 years ago

@OUC-lan just to eliminate some things, can you try building and running the warpctc tests independently of wav2letter and check that everything works?

Can you also confirm your CUDA version, CUDA driver version, and GPU model/type?

cc @jcai1
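
The version details requested above can usually be gathered with the standard NVIDIA tools. This is a minimal sketch, assuming nvidia-smi and nvcc are on the PATH inside the container; report_cuda_env is a hypothetical helper name.

```shell
#!/bin/sh
# report_cuda_env is a hypothetical helper that prints the GPU model,
# driver version, and CUDA toolkit version if the tools are available.
report_cuda_env() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # One CSV line per GPU: model name, driver version.
    echo "GPU, driver: $(nvidia-smi --query-gpu=name,driver_version --format=csv,noheader)"
  else
    echo "GPU, driver: nvidia-smi not found"
  fi
  if command -v nvcc >/dev/null 2>&1; then
    # The "release" line of nvcc's banner carries the toolkit version.
    echo "CUDA toolkit: $(nvcc --version | grep release)"
  else
    echo "CUDA toolkit: nvcc not found"
  fi
}

report_cuda_env
```

Pasting the two output lines into the issue gives the GPU model, driver version, and CUDA toolkit version in one place.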

tlikhomanenko commented 3 years ago

Closing due to inactivity; the issue is too old.