flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Training with continue and fork mode terminated due to unhandled system error #1017

Open drremo1 opened 1 year ago

drremo1 commented 1 year ago

Hello, I recently installed wav2letter v0.2 on Ubuntu 18.04. I am now trying to continue training from the pretrained dev-clean transformer models of the sota/2019 recipe for just 1 epoch. However, training never starts: it terminates immediately with these errors:

I0401 05:59:19.868680 17296 Train.cpp:80] Parsing command line flags
I0401 05:59:19.868815 17296 Train.cpp:81] Overriding flags should be mutable when using `continue`
I0401 05:59:19.868882 17296 Train.cpp:85] Reading flags from file /mnt/d/198/train.cfg
terminate called after throwing an instance of 'std::runtime_error'
  what():  unhandled system error
*** Aborted at 1680299961 (unix time) try "date -d @1680299961" if you are using GNU date ***
PC: @     0x7f5e92f1ce87 gsignal
*** SIGABRT (@0x3e800004390) received by PID 17296 (TID 0x7f5ec06ac380) from PID 17296; stack trace: ***
    @     0x7f5ebf583980 (unknown)
    @     0x7f5e92f1ce87 gsignal
    @     0x7f5e92f1e7f1 abort
    @     0x7f5e93911957 (unknown)
    @     0x7f5e93917ae6 (unknown)
    @     0x7f5e93917b21 std::terminate()
    @     0x7f5e93917d54 __cxa_throw
    @     0x55cf42b5c6f8 fl::detail::ncclCheck()
    @     0x55cf42b5ddd7 fl::distributedInit()
    @     0x55cf42acb387 w2l::initDistributed()
    @     0x55cf4283eab2 main
    @     0x7f5e92effc87 __libc_start_main
    @     0x55cf428a7e4a _start
Aborted

This also happens when I try it with fork.

The error was produced by running this command:

wav2letter/build/Train continue /mnt/d/198 --flagsfile /mnt/d/198/train.cfg --logtostderr=1 --minloglevel=0 --rndv_filepath=

At first I thought the flagsfile was the problem, but removing it from the command line gives the same error.
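
Since the stack trace ends in fl::detail::ncclCheck(), called from fl::distributedInit(), the exception appears to come from NCCL initialization rather than from flag parsing. One way to get more detail might be to rerun the same command with NCCL's own debug logging turned on. The following is only a sketch under that assumption: NCCL_DEBUG is an NCCL environment variable, not a wav2letter flag, and the TRAIN_BIN and RUN_DIR variable names are introduced here purely for illustration.

```shell
# Same binary and run directory as in the command above.
TRAIN_BIN=wav2letter/build/Train
RUN_DIR=/mnt/d/198

# Ask NCCL to print its initialization details, including the failing
# system call, next to the glog output.
export NCCL_DEBUG=INFO

if [ -x "$TRAIN_BIN" ]; then
  # Unchanged command line; only the environment differs.
  "$TRAIN_BIN" continue "$RUN_DIR" --flagsfile "$RUN_DIR/train.cfg" \
    --logtostderr=1 --minloglevel=0 --rndv_filepath=
else
  echo "Train binary not found at $TRAIN_BIN" >&2
fi
```

With this set, any NCCL lines printed just before the "unhandled system error" message should indicate which system call failed.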