Closed — dambaquyen96 closed this issue 5 years ago
@dambaquyen96 — which criterion backend are you using for CTC?
Are you able to restart your existing model in continue mode, and does it fail in the same place?
@jacobkahn I'm training on GPU. This happened in the first epoch, every time I trained (not in continue mode, since no checkpoint had been saved yet). I worked around the error by adding a try/catch in the training loop to skip any batch that throws. I found that about 10 specific audio files consistently triggered the problem, so there may be something wrong with my data.
@dambaquyen96 ,
I am trying to train my model and am getting stuck with the same error. Can you please elaborate on how you solved this problem? What did the try/catch block contain? Below is my console log:
/media/home/megha/5_wav2letter/WAV_2_LETTER/wav2letter/build/Train train --flagsfile /media/home/megha/5_wav2letter/WAV_2_LETTER/wav2letter/tutorials/1-librispeech_clean/train.cfg

I0424 16:46:03.113786 14218 Train.cpp:139] Gflags after parsing --flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --am=; --arch=network.arch; --archdir=/media/home/megha/5_wav2letter/WAV_2_LETTER/wav2letter/tutorials/1-librispeech_clean/; --attention=content; --attnWindow=no; --batchsize=4; --beamscore=25; --beamsize=2500; --channels=1; --criterion=ctc; --datadir=/media/home/megha/5_wav2letter/WAV_2_LETTER/; --dataorder=input; --devwin=0; --emission_dir=; --enable_distributed=false; --encoderdim=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=40; --flagsfile=/media/home/megha/5_wav2letter/WAV_2_LETTER/wav2letter/tutorials/1-librispeech_clean/train.cfg; --forceendsil=false; --gamma=0.20000000000000001; --garbage=false; --input=flac; --inputbinsize=100; --inputfeeding=false; --iter=100; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lmtype=kenlm; --lmweight=1; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.050000000000000003; --lrcrit=0.0060000000000000001; --maxdecoderoutputlen=200; --maxgradnorm=1; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0.80000000000000004; --noresample=false; --nthread=4; --nthread_decoder=1; --onorm=target; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=100; --pow=false; --replabel=2; --reportiters=0; --rightWindowSize=50; --rndv_filepath=; --rundir=/media/home/megha/5_wav2letter/WAV_2_LETTER/; --runname=deutsche_Combined_clean_trainlogs; --samplerate=16000; --samplingstrategy=rand; --sclite=; --seed=0; --show=false; --showletters=false; --silweight=0; --skipoov=false; --smearing=none; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1; --surround=|; --tag=; --target=tkn; --targettype=video; --test=; --tokens=wav2letter/tutorials/output/data/tokens.txt; --tokensdir=/media/home/megha/5_wav2letter/WAV_2_LETTER/; --train=wav2letter/tutorials/output/data/train-clean-100; --trainWithWindow=false; --transdiag=0; --unkweight=-inf; --valid=wav2letter/tutorials/output/data/dev-clean; --weightdecay=0; --wordscore=1; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0424 16:46:03.113804 14218 Train.cpp:141] Experiment path: /media/home/megha/5_wav2letter/WAV_2_LETTER/deutsche_Combined_clean_trainlogs
I0424 16:46:03.113821 14218 Train.cpp:142] Experiment runidx: 1
I0424 16:46:03.114193 14218 Train.cpp:160] Number of classes (network) = 37
I0424 16:46:03.114209 14218 Train.cpp:171] Loading architecture file from /media/home/megha/5_wav2letter/WAV_2_LETTER/wav2letter/tutorials/1-librispeech_clean/network.arch
I0424 16:46:03.798933 14218 Train.cpp:191] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> output]
	(0): View (-1 1 40 0)
	(1): Conv2D (40->256, 8x1, 2,1, SAME,SAME) (with bias)
	(2): ReLU
	(3): Conv2D (256->256, 8x1, 1,1, SAME,SAME) (with bias)
	(4): ReLU
	(5): Conv2D (256->256, 8x1, 1,1, SAME,SAME) (with bias)
	(6): ReLU
	(7): Conv2D (256->256, 8x1, 1,1, SAME,SAME) (with bias)
	(8): ReLU
	(9): Conv2D (256->256, 8x1, 1,1, SAME,SAME) (with bias)
	(10): ReLU
	(11): Conv2D (256->256, 8x1, 1,1, SAME,SAME) (with bias)
	(12): ReLU
	(13): Conv2D (256->256, 8x1, 1,1, SAME,SAME) (with bias)
	(14): ReLU
	(15): Conv2D (256->256, 8x1, 1,1, SAME,SAME) (with bias)
	(16): ReLU
	(17): Reorder (2,0,3,1)
	(18): Linear (256->512) (with bias)
	(19): ReLU
	(20): Linear (512->37) (with bias)
I0424 16:46:03.798955 14218 Train.cpp:192] [Network Params: 3904549]
I0424 16:46:03.798974 14218 Train.cpp:193] [Criterion] ConnectionistTemporalClassificationCriterion
I0424 16:46:03.799340 14218 NumberedFilesLoader.cpp:29] Adding dataset /media/home/megha/5_wav2letter/WAV_2_LETTER/wav2letter/tutorials/output/data/train-clean-100 ...
I0424 16:46:03.799427 14218 NumberedFilesLoader.cpp:68] 2731 files found.
I0424 16:46:03.823515 14218 Utils.cpp:102] Filtered 0/2731 samples
I0424 16:46:03.823717 14218 W2lNumberedFilesDataset.cpp:57] Total batches (i.e. iters): 683
I0424 16:46:03.823861 14218 NumberedFilesLoader.cpp:29] Adding dataset /media/home/megha/5_wav2letter/WAV_2_LETTER/wav2letter/tutorials/output/data/dev-clean ...
I0424 16:46:03.823957 14218 NumberedFilesLoader.cpp:68] 960 files found.
I0424 16:46:03.831384 14218 Utils.cpp:102] Filtered 0/960 samples
I0424 16:46:03.831449 14218 W2lNumberedFilesDataset.cpp:57] Total batches (i.e. iters): 240
I0424 16:46:03.928385 14218 Train.cpp:441] Shuffling trainset
I0424 16:46:03.928495 14218 Train.cpp:448] Epoch 1 started!
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error: compute_ctc_loss, stat = unknown error
Aborted at 1556117170 (unix time) try "date -d @1556117170" if you are using GNU date
PC: @ 0x7f3a65af2428 gsignal
SIGABRT (@0x3e80000378a) received by PID 14218 (TID 0x7f3ae544c800) from PID 14218; stack trace:
    @ 0x7f3a65af24b0 (unknown)
    @ 0x7f3a65af2428 gsignal
    @ 0x7f3a65af402a abort
    @ 0x7f3a6665784d __gnu_cxx::verbose_terminate_handler()
    @ 0x7f3a666556b6 (unknown)
    @ 0x7f3a66655701 std::terminate()
    @ 0x7f3a66655919 __cxa_throw
    @ 0x525f5b w2l::(anonymous namespace)::throw_on_error()
    @ 0x526d01 w2l::ConnectionistTemporalClassificationCriterion::forward()
    @ 0x45fa3c _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEERNS0_19FirstOrderOptimizerES9_biE4_clES2_S5_S7_S9_S9_bi.constprop.8756
    @ 0x418f72 main
    @ 0x7f3a65add830 __libc_start_main
    @ 0x45bb19 _start
Aborted
My train config file contains the settings below:
# Training config for Mini Librispeech
# Replace `[...]` with appropriate paths
--datadir=/media/home/megha/5_wav2letter/WAV_2_LETTER/
--tokensdir=/media/home/megha/5_wav2letter/WAV_2_LETTER/
--rundir=/media/home/megha/5_wav2letter/WAV_2_LETTER/
--archdir=/media/home/megha/5_wav2letter/WAV_2_LETTER/wav2letter/tutorials/1-librispeech_clean/
--train=wav2letter/tutorials/output/data/train-clean-100
--valid=wav2letter/tutorials/output/data/dev-clean
--input=flac
--arch=network.arch
--tokens=wav2letter/tutorials/output/data/tokens.txt
--criterion=ctc
--lr=0.05
--lrcrit=0.006
--gamma=0.2
--momentum=0.8
--stepsize=1
--maxgradnorm=1.0
--replabel=2
--surround=|
--onorm=target
--sqnorm=true
--mfsc=true
--filterbanks=40
--nthread=4
--batchsize=4
--runname=deutsche_Combined_clean_trainlogs
--iter=100
--logtostderr=1
@jacobkahn could you kindly point out the mistake I have made? My GPU info is below:
:~/Desktop$ glxinfo -B
name of display: :0
display: :0  screen: 0
direct rendering: Yes
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: GeForce GTX 1060 6GB/PCIe/SSE2
OpenGL core profile version string: 4.5.0 NVIDIA 410.79
OpenGL core profile shading language version string: 4.50 NVIDIA
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL version string: 4.6.0 NVIDIA 410.79
OpenGL shading language version string: 4.60 NVIDIA
OpenGL context flags: (none)
OpenGL profile mask: (none)
OpenGL ES profile version string: OpenGL ES 3.2 NVIDIA 410.79
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
I'm training on my own dataset (about 1000 hours in size), and it ran perfectly for about 400 hours. Then it suddenly terminated with this error. Here is the train.cfg as well. Does anyone know what kind of error this is? I've been debugging and re-running several times, but it hits the same error every time.