Closed dambaquyen96 closed 5 years ago
@dambaquyen96, which criterion backend are you using for CTC?
Are you able to restart your existing model in continue mode, and if so, does it fail in the same place?
@jacobkahn I'm training on GPU. This happens in the first epoch every time I train (I can't use continue mode because no checkpoint has been saved yet). I worked around the error by adding a try/catch in the training loop to skip every batch that raises an error. I found that a few specific audio files, about 10, trigger the problem, so there may be something wrong with my data.
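A minimal sketch of the skip-on-error idea described above. This is not the actual wav2letter `Train.cpp` code: `Batch`, `runEpochSkippingBadBatches`, and the `step` callback are hypothetical names used only for illustration, and `step` stands in for whatever the real loop does per batch (forward pass, criterion, backward pass, optimizer update).

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for one training batch; not a wav2letter type.
struct Batch {
  int id;
};

// Runs `step` on every batch. A batch whose step throws is logged and skipped
// instead of aborting the whole epoch. Returns the ids of the skipped batches
// so the offending audio files can be inspected afterwards.
std::vector<int> runEpochSkippingBadBatches(
    const std::vector<Batch>& batches,
    const std::function<void(const Batch&)>& step) {
  // An empty callback would throw on every call and silently skip everything.
  assert(step && "step callback must be set");
  std::vector<int> skipped;
  for (const Batch& batch : batches) {
    try {
      step(batch);  // assumed: forward + criterion + backward + update
    } catch (const std::exception& e) {
      std::cerr << "Skipping batch " << batch.id << ": " << e.what() << '\n';
      skipped.push_back(batch.id);
    }
  }
  return skipped;
}
```

Skipping only hides the symptom, so the returned ids are mainly useful for tracking down the bad files. If a step throws partway through, any partially accumulated gradients should probably be cleared before the next batch; that detail is left out of this sketch.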
@dambaquyen96,
I am trying to train my model and I am stuck on the same error. Could you please elaborate on how you solved this problem? What did the try/catch block contain? Below is my console log:
/media/home/megha/5_wav2letter/WAV_2_LETTER/wav2letter/build/Train train --flagsfile /media/home/megha/5_wav2letter/WAV_2_LETTER/wav2letter/tutorials/1-librispeech_clean/train.cfg
I0424 16:46:03.113786 14218 Train.cpp:139] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --am=; --arch=network.arch; --archdir=/media/home/megha/5_wav2letter/WAV_2_LETTER/wav2letter/tutorials/1-librispeech_clean/; --attention=content; --attnWindow=no; --batchsize=4; --beamscore=25; --beamsize=2500; --channels=1; --criterion=ctc; --datadir=/media/home/megha/5_wav2letter/WAV_2_LETTER/; --dataorder=input; --devwin=0; --emission_dir=; --enable_distributed=false; --encoderdim=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=40; --flagsfile=/media/home/megha/5_wav2letter/WAV_2_LETTER/wav2letter/tutorials/1-librispeech_clean/train.cfg; --forceendsil=false; --gamma=0.20000000000000001; --garbage=false; --input=flac; --inputbinsize=100; --inputfeeding=false; --iter=100; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lmtype=kenlm; --lmweight=1; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.050000000000000003; --lrcrit=0.0060000000000000001; --maxdecoderoutputlen=200; --maxgradnorm=1; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0.80000000000000004; --noresample=false; --nthread=4; --nthread_decoder=1; --onorm=target; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=100; --pow=false; --replabel=2; --reportiters=0; --rightWindowSize=50; --rndv_filepath=; --rundir=/media/home/megha/5_wav2letter/WAV_2_LETTER/; --runname=deutsche_Combined_clean_trainlogs; --samplerate=16000; --samplingstrategy=rand; --sclite=; --seed=0; --show=false; --showletters=false; --silweight=0; --skipoov=false; --smearing=none; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1; --surround=|; --tag=; --target=tkn; --targettype=video; --test=; --tokens=wav2letter/tutorials/output/data/tokens.txt; --tokensdir=/media/home/megha/5_wav2letter/WAV_2_LETTER/; --train=wav2letter/tutorials/output/data/train-clean-100; --trainWithWindow=false; --transdiag=0; --unkweight=-inf; --valid=wav2letter/tutorials/output/data/dev-clean; --weightdecay=0; --wordscore=1; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0424 16:46:03.113804 14218 Train.cpp:141] Experiment path: /media/home/megha/5_wav2letter/WAV_2_LETTER/deutsche_Combined_clean_trainlogs
I0424 16:46:03.113821 14218 Train.cpp:142] Experiment runidx: 1
I0424 16:46:03.114193 14218 Train.cpp:160] Number of classes (network) = 37
I0424 16:46:03.114209 14218 Train.cpp:171] Loading architecture file from /media/home/megha/5_wav2letter/WAV_2_LETTER/wav2letter/tutorials/1-librispeech_clean/network.arch
I0424 16:46:03.798933 14218 Train.cpp:191] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> output]
    (0): View (-1 1 40 0)
    (1): Conv2D (40->256, 8x1, 2,1, SAME,SAME) (with bias)
    (2): ReLU
    (3): Conv2D (256->256, 8x1, 1,1, SAME,SAME) (with bias)
    (4): ReLU
    (5): Conv2D (256->256, 8x1, 1,1, SAME,SAME) (with bias)
    (6): ReLU
    (7): Conv2D (256->256, 8x1, 1,1, SAME,SAME) (with bias)
    (8): ReLU
    (9): Conv2D (256->256, 8x1, 1,1, SAME,SAME) (with bias)
    (10): ReLU
    (11): Conv2D (256->256, 8x1, 1,1, SAME,SAME) (with bias)
    (12): ReLU
    (13): Conv2D (256->256, 8x1, 1,1, SAME,SAME) (with bias)
    (14): ReLU
    (15): Conv2D (256->256, 8x1, 1,1, SAME,SAME) (with bias)
    (16): ReLU
    (17): Reorder (2,0,3,1)
    (18): Linear (256->512) (with bias)
    (19): ReLU
    (20): Linear (512->37) (with bias)
I0424 16:46:03.798955 14218 Train.cpp:192] [Network Params: 3904549]
I0424 16:46:03.798974 14218 Train.cpp:193] [Criterion] ConnectionistTemporalClassificationCriterion
I0424 16:46:03.799340 14218 NumberedFilesLoader.cpp:29] Adding dataset /media/home/megha/5_wav2letter/WAV_2_LETTER/wav2letter/tutorials/output/data/train-clean-100 ...
I0424 16:46:03.799427 14218 NumberedFilesLoader.cpp:68] 2731 files found.
I0424 16:46:03.823515 14218 Utils.cpp:102] Filtered 0/2731 samples
I0424 16:46:03.823717 14218 W2lNumberedFilesDataset.cpp:57] Total batches (i.e. iters): 683
I0424 16:46:03.823861 14218 NumberedFilesLoader.cpp:29] Adding dataset /media/home/megha/5_wav2letter/WAV_2_LETTER/wav2letter/tutorials/output/data/dev-clean ...
I0424 16:46:03.823957 14218 NumberedFilesLoader.cpp:68] 960 files found.
I0424 16:46:03.831384 14218 Utils.cpp:102] Filtered 0/960 samples
I0424 16:46:03.831449 14218 W2lNumberedFilesDataset.cpp:57] Total batches (i.e. iters): 240
I0424 16:46:03.928385 14218 Train.cpp:441] Shuffling trainset
I0424 16:46:03.928495 14218 Train.cpp:448] Epoch 1 started!
terminate called after throwing an instance of 'std::runtime_error'
  what(): Error: compute_ctc_loss, stat = unknown error
Aborted at 1556117170 (unix time) try "date -d @1556117170" if you are using GNU date
PC: @ 0x7f3a65af2428 gsignal
SIGABRT (@0x3e80000378a) received by PID 14218 (TID 0x7f3ae544c800) from PID 14218; stack trace:
    @ 0x7f3a65af24b0 (unknown)
    @ 0x7f3a65af2428 gsignal
    @ 0x7f3a65af402a abort
    @ 0x7f3a6665784d __gnu_cxx::__verbose_terminate_handler()
    @ 0x7f3a666556b6 (unknown)
    @ 0x7f3a66655701 std::terminate()
    @ 0x7f3a66655919 __cxa_throw
    @ 0x525f5b w2l::(anonymous namespace)::throw_on_error()
    @ 0x526d01 w2l::ConnectionistTemporalClassificationCriterion::forward()
    @ 0x45fa3c _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEERNS0_19FirstOrderOptimizerES9_biE4_clES2_S5_S7_S9_S9_bi.constprop.8756
    @ 0x418f72 main
    @ 0x7f3a65add830 __libc_start_main
    @ 0x45bb19 _start
Aborted
My train config file contains the following settings:
# Training config for Mini Librispeech
# Replace `[...]` with appropriate paths
@jacobkahn could you kindly point out what I have done wrong? My GPU info is below:
:~/Desktop$ glxinfo -B
name of display: :0
display: :0 screen: 0
direct rendering: Yes
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: GeForce GTX 1060 6GB/PCIe/SSE2
OpenGL core profile version string: 4.5.0 NVIDIA 410.79
OpenGL core profile shading language version string: 4.50 NVIDIA
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL version string: 4.6.0 NVIDIA 410.79
OpenGL shading language version string: 4.60 NVIDIA
OpenGL context flags: (none)
OpenGL profile mask: (none)
OpenGL ES profile version string: OpenGL ES 3.2 NVIDIA 410.79
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
I'm training on my own dataset, and it ran perfectly for 400 hours (my dataset's size is about 1000 hours). Suddenly, it terminated with this error. Here is the train.cfg as well: Does anyone know what kind of error this is? I've debugged and re-run several times, but it fails with the same error every time.
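One data-related cause worth ruling out, offered as an assumption and not a confirmed diagnosis for the runs in this thread: CTC has no valid alignment when the network emits fewer output frames than the target needs. A target needs one frame per token plus one blank frame between each pair of identical adjacent tokens. The sketch below checks that condition. `minFramesForCtc` and `isCtcFeasible` are hypothetical helper names, not wav2letter functions; `numFrames` is assumed to be the frame count after the network's striding, and `target` the token ids actually passed to the criterion.

```cpp
#include <cstddef>
#include <vector>

// Minimum number of output frames CTC needs to emit `target`: one frame per
// token, plus one blank frame between each pair of identical adjacent tokens.
std::size_t minFramesForCtc(const std::vector<int>& target) {
  std::size_t needed = target.size();
  for (std::size_t i = 1; i < target.size(); ++i) {
    if (target[i] == target[i - 1]) {
      ++needed;  // a blank must separate repeated tokens
    }
  }
  return needed;
}

// Returns true if `numFrames` output frames are enough to align `target`.
// Assumption: an empty transcript is also flagged, since it usually points to
// a broken sample even though CTC is mathematically defined for it.
bool isCtcFeasible(std::size_t numFrames, const std::vector<int>& target) {
  if (target.empty()) {
    return false;
  }
  return numFrames >= minFramesForCtc(target);
}
```

Running a check like this over the whole dataset before training would list suspicious samples up front, which may be easier than waiting for the criterion to fail mid-epoch.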