I have the same error message!
I was able to train a model when following the tutorial. However, now I am trying the recipe and that one fails. Training on a different dataset though.
From the logs, I can see that the loss explodes, so I assume the gradients explode. Is this a software issue, or are the parameters not appropriate? The dataset I use also contains 1000 hours of speech. What could have caused this?
epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 03:02:55 | bch(ms): 878.05 | smp(ms): 0.33 | fwd(ms): 348.53 | crit-fwd(ms): 9.69 | bwd(ms): 479.12 | optim(ms): 48.19 | loss: 13.72964 | train-TER: 98.41 | data/dev-TER: 97.69 | avg-isz: 206 | avg-tsz: 049 | max-tsz: 103 | hrs: 28.70 | thrpt(sec/sec): 9.41
epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 03:02:49 | bch(ms): 877.52 | smp(ms): 0.32 | fwd(ms): 348.12 | crit-fwd(ms): 9.38 | bwd(ms): 479.28 | optim(ms): 48.04 | loss: 15.68292 | train-TER: 98.87 | data/dev-TER: 97.71 | avg-isz: 206 | avg-tsz: 049 | max-tsz: 103 | hrs: 28.70 | thrpt(sec/sec): 9.42
epoch: 2 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 03:02:48 | bch(ms): 877.44 | smp(ms): 0.31 | fwd(ms): 348.05 | crit-fwd(ms): 9.36 | bwd(ms): 479.31 | optim(ms): 48.04 | loss: 15.56029 | train-TER: 99.22 | data/dev-TER: 100.00 | avg-isz: 206 | avg-tsz: 049 | max-tsz: 103 | hrs: 28.70 | thrpt(sec/sec): 9.42
epoch: 3 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 03:02:47 | bch(ms): 877.40 | smp(ms): 0.32 | fwd(ms): 348.08 | crit-fwd(ms): 9.36 | bwd(ms): 479.21 | optim(ms): 48.06 | loss: 15.47230 | train-TER: 99.45 | data/dev-TER: 100.00 | avg-isz: 206 | avg-tsz: 049 | max-tsz: 103 | hrs: 28.70 | thrpt(sec/sec): 9.42
epoch: 4 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 03:02:20 | bch(ms): 875.27 | smp(ms): 0.33 | fwd(ms): 347.21 | crit-fwd(ms): 8.73 | bwd(ms): 477.84 | optim(ms): 48.13 | loss: 6489437.52579 | train-TER: 126.84 | data/dev-TER: 100.00 | avg-isz: 206 | avg-tsz: 049 | max-tsz: 103 | hrs: 28.70 | thrpt(sec/sec): 9.44
epoch: 5 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 03:01:35 | bch(ms): 871.63 | smp(ms): 0.32 | fwd(ms): 346.02 | crit-fwd(ms): 8.11 | bwd(ms): 475.41 | optim(ms): 48.09 | loss: 1178735871332269312.00000 | train-TER: 145.57 | data/dev-TER: 79.49 | avg-isz: 206 | avg-tsz: 049 | max-tsz: 103 | hrs: 28.70 | thrpt(sec/sec): 9.48
@misbullah — can you give more information about your OS, compiler, CUDA/cuDNN version, and provide your flagsfile so I can try to reproduce?
@FredericGodin — your logs certainly seem to indicate gradient explosion. Your loss starts out at a reasonable value consistent with what we've seen. I'd try tweaking parameters: modifying maxgradnorm, or lowering your learning rate.
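For example, in the recipe's train.cfg (illustrative values only, relative to the recipe defaults of lr=0.6 and maxgradnorm=0.2 visible in the logs in this thread; not a tuned recommendation):

--maxgradnorm=0.1  # clip the gradient norm harder than the recipe's 0.2
--lr=0.3           # half of the recipe's 0.6; smaller steps make divergence less likely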
@jacobkahn
OS: Ubuntu 16.04
Compiler: GCC 5.4.0
CUDA/cuDNN: 9.2 / 7.2.1.38-1
I built all the dependency packages standalone and installed them system-wide with make install, so I don't need to pass extra paths to cmake.
CMake command:
cmake .. -DArrayFire_DIR=/home/asus/toolkit/arrayfire/share/ArrayFire/cmake -DINTEL_MKL_DIR=/opt/intel/mkl -DCMAKE_BUILD_TYPE=Release -DCRITERION_BACKEND=CUDA
Output:

-- GTest found (library: /usr/local/lib/libgtest.a include: /usr/local/include)
-- ArrayFire found (include: /home/asus/toolkit/arrayfire/include, library: ArrayFire::afcuda)
-- Found glog (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libglog.so)
-- GLOG found
-- Found gflags (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libgflags.so)
-- GFLAGS found
-- OpenMP found
-- flashlight found (include: lib: flashlight::flashlight)
-- flashlight built in distributed mode.
-- Checking for [mkl_gf_lp64 - mkl_gnu_thread - mkl_core - iomp5 - pthread - m]
-- Library mkl_gf_lp64: /opt/intel/mkl/lib/intel64/libmkl_gf_lp64.so
-- Library mkl_gnu_thread: /opt/intel/mkl/lib/intel64/libmkl_gnu_thread.so
-- Library mkl_core: /opt/intel/mkl/lib/intel64/libmkl_core.so
-- Library iomp5: not found
-- Checking for [mkl_gf_lp64 - mkl_intel_thread - mkl_core - iomp5 - pthread - m]
-- Library mkl_gf_lp64: /opt/intel/mkl/lib/intel64/libmkl_gf_lp64.so
-- Library mkl_intel_thread: /opt/intel/mkl/lib/intel64/libmkl_intel_thread.so
-- Library mkl_core: /opt/intel/mkl/lib/intel64/libmkl_core.so
-- Library iomp5: not found
-- Checking for [mkl_gf - mkl_gnu_thread - mkl_core - iomp5 - pthread - m]
-- Library mkl_gf: not found
-- Checking for [mkl_gf - mkl_intel_thread - mkl_core - iomp5 - pthread - m]
-- Library mkl_gf: not found
-- Checking for [mkl_intel_lp64 - mkl_gnu_thread - mkl_core - iomp5 - pthread - m]
-- Library mkl_intel_lp64: /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so
-- Library mkl_gnu_thread: /opt/intel/mkl/lib/intel64/libmkl_gnu_thread.so
-- Library mkl_core: /opt/intel/mkl/lib/intel64/libmkl_core.so
-- Library iomp5: not found
-- Checking for [mkl_intel_lp64 - mkl_intel_thread - mkl_core - iomp5 - pthread - m]
-- Library mkl_intel_lp64: /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so
-- Library mkl_intel_thread: /opt/intel/mkl/lib/intel64/libmkl_intel_thread.so
-- Library mkl_core: /opt/intel/mkl/lib/intel64/libmkl_core.so
-- Library iomp5: not found
-- Checking for [mkl_intel - mkl_gnu_thread - mkl_core - iomp5 - pthread - m]
-- Library mkl_intel: not found
-- Checking for [mkl_intel - mkl_intel_thread - mkl_core - iomp5 - pthread - m]
-- Library mkl_intel: not found
-- Checking for [mkl_gf_lp64 - mkl_gnu_thread - mkl_core - pthread - m]
-- Library mkl_gf_lp64: /opt/intel/mkl/lib/intel64/libmkl_gf_lp64.so
-- Library mkl_gnu_thread: /opt/intel/mkl/lib/intel64/libmkl_gnu_thread.so
-- Library mkl_core: /opt/intel/mkl/lib/intel64/libmkl_core.so
-- Library pthread: /usr/lib/x86_64-linux-gnu/libpthread.so
-- Library m: /usr/lib/x86_64-linux-gnu/libm.so
-- MKL library found
-- MKL found
-- FFTW found
-- Required SndFile dependency Ogg found.
-- Required SndFile dependency Vorbis found.
-- Required SndFile dependency VorbisEnc found.
-- Required SndFile dependency FLAC found.
-- Looking for KenLM
-- Using kenlm library found in /home/asus/toolkit/kenlm/build/lib/libkenlm.a
-- Using kenlm utils library found in /home/asus/toolkit/kenlm/build/lib/libkenlm.a
-- kenlm lm/model.hh found in /home/asus/toolkit/kenlm/lm/model.hh
-- Found kenlm (include: /home/asus/toolkit/kenlm, library: /home/asus/toolkit/kenlm/build/lib/libkenlm.a;/home/asus/toolkit/kenlm/build/lib/libkenlm_util.a)
-- kenlm found
-- LZMA found (library: /usr/lib/x86_64-linux-gnu/liblzma.so include: /usr/include)
-- BZip2 found (library: /usr/lib/x86_64-linux-gnu/libbz2.so include: /usr/include)
-- Z found (library: /usr/lib/x86_64-linux-gnu/libz.so include: /usr/include)
-- CUDA found (library: /usr/local/cuda/lib64/libcudart_static.a;-lpthread;dl;/usr/lib/x86_64-linux-gnu/librt.so include: /usr/local/cuda/include)
-- Adding warpctc:
-- warpctc: cuda found TRUE
-- warpctc: using CUDA 9.0 or above
-- warpctc: Building shared library with GPU support
-- Configuring done
WARNING: Target "Decoder" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "Test" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "Train" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "wav2letter++" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "DctTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "CeplifterTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "DitherTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "SpeechUtilsTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "W2lModuleTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "W2lCommonTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "SoundTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "WindowTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "DerivativesTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "TriFilterbankTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "DecoderTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "CriterionTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "RuntimeTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "Seq2SeqTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "DataTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "AttentionTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "MfccTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "PreEmphasisTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
WARNING: Target "WindowingTest" requests linking to directory "/home/asus/toolkit/kenlm/build/lib". Targets may link only to libraries. CMake is dropping the item.
-- Generating done
-- Build files have been written to: /home/asus/toolkit/wav2letter/build
Then I can build the binary by typing make -j 6.
Thanks.
Since you mention that it has been running for one day, maybe you have the same issue as I do? For me, the loss started exploding after a few epochs, but perhaps in your case you never reach the end of the first epoch?
@misbullah — can you try running training with glog's --logtostderr=1? Training might be running, but you might not be seeing output. It'll be useful to know if everything is running and failing after some time, or if training is hanging from the start.
@FredericGodin — could be the same problem.
@FredericGodin — the params in the LibriSpeech recipe were for training with 8 GPUs. The optimal parameters may change depending on the configuration and dataset.
(Assuming you have installed wav2letter correctly and all unit tests pass.) If you are seeing issues like "Loss has NaN values", I would play around with the gflags -lr, -lrcrit, and -maxgradnorm to find the best hyperparams.
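A throwaway sweep over learning rates might look like this (a sketch with hypothetical paths and filenames; it patches a copy of train.cfg with sed rather than passing --lr on the command line, since values read from the flagsfile may take precedence over command-line flags):

for lr in 0.1 0.3 0.6; do
  # write a per-run config with only the --lr line changed
  sed -e "s/^--lr=.*/--lr=$lr/" train.cfg > train_lr$lr.cfg
  build/Train train --flagsfile train_lr$lr.cfg --logtostderr=1 2>&1 | tee train_lr$lr.log
done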
@jacobkahn I ran the following command based on your suggestion: /home/asus/toolkit/wav2letter/build/Train train --flagsfile /home/asus/toolkit/wav2letter/recipes/librispeech/config/conv_glu/train.cfg --logtostderr=1
Output:

/home/asus/toolkit/wav2letter/build/Train train --flagsfile /home/asus/toolkit/wav2letter/recipes/librispeech/config/conv_glu/train.cfg --logtostderr=1
I0109 12:53:19.980782 396 Train.cpp:62] Reading flags from file /home/asus/toolkit/wav2letter/recipes/librispeech/config/conv_glu/train.cfg
I0109 12:53:20.002393 396 Train.cpp:138] Gflags after parsing --flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --am=; --arch=network.arch; --archdir=/home/asus/toolkit/wav2letter/recipes/librispeech/config/conv_glu/; --attention=content; --attnWindow=no; --batchsize=2; --beamscore=25; --beamsize=2500; --channels=1; --criterion=asg; --datadir=/home/asus/toolkit/wav2letter/recipes/librispeech/data/; --dataorder=input; --devwin=0; --emission_dir=; --enable_distributed=false; --encoderdim=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=40; --flagsfile=/home/asus/toolkit/wav2letter/recipes/librispeech/config/conv_glu/train.cfg; --forceendsil=false; --gamma=1; --garbage=false; --input=flac; --inputbinsize=100; --inputfeeding=false; --iter=1000000; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=; --linlr=-1; --linlrcrit=-1; --linseg=1; --lm=; --lmtype=kenlm; --lmweight=1; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.59999999999999998; --lrcrit=0.0060000000000000001; --maxdecoderoutputlen=200; --maxgradnorm=0.20000000000000001; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0.80000000000000004; --noresample=false; --nthread=2; --nthread_decoder=1; --onorm=target; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=100; --pow=false; --replabel=2; --reportiters=0; --rightWindowSize=50; --rndv_filepath=; --rundir=; --runname=; --samplerate=16000; --samplingstrategy=rand; --sclite=; --seed=0; --show=false; --showletters=false; --silweight=0; --skipoov=false; --smearing=none; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=|; --tag=; --target=tkn; --targettype=video; --test=; --tokens=tokens.txt; --tokensdir=/home/asus/toolkit/wav2letter/recipes/librispeech/data/; --train=train-clean-100,train-clean-360,train-other-500; --trainWithWindow=false; --transdiag=2; --unkweight=-inf; --valid=dev-clean,dev-other; --weightdecay=0; --wordscore=1; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0109 12:53:20.002418 396 Train.cpp:140] Experiment path: 2019-01-09_12-53-19_unknown_host_10651703682468779719
I0109 12:53:20.002424 396 Train.cpp:141] Experiment runidx: 1
I0109 12:53:20.002840 396 Train.cpp:159] Number of classes (network) = 30
I0109 12:53:20.002856 396 Train.cpp:170] Loading architecture file from /home/asus/toolkit/wav2letter/recipes/librispeech/config/conv_glu/network.arch
I0109 12:53:20.945035 396 Train.cpp:190] [Network]
Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> (40) -> (41) -> (42) -> (43) -> (44) -> (45) -> (46) -> (47) -> (48) -> (49) -> (50) -> (51) -> (52) -> (53) -> (54) -> (55) -> (56) -> output]
(0): View (-1 1 40 0)
(1): WeightNorm (Conv2D (40->400, 13x1, 1,1, 170,0) (with bias), 3)
(2): GatedLinearUnit (2)
(3): Dropout (0.200000)
(4): WeightNorm (Conv2D (200->440, 14x1, 1,1, 0,0) (with bias), 3)
(5): GatedLinearUnit (2)
(6): Dropout (0.214000)
(7): WeightNorm (Conv2D (220->484, 15x1, 1,1, 0,0) (with bias), 3)
(8): GatedLinearUnit (2)
(9): Dropout (0.228980)
(10): WeightNorm (Conv2D (242->532, 16x1, 1,1, 0,0) (with bias), 3)
(11): GatedLinearUnit (2)
(12): Dropout (0.245009)
(13): WeightNorm (Conv2D (266->584, 17x1, 1,1, 0,0) (with bias), 3)
(14): GatedLinearUnit (2)
(15): Dropout (0.262159)
(16): WeightNorm (Conv2D (292->642, 18x1, 1,1, 0,0) (with bias), 3)
(17): GatedLinearUnit (2)
(18): Dropout (0.280510)
(19): WeightNorm (Conv2D (321->706, 19x1, 1,1, 0,0) (with bias), 3)
(20): GatedLinearUnit (2)
(21): Dropout (0.300146)
(22): WeightNorm (Conv2D (353->776, 20x1, 1,1, 0,0) (with bias), 3)
(23): GatedLinearUnit (2)
(24): Dropout (0.321156)
(25): WeightNorm (Conv2D (388->852, 21x1, 1,1, 0,0) (with bias), 3)
(26): GatedLinearUnit (2)
(27): Dropout (0.343637)
(28): WeightNorm (Conv2D (426->936, 22x1, 1,1, 0,0) (with bias), 3)
(29): GatedLinearUnit (2)
(30): Dropout (0.367692)
(31): WeightNorm (Conv2D (468->1028, 23x1, 1,1, 0,0) (with bias), 3)
(32): GatedLinearUnit (2)
(33): Dropout (0.393430)
(34): WeightNorm (Conv2D (514->1130, 24x1, 1,1, 0,0) (with bias), 3)
(35): GatedLinearUnit (2)
(36): Dropout (0.420970)
(37): WeightNorm (Conv2D (565->1242, 25x1, 1,1, 0,0) (with bias), 3)
(38): GatedLinearUnit (2)
(39): Dropout (0.450438)
(40): WeightNorm (Conv2D (621->1366, 26x1, 1,1, 0,0) (with bias), 3)
(41): GatedLinearUnit (2)
(42): Dropout (0.481969)
(43): WeightNorm (Conv2D (683->1502, 27x1, 1,1, 0,0) (with bias), 3)
(44): GatedLinearUnit (2)
(45): Dropout (0.515707)
(46): WeightNorm (Conv2D (751->1652, 28x1, 1,1, 0,0) (with bias), 3)
(47): GatedLinearUnit (2)
(48): Dropout (0.551806)
(49): WeightNorm (Conv2D (826->1816, 29x1, 1,1, 0,0) (with bias), 3)
(50): GatedLinearUnit (2)
(51): Dropout (0.590433)
(52): Reorder (2,0,3,1)
(53): WeightNorm (Linear (908->1816) (with bias), 0)
(54): GatedLinearUnit (0)
(55): Dropout (0.590433)
(56): WeightNorm (Linear (908->30) (with bias), 0)
I0109 12:53:20.945087 396 Train.cpp:191] [Network Params: 208863942]
I0109 12:53:20.945107 396 Train.cpp:192] [Criterion] AutoSegmentationCriterion
I0109 12:53:20.945152 396 Train.cpp:201] [Criterion] LinearSegmentationCriterion (for first 1 epochs)
I0109 12:53:20.945451 396 NumberedFilesLoader.cpp:29] Adding dataset /home/asus/toolkit/wav2letter/recipes/librispeech/data/train-clean-100 ...
I0109 12:53:20.956066 396 NumberedFilesLoader.cpp:68] 28539 files found.
I0109 12:53:20.956076 396 NumberedFilesLoader.cpp:29] Adding dataset /home/asus/toolkit/wav2letter/recipes/librispeech/data/train-clean-360 ...
I0109 12:53:20.956301 396 NumberedFilesLoader.cpp:68] 104014 files found.
I0109 12:53:20.956308 396 NumberedFilesLoader.cpp:29] Adding dataset /home/asus/toolkit/wav2letter/recipes/librispeech/data/train-other-500 ...
I0109 12:53:20.956519 396 NumberedFilesLoader.cpp:68] 148688 files found.
I0109 13:01:06.092253 396 Utils.cpp:102] Filtered 0/281241 samples
I0109 13:01:06.122606 396 W2lNumberedFilesDataset.cpp:57] Total batches (i.e. iters): 140621
I0109 13:01:06.122756 396 NumberedFilesLoader.cpp:29] Adding dataset /home/asus/toolkit/wav2letter/recipes/librispeech/data/dev-clean ...
I0109 13:01:06.122949 396 NumberedFilesLoader.cpp:68] 2703 files found.
I0109 13:01:06.252243 396 Utils.cpp:102] Filtered 0/2703 samples
I0109 13:01:06.252511 396 W2lNumberedFilesDataset.cpp:57] Total batches (i.e. iters): 1352
I0109 13:01:06.252637 396 NumberedFilesLoader.cpp:29] Adding dataset /home/asus/toolkit/wav2letter/recipes/librispeech/data/dev-other ...
I0109 13:01:06.252900 396 NumberedFilesLoader.cpp:68] 2864 files found.
I0109 13:01:06.283288 396 Utils.cpp:102] Filtered 0/2864 samples
I0109 13:01:06.283576 396 W2lNumberedFilesDataset.cpp:57] Total batches (i.e. iters): 1432
I0109 13:01:06.389232 396 Train.cpp:440] Shuffling trainset
I0109 13:01:06.399230 396 Train.cpp:447] Epoch 1 started!
The process is stuck there, with no further progress.
Any suggestion?
Thanks
Hi,
It is not stuck. The model has started training and will output results after completing an epoch.
For debugging, you can specify -reportiters=100, so that you see output after every 100 updates instead of only at the end of an epoch.
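For example, reusing the paths from your earlier command (note: if train.cfg already sets --reportiters, change the value there instead, since values read from the flagsfile may take precedence over command-line flags):

/home/asus/toolkit/wav2letter/build/Train train --flagsfile /home/asus/toolkit/wav2letter/recipes/librispeech/config/conv_glu/train.cfg --reportiters=100 --logtostderr=1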
Hi @vineelpratap, I specified the option -reportiters=100 as you suggested, but there is still no further output, even after I changed the value from 100 to 1.
Thanks.
Hi @vineelpratap, it works now and shows output like the following.
I0110 09:31:28.321411 9934 Train.cpp:447] Epoch 1 started!
I0110 09:35:21.071099 9934 Train.cpp:242] epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:04 | bch(ms): 4409.71 | smp(ms): 45.76 | fwd(ms): 1617.31 | crit-fwd(ms): 248.14 | bwd(ms): 1183.14 | optim(ms): 1171.62 | loss: 75.53625 | train-TER: 99.57 | dev-other-TER: 100.00 | dev-clean-TER: 100.00 | avg-isz: 1593 | avg-tsz: 255 | max-tsz: 255 | hrs: 0.01 | thrpt(sec/sec): 7.22
I0110 09:39:14.963091 9934 Train.cpp:242] epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:00 | bch(ms): 779.86 | smp(ms): 0.16 | fwd(ms): 199.92 | crit-fwd(ms): 12.11 | bwd(ms): 546.26 | optim(ms): 32.17 | loss: 44.35780 | train-TER: 100.00 | dev-other-TER: 100.00 | dev-clean-TER: 100.00 | avg-isz: 589 | avg-tsz: 082 | max-tsz: 082 | hrs: 0.00 | thrpt(sec/sec): 15.11
I0110 09:43:30.351867 9934 Train.cpp:242] epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:00 | bch(ms): 832.61 | smp(ms): 0.29 | fwd(ms): 294.69 | crit-fwd(ms): 42.58 | bwd(ms): 507.20 | optim(ms): 27.72 | loss: 65.45861 | train-TER: 100.00 | dev-other-TER: 100.00 | dev-clean-TER: 100.00 | avg-isz: 1315 | avg-tsz: 236 | max-tsz: 236 | hrs: 0.01 | thrpt(sec/sec): 31.59
I0110 09:47:27.268770 9934 Train.cpp:242] epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:00 | bch(ms): 975.12 | smp(ms): 0.26 | fwd(ms): 365.65 | crit-fwd(ms): 68.95 | bwd(ms): 578.63 | optim(ms): 27.58 | loss: 68.15084 | train-TER: 100.00 | dev-other-TER: 100.00 | dev-clean-TER: 100.00 | avg-isz: 1584 | avg-tsz: 219 | max-tsz: 219 | hrs: 0.01 | thrpt(sec/sec): 32.49
I0110 09:51:33.794992 9934 Train.cpp:242] epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:00 | bch(ms): 884.70 | smp(ms): 0.28 | fwd(ms): 326.12 | crit-fwd(ms): 52.52 | bwd(ms): 528.01 | optim(ms): 27.73 | loss: 60.04827 | train-TER: 100.00 | dev-other-TER: 100.00 | dev-clean-TER: 100.00 | avg-isz: 1369 | avg-tsz: 209 | max-tsz: 209 | hrs: 0.01 | thrpt(sec/sec): 30.95
I0110 09:56:02.375025 9934 Train.cpp:242] epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:01 | bch(ms): 1295.75 | smp(ms): 0.43 | fwd(ms): 481.61 | crit-fwd(ms): 178.08 | bwd(ms): 776.41 | optim(ms): 31.01 | loss: 66.97047 | train-TER: 100.00 | dev-other-TER: 100.00 | dev-clean-TER: 100.00 | avg-isz: 1591 | avg-tsz: 271 | max-tsz: 271 | hrs: 0.01 | thrpt(sec/sec): 24.56
I0110 10:00:32.688040 9934 Train.cpp:242] epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:00 | bch(ms): 471.08 | smp(ms): 18.36 | fwd(ms): 194.36 | crit-fwd(ms): 12.15 | bwd(ms): 228.26 | optim(ms): 28.46 | loss: 28.03275 | train-TER: 100.00 | dev-other-TER: 100.00 | dev-clean-TER: 100.00 | avg-isz: 284 | avg-tsz: 053 | max-tsz: 053 | hrs: 0.00 | thrpt(sec/sec): 12.06
I0110 10:05:02.465401 9934 Train.cpp:242] epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:01 | bch(ms): 1027.92 | smp(ms): 0.38 | fwd(ms): 372.83 | crit-fwd(ms): 123.55 | bwd(ms): 602.47 | optim(ms): 46.90 | loss: 55.45941 | train-TER: 100.00 | dev-other-TER: 100.00 | dev-clean-TER: 100.00 | avg-isz: 1143 | avg-tsz: 200 | max-tsz: 200 | hrs: 0.01 | thrpt(sec/sec): 22.24
I0110 10:09:33.238551 9934 Train.cpp:242] epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:00 | bch(ms): 607.98 | smp(ms): 0.30 | fwd(ms): 262.85 | crit-fwd(ms): 59.04 | bwd(ms): 311.86 | optim(ms): 30.81 | loss: 31.37218 | train-TER: 100.00 | dev-other-TER: 100.00 | dev-clean-TER: 100.00 | avg-isz: 471 | avg-tsz: 070 | max-tsz: 070 | hrs: 0.00 | thrpt(sec/sec): 15.49
I0110 10:14:04.423800 9934 Train.cpp:242] epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:01 | bch(ms): 1181.19 | smp(ms): 12.01 | fwd(ms): 484.27 | crit-fwd(ms): 198.95 | bwd(ms): 633.15 | optim(ms): 39.47 | loss: 58.23077 | train-TER: 100.00 | dev-other-TER: 100.00 | dev-clean-TER: 100.00 | avg-isz: 1487 | avg-tsz: 188 | max-tsz: 188 | hrs: 0.01 | thrpt(sec/sec): 25.18
I0110 10:18:34.900118 9934 Train.cpp:242] epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:01 | bch(ms): 1087.83 | smp(ms): 0.37 | fwd(ms): 419.92 | crit-fwd(ms): 146.38 | bwd(ms): 611.25 | optim(ms): 51.10 | loss: 54.84644 | train-TER: 99.76 | dev-other-TER: 100.00 | dev-clean-TER: 100.00 | avg-isz: 1279 | avg-tsz: 226 | max-tsz: 226 | hrs: 0.01 | thrpt(sec/sec): 23.51
I0110 10:23:09.053712 9934 Train.cpp:242] epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:01 | bch(ms): 1210.90 | smp(ms): 25.08 | fwd(ms): 482.85 | crit-fwd(ms): 190.75 | bwd(ms): 667.32 | optim(ms): 30.11 | loss: 56.87938 | train-TER: 100.00 | dev-other-TER: 100.00 | dev-clean-TER: 100.00 | avg-isz: 1507 | avg-tsz: 216 | max-tsz: 216 | hrs: 0.01 | thrpt(sec/sec): 24.89
I0110 10:27:46.300349 9934 Train.cpp:242] epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:00 | bch(ms): 973.28 | smp(ms): 22.60 | fwd(ms): 398.00 | crit-fwd(ms): 158.01 | bwd(ms): 508.65 | optim(ms): 40.00 | loss: 47.13281 | train-TER: 100.00 | dev-other-TER: 100.00 | dev-clean-TER: 100.00 | avg-isz: 1004 | avg-tsz: 175 | max-tsz: 175 | hrs: 0.01 | thrpt(sec/sec): 20.63
I0110 10:32:21.205981 9934 Train.cpp:242] epoch: 1 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:01 | bch(ms): 1059.22 | smp(ms): 29.77 | fwd(ms): 398.35 | crit-fwd(ms): 119.62 | bwd(ms): 594.78 | optim(ms): 31.38 | loss: 55.51303 | train-TER: 100.00 | dev-other-TER: 100.00 | dev-clean-TER: 100.00 | avg-isz: 1346 | avg-tsz: 189 | max-tsz: 189 | hrs: 0.01 | thrpt(sec/sec): 25.41
I used -reportiters=1. I think the training is too slow. Do you know why?
Thanks
@misbullah — it's hard to say why training might be slow; it might have to do with your GPU/CPU hardware, or with how you built flashlight/wav2letter++ (Debug versus Release configurations with CMake). If you're trying to reproduce the training speed in our paper, you can see the hardware we used there and adjust your expectations accordingly.
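One quick sanity check (generic CMake, not wav2letter-specific) is to confirm the build was actually configured as Release:

grep CMAKE_BUILD_TYPE CMakeCache.txt  # run inside the build directory; expect CMAKE_BUILD_TYPE:STRING=Release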
Because I did not obtain good results with my own data, I trained on the full LibriSpeech corpus. I also hit the same issue... Any suggestion why this happens when running the recipe on its own dataset?
I0115 16:35:39.593119 21623 Train.cpp:443] Epoch 1 started!
F0116 21:26:00.596745 21623 Train.cpp:468] Loss has NaN values
Check failure stack trace:
@ 0x7f86998164ed google::LogMessage::Fail()
@ 0x7f8699818aa3 google::LogMessage::SendToLog()
@ 0x7f869981607b google::LogMessage::Flush()
@ 0x7f86998179ee google::LogMessageFatal::~LogMessageFatal()
@ 0x45bea6 _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEERNS0_19FirstOrderOptimizerES9_biE4_clES2_S5_S7_S9_S9_bi.constprop.8719
@ 0x418adb main
@ 0x7f8621c43830 __libc_start_main
@ 0x457339 _start
Aborted (core dumped)
@FredericGodin — this is usually because of divergence, or over- or underflow of the network emissions in the criterion. I'd try continuing to decrease your learning rate and criterion learning rate if you're using ASG.
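For example, in train.cfg (an illustrative halving of the recipe defaults, not a tuned value; keep reducing until the loss stops diverging):

--lr=0.3        # recipe default is 0.6
--lrcrit=0.003  # recipe default is 0.006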
I was also able to build wav2letter++ from master and run the LibriSpeech recipe using the Train command, but I got the following error:
asus@asus-M51AC:~/toolkit/wav2letter$ /home/asus/toolkit/wav2letter/build/Train train --flagsfile /home/asus/toolkit/wav2letter/recipes/librispeech/config/conv_glu/train.cfg --rundir recipes/librispeech/
F0104 23:37:18.678984 7164 Train.cpp:472] Loss has NaN values
Check failure stack trace:
@ 0x7fde58c535cd google::LogMessage::Fail()
@ 0x7fde58c55433 google::LogMessage::SendToLog()
@ 0x7fde58c5315b google::LogMessage::Flush()
@ 0x7fde58c55e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x45b9d8 _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEERNS0_19FirstOrderOptimizerES9_biE4_clES2_S5_S7_S9_S9_bi.constprop.8990
@ 0x418c2b main
@ 0x7fddfeb1a830 __libc_start_main
@ 0x457359 _start
@ (nil) (unknown)
Aborted (core dumped)
The process ran for almost one day without any output or progress, then showed the above error.
I used the versions of CUDA, NCCL, cuDNN, and flashlight mentioned in CMakeLists.txt. I ran it on Ubuntu 16.04.
Any suggestion?
Thanks.