PaddlePaddle / PaddleSpeech

Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting. Won NAACL2022 Best Demo Award.
https://paddlespeech.readthedocs.io
Apache License 2.0

Failed to train LibriSpeech using example script. #114

Closed: misbullah closed this issue 3 years ago

misbullah commented 6 years ago

Hi, I tried to train a model on the full LibriSpeech dataset (960 h) with data augmentation, using 2 GTX 1080 GPUs. After pass 10, I got the following error.

I0110 17:18:34.314939 47267 FirstOrderOptimizer.cpp:321] parameter=_batch_norm_4_.w0 need clipping by local threshold=400, max grad=1.86242e+09, avg grad=7.7372e+07
I0110 17:18:34.315071 47267 FirstOrderOptimizer.cpp:321] parameter=batch_norm_4.wbias need clipping by local threshold=400, max grad=2.21661e+09, avg grad=7.25256e+07
I0110 17:18:34.319563 47267 FirstOrderOptimizer.cpp:321] parameter=_recurrent_layer_4.w0 need clipping by local threshold=400, max grad=4.5849e+09, avg grad=7.94203e+06
I0110 17:18:34.319880 47267 FirstOrderOptimizer.cpp:321] parameter=_recurrent_layer_4.wbias need clipping by local threshold=400, max grad=2.21661e+09, avg grad=7.25256e+07
.I0110 17:18:34.418779 47267 FirstOrderOptimizer.cpp:321] parameter=_batch_norm_0.w0 need clipping by local threshold=400, max grad=942.53, avg grad=156.387
I0110 17:18:34.418969 47267 FirstOrderOptimizer.cpp:321] parameter=_batch_norm_0.wbias need clipping by local threshold=400, max grad=1613.23, avg grad=205.539
I0110 17:18:34.419972 47267 FirstOrderOptimizer.cpp:321] parameter=_batch_norm_1.w0 need clipping by local threshold=400, max grad=1028.06, avg grad=339.691
I0110 17:18:34.420138 47267 FirstOrderOptimizer.cpp:321] parameter=_batch_norm_1.wbias need clipping by local threshold=400, max grad=1713.61, avg grad=505.019
I0110 17:18:34.426007 47267 FirstOrderOptimizer.cpp:321] parameter=_batch_norm_2.w0 need clipping by local threshold=400, max grad=905.969, avg grad=26.9515
I0110 17:18:34.426167 47267 FirstOrderOptimizer.cpp:321] parameter=_batch_norm_2.wbias need clipping by local threshold=400, max grad=1077.66, avg grad=32.965
I0110 17:18:34.430650 47267 FirstOrderOptimizer.cpp:321] parameter=_recurrent_layer_0.w0 need clipping by local threshold=400, max grad=2006.8, avg grad=1.30195
I0110 17:18:34.430976 47267 FirstOrderOptimizer.cpp:321] parameter=_recurrent_layer_0__.wbias need clipping by local threshold=400, max grad=1071.72, avg grad=31.856
.
Aborted at 1515575915 (unix time) try "date -d @1515575915" if you are using GNU date
PC: @ 0x0 (unknown)
SIGFPE (@0x7fc03283ad89) received by PID 47267 (TID 0x7fc0b121c740) from PID 847490441; stack trace:
@ 0x7fc0b0e11330 (unknown)
@ 0x7fc03283ad89 paddle::GpuVectorT<>::getAbsMax()
@ 0x7fc032afbef6 paddle::OptimizerWithGradientClipping::update()
@ 0x7fc032ae1ddd paddle::SgdThreadUpdater::updateImpl()
@ 0x7fc03299ed51 ParameterUpdater::update()
@ 0x7fc03257a336 _wrap_ParameterUpdater_update
@ 0x52714b PyEval_EvalFrameEx
@ 0x555551 PyEval_EvalCodeEx
@ 0x525560 PyEval_EvalFrameEx
@ 0x555551 PyEval_EvalCodeEx
@ 0x524338 PyEval_EvalFrameEx
@ 0x555551 PyEval_EvalCodeEx
@ 0x524338 PyEval_EvalFrameEx
@ 0x555551 PyEval_EvalCodeEx
@ 0x525560 PyEval_EvalFrameEx
@ 0x555551 PyEval_EvalCodeEx
@ 0x525560 PyEval_EvalFrameEx
@ 0x567d14 (unknown)
@ 0x465bf4 PyRun_FileExFlags
@ 0x46612d PyRun_SimpleFileExFlags
@ 0x466d92 Py_Main
@ 0x7fc0b0a59f45 __libc_start_main
@ 0x577c2e (unknown)
@ 0x0 (unknown)
run_train.sh: line 35: 47267 Floating point exception (core dumped) CUDA_VISIBLE_DEVICES=0,2 python -u train.py --init_model_path='/var/nlp/alim/paddle-deepspeech/checkpoints/libri/params.latest.tar.gz' --batch_size=16 --trainer_count=2 --num_passes=20 --num_proc_data=16 --num_conv_layers=2 --num_rnn_layers=3 --rnn_layer_size=1024 --num_iter_print=100 --learning_rate=5e-4 --max_duration=27.0 --min_duration=0.0 --test_off=False --use_sortagrad=True --use_gru=False --use_gpu=True --is_local=True --share_rnn_weights=True --train_manifest='data/librispeech/manifest.train' --dev_manifest='data/librispeech/manifest.dev-clean' --mean_std_path='data/librispeech/mean_std.npz' --vocab_path='data/librispeech/vocab.txt' --output_model_dir='./checkpoints/libri' --augment_conf_path='conf/augmentation.config' --specgram_type='linear' --shuffle_method='batch_shuffle_clipped'
Failed in training!

Any suggestions?

Thanks, Alim

kuke commented 6 years ago

@misbullah It seems a floating point exception occurred. Since the model has already been trained successfully for several passes, you can try to resume training from the model saved in the last pass or from the latest saved checkpoint.
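
A resume run could look roughly like the sketch below. The snapshot path is only an assumption based on the --output_model_dir='./checkpoints/libri' from your command (point --init_model_path at whatever your run actually saved, e.g. params.latest.tar.gz or a per-pass snapshot), and every other flag should stay identical to your original train.py invocation:

# Resume sketch (hypothetical snapshot path; keep the remaining flags from
# the original command unchanged).
CUDA_VISIBLE_DEVICES=0,2 python -u train.py \
    --init_model_path='./checkpoints/libri/params.latest.tar.gz' \
    --output_model_dir='./checkpoints/libri' \
    --trainer_count=2 --use_gpu=True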

misbullah commented 6 years ago

@kuke I tried to resume from the latest model using the checkpoint option in train.py, but the same error still occurs after some batches in the first pass.

Thanks, Alim

kuke commented 6 years ago

@misbullah Can you post the validation loss? I suspect the floating point exception results from gradient explosion, in which case the training no longer converges.
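
If it does turn out to be a gradient explosion, one common mitigation (just a sketch, not verified on your setup) is to resume from the last healthy snapshot with a learning rate smaller than the 5e-4 you used, for example:

# Hypothetical: same as the resume sketch above, but with a reduced
# learning rate (the original run used --learning_rate=5e-4); keep the
# remaining flags from the original command unchanged.
CUDA_VISIBLE_DEVICES=0,2 python -u train.py \
    --init_model_path='./checkpoints/libri/params.latest.tar.gz' \
    --learning_rate=1e-4 \
    --trainer_count=2 --use_gpu=True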

misbullah commented 6 years ago

@kuke,

I don't have a record of the validation loss anymore, because I ran the experiment in a tmux session on Ubuntu and didn't keep any log. But, as you said, a gradient explosion may have happened, because the training loss kept getting larger and larger.

Is there any option to use different layers, such as fully-connected (ReLU) layers, LSTM layers, or TDNN (Time-Delay Neural Network) layers with a ReLU activation function? I mention this because I found those layers in the Kaldi toolkit, which is also used for speech recognition.

One more question: can DeepSpeech record a log by passing an option during the training process?

Thanks, Alim

kuke commented 6 years ago

@misbullah

For your 1st question: if you dig into this project, you will find that the model structure can be changed in network.py to use these layers.

For your 2nd question: there is no such option yet, but you can redirect the output into a file. It's quite easy; just run training as below:

nohup sh run_train.sh > log.txt 2>&1&
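
You can then follow the training progress, including the loss values, while it runs:

tail -f log.txt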