PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Semantic Role Labeling - Unable to train beyond 120 passes #720

Closed: chaitjo closed this issue 7 years ago

chaitjo commented 7 years ago

I'm trying out the Semantic Role Labeling demo from here and ran into some problems during training.

I set the number of passes to 500 in train.sh. Here are the parameters I used for training after downloading the data -

set -e
paddle train \
  --config=./db_lstm.py \
  --use_gpu=0 \
  --log_period=5000 \
  --trainer_count=1 \
  --show_parameter_stats_period=5000 \
  --save_dir=./output \
  --num_passes=500 \
  --average_test_period=10000000 \
  --init_model_path=./data \
  --load_missing_parameter_strategy=rand \
  --test_all_data_in_one_period=1 \
  2>&1 | tee 'train.log'

I'm using a Linux virtual machine with 2 GB of RAM.

I encountered the following error twice, each time on the 120th pass -

I1203 15:01:13.719069 27973 TrainerInternal.cpp:180]  Pass=120 Batch=1110 samples=148647 AvgCost=0.178114 Eval: __sum_evaluator_0__=0.130679 
I1203 15:02:24.070485 27973 Tester.cpp:111]  Test samples=148647 cost=0.140363 Eval: __sum_evaluator_0__=0.0888548 
I1203 15:02:24.093823 27973 GradientMachine.cpp:112] Saving parameters to ./output/pass-00120
I1203 15:02:24.150364 27973 Util.cpp:226] copy ./db_lstm.py to ./output/pass-00120
..............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................Thread [139777238976384] Forwarding hidden0, __embedding_6__, __embedding_5__, __embedding_4__, __embedding_3__, __embedding_2__, __embedding_1__, word_ctx-in_embedding, __embedding_0__, target, mark_data, ctx_p2_data, ctx_p1_data, ctx_0_data, ctx_n1_data, ctx_n2_data, verb_data, word_data, 
*** Aborted at 1480777543 (unix time) try "date -d @1480777543" if you are using GNU date ***
PC: @     0x7f206a3b6fc1 (unknown)
*** SIGFPE (@0x7f206a3b6fc1) received by PID 27973 (TID 0x7f206cad4780) from PID 1782280129; stack trace: ***
    @     0x7f206c3b8330 (unknown)
    @     0x7f206a3b6fc1 (unknown)
/usr/bin/paddle: line 81: 27973 Floating point exception(core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}

The entire train.log file can be found here.

I believe I can still use the 120th model checkpoint, but I have yet to try it out.

Why am I unable to train beyond this point and how do I overcome this?

hedaoyuan commented 7 years ago

@zhangjcqq is the author of this demo.

zhangjcqq commented 7 years ago

The parameter configuration is provided for training on the CoNLL dataset, but that training set is not public. Therefore, we use its test split in this demo, just to show how to re-implement the related paper's experiments. Actually, running this demo for many passes is not very meaningful, because without a large training set the user cannot obtain a practical system. The floating-point exception is caused by numerical overflow, which the operating system reports as SIGFPE. We may need to provide a friendlier way to handle this in the future. A gradient clipping strategy and proper parameters, such as the initial weights and the learning rate, are useful for avoiding that exception.
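
For illustration, here is a minimal sketch of the kind of configuration change described above, written against the v1 trainer_config_helpers API that db_lstm.py uses. The specific values, and the std_default attribute name, are illustrative assumptions rather than the demo's actual settings, and the exact keyword arguments may vary between Paddle versions.

# Illustrative sketch only: add gradient clipping and more conservative
# hyperparameters to a db_lstm.py-style trainer config. The values below
# are assumptions, not the demo's defaults; tune them for your data.
from paddle.trainer_config_helpers import *

settings(
    batch_size=150,
    learning_method=MomentumOptimizer(momentum=0.0),
    learning_rate=1e-3,                 # smaller learning rate to slow parameter growth
    gradient_clipping_threshold=25.0,   # clip large gradients before they overflow
)

# Layer parameters can also be given a tighter initial weight distribution,
# e.g. via a shared (hypothetical) attribute passed as param_attr to layers:
std_default = ParameterAttribute(initial_std=0.01)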

Yancey1989 commented 7 years ago

I'll close this issue; if there is an update, please reopen it.