Which configuration are you referring to, precisely?
Looks like the optimization parameters are not tuned for your task/dataset.
I used the default one, the one that comes with the repository. As training on the full dataset appeared to be taking a long time, I created a subset of 100 sentences from it and gave it a try, and that is how I got this error. Am I supposed to change any of the settings to cater to this small dataset?
You should give the actual command line you executed and any custom configuration files. It's not clear to me what you did exactly.
cd OpenNMT-tf
onmt-build-vocab --size 50000 --save_vocab data/toy-ende/src-vocab.txt data/toy-ende/src-train.txt
onmt-build-vocab --size 50000 --save_vocab data/toy-ende/tgt-vocab.txt data/toy-ende/tgt-train.txt
onmt-main train_and_eval --model_type NMTSmall --config config/opennmt-defaults.yml config/data/toy-ende.yml
After how many steps was the NaN error raised? Even on a 100-line corpus, my training runs without issue.
If you can share the complete procedure to reproduce the issue starting from a fresh OpenNMT-tf installation that could be helpful.
Closing due to lack of activity. Feel free to reopen this issue if you can give steps to reproduce this behavior.
Hi, I am getting the same error here with multi_source_nmt training, right from the first step:
params:
  average_loss_in_time: true
  beam_width: 5
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_rate: 0.5
  decay_step_duration: 1
  decay_type: noam_decay_v2
  label_smoothing: 0.1
  learning_rate: 1.0
  length_penalty: 0.6
  loss_scale: logmax
  optimizer: GradientDescentOptimizer
  optimizer_params:
    beta1: 0.9
    beta2: 0.998
  scale_min: 1.0
  staircase: true
  step_factor: 2.0
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 64
  batch_type: tokens
  bucket_width: 1
  effective_batch_size: 25000
  keep_checkpoint_max: 50
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: 1000
  save_checkpoints_steps: 300
  save_summary_steps: 100
  single_pass: false
I tried changing loss_scale to different options, but I still get the same problem.
When I decrease the learning rate, it trains for a certain number of steps and then diverges again. Any idea why this might happen?
Your params section is a mix of everything; I am not sure how you came up with that. Unless you know what you are doing, start with a simpler configuration, for example:
params:
  optimizer: AdamOptimizer
  learning_rate: 0.0002
  beam_width: 5
train:
  batch_size: 64
  bucket_width: 1
  maximum_features_length: 80
  maximum_labels_length: 80
  sample_buffer_size: 100000
  train_steps: 500000
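For reference, a configuration like this is saved to a YAML file and passed with --config when launching training, following the same pattern as the command quoted earlier in this thread; the file names and the model type below are only placeholders:

onmt-main train_and_eval --model_type NMTSmall --config my_config.yml my_data.yml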
For the params section, I am specifically interested in a learning rate of 1.0, Gradient Descent, and a learning rate decay of 0.5. The only value I added myself was label_smoothing: 0.1; almost all the other options were filled in by default. For the sake of my experiments, I tried removing every parameter I never specified and leaving everything at its default value, except for what I specifically need, and it still did not work.
Then I tried with the following params:
params:
  beam_width: 5
  decay_rate: 0.5
  decay_step_duration: 1
  label_smoothing: 0.1
  learning_rate: 1.0
  length_penalty: 0
  optimizer: AdamOptimizer
train:
  batch_size: 64
  batch_type: examples
  bucket_width: 1
  keep_checkpoint_max: 50
  maximum_features_length: 80
  maximum_labels_length: 80
  sample_buffer_size: 100000
  save_checkpoints_steps: 300
  single_pass: false
  train_steps: 500000
It crashed.
Now I have modified the parameters to the ones you specified, with lr 0.0002; it will be interesting to see whether training completes without problems. I think it will stop after a number of steps, though, just like before. I will keep you updated. Nonetheless, it would be nice to be able to apply the Gradient Descent optimizer with lr 1.0 for my study case...
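For reference, this is roughly the params section I have in mind for that SGD setup; the decay_type, decay_steps, and clip_gradients entries below are assumptions on my part (values I would try, not settings confirmed anywhere in this thread):

params:
  optimizer: GradientDescentOptimizer
  learning_rate: 1.0
  decay_type: exponential_decay  # assumed decay function name, not verified
  decay_rate: 0.5
  decay_steps: 10000             # placeholder value
  decay_step_duration: 1
  label_smoothing: 0.1
  clip_gradients: 5.0            # assumed option to cap the global gradient norm

The intent is to keep the lr 1.0 / decay 0.5 combination I need while limiting the gradient norm, which is usually the first thing to try when SGD with a large learning rate produces NaN losses.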
I got this error when training on a corpus of 100 lines. I did not change any of the configuration files that come with the default repository.
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\ProgramData\Anaconda3\envs\tf-cpu-18\Scripts\onmt-main.exe\__main__.py", line 9, in <module>
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\opennmt\bin\main.py", line 133, in main
runner.train_and_evaluate()
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\opennmt\runner.py", line 148, in train_and_evaluate
tf.estimator.train_and_evaluate(self._estimator, train_spec, eval_spec)
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\estimator\training.py", line 439, in train_and_evaluate
executor.run()
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\estimator\training.py", line 518, in run
self.run_local()
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\estimator\training.py", line 650, in run_local
hooks=train_hooks)
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\estimator\estimator.py", line 363, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\estimator\estimator.py", line 843, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\estimator\estimator.py", line 859, in _train_model_default
saving_listeners)
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1059, in _train_with_estimatorspec
, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\training\monitored_session.py", line 567, in run
run_metadata=run_metadata)
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1043, in run
run_metadata=run_metadata)
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1134, in run
six.reraise(*original_exc_info)
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\six.py", line 693, in reraise
raise value
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1119, in run
return self._sess.run(args, **kwargs)
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1199, in run
run_metadata=run_metadata))
File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 623, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.