OpenNMT / OpenNMT-tf

Neural machine translation and sequence learning using TensorFlow
https://opennmt.net/
MIT License

ERROR:tensorflow:Model diverged with loss = NaN. #156

Closed: jiangweiatgithub closed this issue 6 years ago

jiangweiatgithub commented 6 years ago

I got this error when training on a corpus of 100 lines. I did not change any of the configurations that come with the default git.

ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\ProgramData\Anaconda3\envs\tf-cpu-18\Scripts\onmt-main.exe\__main__.py", line 9, in <module>
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\opennmt\bin\main.py", line 133, in main
    runner.train_and_evaluate()
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\opennmt\runner.py", line 148, in train_and_evaluate
    tf.estimator.train_and_evaluate(self._estimator, train_spec, eval_spec)
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\estimator\training.py", line 439, in train_and_evaluate
    executor.run()
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\estimator\training.py", line 518, in run
    self.run_local()
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\estimator\training.py", line 650, in run_local
    hooks=train_hooks)
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\estimator\estimator.py", line 363, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\estimator\estimator.py", line 843, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\estimator\estimator.py", line 859, in _train_model_default
    saving_listeners)
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1059, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\training\monitored_session.py", line 567, in run
    run_metadata=run_metadata)
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1043, in run
    run_metadata=run_metadata)
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1134, in run
    raise six.reraise(*original_exc_info)
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\six.py", line 693, in reraise
    raise value
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1119, in run
    return self._sess.run(*args, **kwargs)
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1199, in run
    run_metadata=run_metadata))
  File "d:\programdata\anaconda3\envs\tf-cpu-18\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 623, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

guillaumekln commented 6 years ago

Which configuration are you referring to, precisely?

Looks like the optimization parameters are not tuned for your task/dataset.
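
As a hedged illustration of what such an override could look like (the exact defaults in config/opennmt-defaults.yml may differ, and the values below are only a starting point, not a tuned recommendation), a small params/train override layered on top of the default configuration might be:

params:
  optimizer: AdamOptimizer
  learning_rate: 0.0001
train:
  batch_size: 32
  save_checkpoints_steps: 50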

jiangweiatgithub commented 6 years ago

I used the default configuration, the one that comes with the repository. As training on the full dataset appeared to be taking a long time, I created a 100-sentence subset from it and gave that a try, and got this error. Am I supposed to change any of the settings to cater to this small dataset?

guillaumekln commented 6 years ago

You should give the actual command line you executed and any custom configuration files. It's not clear to me what you did exactly.

jiangweiatgithub commented 6 years ago

cd OpenNMT-tf
onmt-build-vocab --size 50000 --save_vocab data/toy-ende/src-vocab.txt data/toy-ende/src-train.txt
onmt-build-vocab --size 50000 --save_vocab data/toy-ende/tgt-vocab.txt data/toy-ende/tgt-train.txt
onmt-main train_and_eval --model_type NMTSmall --config config/opennmt-defaults.yml config/data/toy-ende.yml
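
For context, config/data/toy-ende.yml is expected to point at the training/evaluation files and the vocabularies built by the commands above; a sketch of its typical quickstart contents (file names follow the quickstart layout and may differ in your checkout):

data:
  train_features_file: data/toy-ende/src-train.txt
  train_labels_file: data/toy-ende/tgt-train.txt
  eval_features_file: data/toy-ende/src-val.txt
  eval_labels_file: data/toy-ende/tgt-val.txt
  source_words_vocabulary: data/toy-ende/src-vocab.txt
  target_words_vocabulary: data/toy-ende/tgt-vocab.txt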

guillaumekln commented 6 years ago

After how many steps was the NaN error raised? Even on a 100-line corpus, my training runs without issue.

If you can share the complete procedure to reproduce the issue, starting from a fresh OpenNMT-tf installation, that would be helpful.

guillaumekln commented 6 years ago

Closing due to lack of activity. Feel free to reopen this issue if you can give steps to reproduce this behavior.

Lara192 commented 5 years ago

Hi, I am getting the same error here with multi_source_nmt training, right from the first step:

params:
  average_loss_in_time: true
  beam_width: 5
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_rate: 0.5
  decay_step_duration: 1
  decay_type: noam_decay_v2
  label_smoothing: 0.1
  learning_rate: 1.0
  length_penalty: 0.6
  loss_scale: logmax
  optimizer: GradientDescentOptimizer
  optimizer_params:
    beta1: 0.9
    beta2: 0.998
  scale_min: 1.0
  staircase: true
  step_factor: 2.0
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 64
  batch_type: tokens
  bucket_width: 1
  effective_batch_size: 25000
  keep_checkpoint_max: 50
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: 1000
  save_checkpoints_steps: 300
  save_summary_steps: 100
  single_pass: false

I tried changing loss_scale to different options; still the same problem.

When I decrease the learning rate, it trains for a certain number of steps and then diverges again. Any idea why this might happen?
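
As a hedged sketch (not a confirmed fix), one common way to avoid both the immediate NaN and the later divergence is to warm the learning rate up before decaying it, reusing the noam_decay_v2 entries already present in the dump above together with an adaptive optimizer:

params:
  optimizer: AdamOptimizer
  optimizer_params:
    beta1: 0.9
    beta2: 0.998
  learning_rate: 1.0
  decay_type: noam_decay_v2
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  label_smoothing: 0.1

With this schedule the rate starts near zero and only reaches its peak after warmup_steps, which usually keeps the first updates from blowing up the loss.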

guillaumekln commented 5 years ago

Your params section is a mix of everything; I'm not sure how you came up with that. Unless you know what you are doing, start with a simpler configuration, for example:

params:
  optimizer: AdamOptimizer
  learning_rate: 0.0002
  beam_width: 5
train:
  batch_size: 64
  bucket_width: 1
  maximum_features_length: 80
  maximum_labels_length: 80
  sample_buffer_size: 100000
  train_steps: 500000

Lara192 commented 5 years ago

For the params section, I am specifically interested in a learning rate of 1.0, gradient descent, and a learning rate decay of 0.5. I only added the label_smoothing value of 0.1; almost all the other options were filled in automatically. For the sake of my experiments, I tried removing all the parameters I never specified and setting everything to the default values, except for what I specifically need, and it still didn't work.

Then I tried with the following params:

params:
  beam_width: 5
  decay_rate: 0.5
  decay_step_duration: 1
  label_smoothing: 0.1
  learning_rate: 1.0
  length_penalty: 0
  optimizer: AdamOptimizer

train:
  batch_size: 64
  batch_type: examples
  bucket_width: 1
  keep_checkpoint_max: 50
  maximum_features_length: 80
  maximum_labels_length: 80
  sample_buffer_size: 100000
  save_checkpoints_steps: 300
  single_pass: false
  train_steps: 500000

It crashed.

Now I have modified the parameters to the ones you specified, with lr 0.0002; it will be interesting to see if the training completes without problems. I think it will stop after a number of steps, though, just like before. I will keep you updated. Nonetheless, it would be nice to be able to apply the GradientDescentOptimizer with lr 1.0 for my study case...
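
For completeness, a minimal sketch of a plain SGD setup with the requested lr 1.0 and 0.5 decay. The clip_gradients and decay_steps option names, and the assumption that decay_type accepts the tf.train.exponential_decay schedule name, should be checked against the documentation of the OpenNMT-tf version in use here:

params:
  optimizer: GradientDescentOptimizer
  learning_rate: 1.0
  clip_gradients: 5.0        # assumed option: maximum global gradient norm
  decay_type: exponential_decay
  decay_rate: 0.5
  decay_steps: 10000         # assumed option: how often the decay is applied
  staircase: true

Without some form of gradient clipping, SGD at lr 1.0 on a freshly initialized model is very likely to overflow within the first updates, which would match the NaN reported at step 1.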