OpenNMT / OpenNMT-tf

Neural machine translation and sequence learning using TensorFlow
https://opennmt.net/
MIT License

RuntimeError: Model diverged with loss = NaN. #890

Closed · FPBHW closed this issue 2 years ago

FPBHW commented 2 years ago

Hi,

this might be more of a question than an issue, but I'm observing loss = NaN and I don't see how this can happen, at least not from a theoretical standpoint.

It is possible for the loss to be infinite when a prediction is a hard 0 or 1 and the target is the opposite. However, as far as I understand, a NaN should not occur.
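
For example, the cross-entropy contribution of a hard wrong prediction is -log(0), which is infinite; a minimal illustration:

>>> import tensorflow as tf
>>> -tf.math.log(tf.constant(0.0))
<tf.Tensor: shape=(), dtype=float32, numpy=inf>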

In summary: using classic SGD with OpenNMT-tf 2.20.1, a NaN loss is encountered after 49900 steps.

Does anyone have an explanation?

Many thanks in advance, Fokko

FPBHW commented 2 years ago

Hypothesis: inside the model, positive and negative INFs are generated and then summed, leading to a prediction of NaN. If this is the case, it would be better to report the prediction as NaN.

guillaumekln commented 2 years ago

Hypothesis: inside the model, positive and negative INFs are generated and then summed, leading to a prediction of NaN.

Yes, for example:

>>> import tensorflow as tf
>>> x = tf.constant([float("inf")], dtype=tf.float32)
>>> tf.nn.softmax(x)
<tf.Tensor: shape=(1,), dtype=float32, numpy=array([nan], dtype=float32)>
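
More directly, summing a positive and a negative infinity gives NaN:

>>> tf.constant(float("inf")) + tf.constant(float("-inf"))
<tf.Tensor: shape=(), dtype=float32, numpy=nan>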

If this is the case, it would be better to report the prediction as NaN.

What do you mean exactly? What should be improved in the current way of reporting a training that diverges?

FPBHW commented 2 years ago

Thank you - indeed, an input of inf would also cause a prediction of NaN (and thus a loss of NaN). What I meant to say is that it would be nice to get feedback about a NaN (or even an inf) as early as possible, so that the cause is clearer.

If the gradient is ±Inf for a particular sample, it would be very helpful if that sample were flagged - that is, if the divergence can be explained by bad data.

This issue popped up in two different trainings with different data and different SGD methods; in both cases the loss was moderate and gradually decreasing before it suddenly diverged. So the NaN is a surprise.

A potential alternative hypothesis is that there is a bug somewhere (although data issues are more likely).

Do you have a suggestion for how to get insight into what suddenly caused the model to diverge while it seemed to be converging normally? In particular, how could we flag the data being processed at that moment?

FPBHW commented 2 years ago

Example:

2021-10-07 17:46:13.224000: I runner.py:281] Step = 1238300 ; steps/s = 3.98, source words/s = 280434, target words/s = 307762 ; Learning rate = 0.000079 ; Loss = 1.851973
2021-10-07 17:46:39.361000: I runner.py:281] Step = 1238400 ; steps/s = 3.83, source words/s = 297033, target words/s = 326020 ; Learning rate = 0.000079 ; Loss = 1.801115
Traceback (most recent call last):
  File "/root/miniconda3/bin/onmt-main", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/bin/main.py", line 326, in main
    hvd=hvd,
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/runner.py", line 281, in train
    moving_average_decay=train_config.get("moving_average_decay"),
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/training.py", line 123, in __call__
    dataset, accum_steps=accum_steps, report_steps=report_steps
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/training.py", line 268, in _steps
    raise RuntimeError("Model diverged with loss = NaN.")
RuntimeError: Model diverged with loss = NaN.
[Modelarts Service Log]Training completed.

FPBHW commented 2 years ago

Another example, this time with classic SGD:

2021-10-05 22:55:36.368882: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 10 of 500000
2021-10-05 22:55:38.246023: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:230] Shuffle buffer filled.
2021-10-05 22:55:52.652000: I runner.py:281] Step = 49800 ; steps/s = 2.93, source words/s = 52750, target words/s = 42947 ; Learning rate = 0.010000 ; Loss = 4.686182
2021-10-05 22:56:27.645000: I runner.py:281] Step = 49900 ; steps/s = 2.86, source words/s = 52161, target words/s = 42367 ; Learning rate = 0.010000 ; Loss = 4.904016
Traceback (most recent call last):
  File "/root/miniconda3/bin/onmt-main", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/bin/main.py", line 326, in main
    hvd=hvd,
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/runner.py", line 281, in train
    moving_average_decay=train_config.get("moving_average_decay"),
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/training.py", line 123, in __call__
    dataset, accum_steps=accum_steps, report_steps=report_steps
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/training.py", line 268, in _steps
    raise RuntimeError("Model diverged with loss = NaN.")
RuntimeError: Model diverged with loss = NaN.
[Modelarts Service Log]Training completed.

FPBHW commented 2 years ago

Using SGD we see the impact of outliers quite clearly, but the sudden divergence is still unexpected. If there is any way to establish whether this is a data issue, that would be very welcome.

Any suggestions?


guillaumekln commented 2 years ago

Misaligned data can indeed cause spikes in the training loss and then exploding gradients if there is no countermeasure. Since it is usually difficult to ensure that every example in the dataset is clean enough, the training parameters should be tuned to prevent divergence.

For example, with SGD it is usually required to clip the gradients to prevent them from exploding. Are you doing that? (See the clip* optimizer parameters.)
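
As a minimal sketch (assuming the clip* arguments, e.g. clipnorm / clipvalue of the underlying Keras optimizer, are passed through optimizer_params, and with 1.0 as a placeholder value):

params:
  optimizer: SGD
  learning_rate: 0.01
  optimizer_params:
    clipnorm: 1.0  # placeholder value; clips each gradient tensor to this L2 norm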

FPBHW commented 2 years ago

I disabled clipping specifically to be able to determine the root cause of the NaN ...

Is there a way to get the sample or batch, and the associated per-sample loss, at the moment of the crash?

guillaumekln commented 2 years ago

The training samples seen at the moment of the crash may not be the outliers you are looking for, because the divergence may have started a few iterations earlier.

A better way to check your training data for outliers is to train a standard model (e.g. a Transformer with --auto_config) and then run the score command on your training data. The outliers should have a high score relative to the other examples.
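
A rough sketch of such a scoring run (hypothetical file and directory names):

onmt-main --config config.yml --auto_config --checkpoint_path model_dir \
    score --features_file train.src.tok --predictions_file train.tgt.tok > train.scores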

FPBHW commented 2 years ago

I'm currently trying to run score on the data to figure out which examples are causing the issue.

This resulted in the following crash:

2021-10-11 19:29:13.447309: F tensorflow/core/kernels/softmax_op_gpu.cu.cc:275] Non-OK-status: GpuLaunchKernel( GenerateNormalizedProb<T, acc_type, kUnroll>, numBlocks, numThreadsPerBlock, 0, cu_stream, reinterpret_cast<const T*>(logits_in_.flat<T>().data()), reinterpret_cast<const acc_type*>( sum_probs.flat<acc_type>().data()), reinterpret_cast<const T*>(max_logits.flat<T>().data()), const_cast<T*>(softmax_out->flat<T>().data()), rows, cols, log_) status: Internal: invalid configuration argument
opennmt-training/src/score-filter.sh: line 154:   663 Aborted                 onmt-main --config ${SCORE_CONFIG} --auto_config --checkpoint_path ${LOCAL_MODEL_DIR} ${MODEL_PARAMETER} score --features_file ${LOCAL_TOKEN_DIR}/${TRAIN_SRC_TOK} --predictions_file ${LOCAL_TOKEN_DIR}/${TRAIN_TGT_TOK} > ${PREDICTION_FILE}
[Modelarts Service Log]Training completed.

The configuration is:

data:
  source_vocabulary: /cache/MT/vocab/shrd5k-v0/en-ps.ps.vocab
  target_vocabulary: /cache/MT/vocab/shrd5k-v0/en-ps.en.vocab
eval:
  batch_size: 32
  batch_type: examples
  length_bucket_width: 5
infer:
  batch_size: 150
  batch_type: examples
  length_bucket_width: 5
model_dir: /cache/MT/model
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: NoamDecay
  label_smoothing: 0.1
  learning_rate: 2.0
  num_hypotheses: 1
  optimizer: LazyAdam
  optimizer_params:
    beta_1: 0.9
    beta_2: 0.998
  replace_unknown_target: true
  sampling_temperature: 1
  sampling_topk: 1
score:
  batch_size: 64
  batch_type: examples
  length_bucket_width: 5
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 500000
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_summary_steps: 100

Is this expected given the configuration?

guillaumekln commented 2 years ago

Maybe an empty line in the data file? The score task does not apply any filtering.
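
A quick way to check for empty lines (hypothetical file names):

grep -n '^$' train.src.tok train.tgt.tok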

FPBHW commented 2 years ago

For at least one job with this error I am certain that there are no empty lines. If I have updates on this crash, shall I post them in #523?

guillaumekln commented 2 years ago

Yes, please post updates on the other issue.

I'm going to close this one as there is not much to add about the NaN loss. Feel free to create other issues if you encounter other problems.