aws-neuron / aws-neuron-samples

Example code for AWS Neuron SDK developers building inference and training applications

Loss NaN results for run_clm #20

Open modestcigit opened 11 months ago

modestcigit commented 11 months ago

Got run_clm.py to compile on a trn1.32xlarge and also run the actual training. However, it reports NaN loss and NaN perplexity results. Has this been observed? The directions I followed are from here

/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/numpy/core/_methods.py:178: RuntimeWarning: invalid value encountered in reduce
  ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
100%|██████████| 2/2 [00:00<00:00,  2.55it/s]
***** eval metrics *****
  epoch                   =        3.0
  eval_loss               =        nan
  eval_runtime            = 0:00:07.21
  eval_samples            =        240
  eval_samples_per_second =      33.28
  eval_steps_per_second   =      0.277
  perplexity              =        nan
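
Note that run_clm.py derives perplexity from exp(eval_loss), so a NaN evaluation loss makes the reported perplexity NaN as well; the loss itself is the thing to debug. Below is a minimal sketch, assuming the standard Hugging Face Trainer/TrainerCallback API (the callback name and print message are illustrative), of a callback that stops training at the first non-finite logged loss so the offending step can be localized:

import math
from transformers import TrainerCallback

class NanLossDetector(TrainerCallback):
    # Stops training at the first logged loss that is NaN or Inf,
    # so the step where values diverge can be inspected.
    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and not math.isfinite(loss):
            print(f"Non-finite loss at global step {state.global_step}: {loss}")
            control.should_training_stop = True
        return control

# Usage (hypothetical): trainer.add_callback(NanLossDetector())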
aws-donkrets commented 6 months ago

Hi modestcigit - it seems this issue has been open for a while without a response. If you are still interested in getting this model to work on a trn1 instance, I would suggest two things: 1) we make approximately monthly Neuron SDK releases, so download the latest version and check whether you can still reproduce the issue; 2) if the issue still appears in the latest release, try the --enable-saturate-infinity compiler flag when compiling your model.
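
For reference, a minimal sketch of enabling that flag, assuming extra neuronx-cc options are forwarded through the NEURON_CC_FLAGS environment variable (set before the model is traced/compiled so the flag takes effect):

import os

# Assumption: NEURON_CC_FLAGS forwards extra options to the neuronx-cc compiler.
# Append rather than overwrite so any existing flags are preserved.
existing = os.environ.get("NEURON_CC_FLAGS", "")
os.environ["NEURON_CC_FLAGS"] = (existing + " --enable-saturate-infinity").strip()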