aws-neuron / aws-neuron-samples

Example code for AWS Neuron SDK developers building inference and training applications

Loss NaN results for run_clm #20

Open modestcigit opened 11 months ago

modestcigit commented 11 months ago

Got run_clm.py to compile on a trn1.32xlarge and also run the actual training. However, it reports NaN loss and NaN perplexity results. Has this been observed? The directions I followed are from here

/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/numpy/core/_methods.py:178: RuntimeWarning: invalid value encountered in reduce
  ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
100%|██████████| 2/2 [00:00<00:00,  2.55it/s]
***** eval metrics *****
  epoch                   =        3.0
  eval_loss               =        nan
  eval_runtime            = 0:00:07.21
  eval_samples            =        240
  eval_samples_per_second =      33.28
  eval_steps_per_second   =      0.277
  perplexity              =        nan
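
Note that run_clm.py derives perplexity from exp(eval_loss), so a NaN evaluation loss makes the reported perplexity NaN as well; the loss itself is the thing to debug. Below is a minimal sketch, assuming the standard Hugging Face Trainer/TrainerCallback API (the callback name and print message are illustrative), of a callback that stops training at the first non-finite logged loss so the offending step can be localized:

import math
from transformers import TrainerCallback

class NanLossDetector(TrainerCallback):
    # Stops training at the first logged loss that is NaN or Inf,
    # so the step where values diverge can be inspected.
    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and not math.isfinite(loss):
            print(f"Non-finite loss at global step {state.global_step}: {loss}")
            control.should_training_stop = True
        return control

# Usage (hypothetical): trainer.add_callback(NanLossDetector())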
aws-donkrets commented 6 months ago

Hi modestcigit - it seems this issue has been open for a while without a response. If you are still interested in getting this model to work on a trn1 instance, I would suggest two things: 1) we make approximately monthly Neuron SDK releases, so download the latest version and check whether you can still reproduce the issue; 2) if the issue still appears in the latest release, try the --enable-saturate-infinity compiler flag when compiling your model.
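
For reference, a minimal sketch of enabling that flag, assuming extra neuronx-cc options are forwarded through the NEURON_CC_FLAGS environment variable (set before the model is traced/compiled so the flag takes effect):

import os

# Assumption: NEURON_CC_FLAGS forwards extra options to the neuronx-cc compiler.
# Append rather than overwrite so any existing flags are preserved.
existing = os.environ.get("NEURON_CC_FLAGS", "")
os.environ["NEURON_CC_FLAGS"] = (existing + " --enable-saturate-infinity").strip()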