Open YYY-MMW opened 7 months ago
Also, if I set fp16 to True I get:
Traceback (most recent call last):
File "tasks/train_it.py", line 213, in <module>
main(cfg)
File "tasks/train_it.py", line 161, in main
global_step = train(
File "tasks/train_it.py", line 67, in train
scaler.step(optimizer)
File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/grad_scaler.py", line 446, in step
self.unscale_(optimizer)
File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/grad_scaler.py", line 336, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/grad_scaler.py", line 258, in _unscale_grads_
raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
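For context, this ValueError is raised by GradScaler when the gradients it tries to unscale are already fp16, which usually means the model weights themselves were cast to fp16 (e.g. via `model.half()` or loading with `torch_dtype=torch.float16`). A minimal sketch of the pattern GradScaler expects, with placeholder model and optimizer rather than this repo's actual code:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Minimal sketch, not train_it.py: GradScaler assumes fp32 parameters.
model = torch.nn.Linear(16, 16).cuda()          # keep weights in fp32
# model = model.half()  # casting weights to fp16 here is what triggers
#                       # "Attempting to unscale FP16 gradients."
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = GradScaler()

x = torch.randn(4, 16, device="cuda")
with autocast(dtype=torch.float16):             # fp16 only inside autocast
    loss = model(x).float().mean()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```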
You can try setting the language model to bf16 and changing the mixed-precision dtype to bf16 as well; fp16 tends to produce NaN in some cases.
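A minimal bf16 version of that advice, again with placeholder model and optimizer (not the project's code). Since bf16 has the same exponent range as fp32, the GradScaler from the traceback can typically be dropped:

```python
import torch
from torch.cuda.amp import autocast

# Hypothetical stand-ins for the model/optimizer used in train_it.py.
model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(4, 16, device="cuda")
optimizer.zero_grad(set_to_none=True)
with autocast(dtype=torch.bfloat16):   # bf16 keeps fp32's dynamic range,
    loss = model(x).float().mean()     # so loss scaling is usually unnecessary
loss.backward()
optimizer.step()
```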
I'm running into the same problem. Did you manage to solve it?
According to the author's reply, after switching to bf16 the loss should no longer be NaN.
torch_dtype=torch.bfloat16
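If the language model is loaded through Hugging Face transformers, the dtype can be passed at load time. The checkpoint name below is a placeholder, not the one this project actually uses:

```python
import torch
from transformers import AutoModelForCausalLM

# "some-org/llm-checkpoint" is a placeholder; substitute the model used by this repo.
model = AutoModelForCausalLM.from_pretrained(
    "some-org/llm-checkpoint",
    torch_dtype=torch.bfloat16,   # load the weights directly in bf16
)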
This is my stage 3 config.
Then I get an error: the loss is NaN.
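The stage 3 config itself is not included in the comment above. If "stage 3" refers to a DeepSpeed ZeRO stage-3 setup, the relevant part is usually enabling bf16 and leaving fp16 disabled, so DeepSpeed does not add fp16 loss scaling on top of bf16 weights. The values below are illustrative assumptions, not the poster's actual settings:

```python
# Illustrative ZeRO stage-3 settings only; not the poster's real ds_config.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},    # train in bf16
    "fp16": {"enabled": False},   # make sure fp16 loss scaling stays off
}
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```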