A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
Hello!

I'm trying to implement multi-stage training with `fp8_autocast`. However, when I load the checkpoint from the first training stage using torch's `load_state_dict(...)`, the loss quickly explodes.

Are there any global FP8 states that also need to be saved/restored? The optimizer state and grad scaler state are restored, and the same behavior is not reproduced with AMP FP16 training.
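For reference, here is a minimal sketch (not the issue author's code) of a two-stage FP8 checkpoint round-trip with TransformerEngine. It assumes a TE version in which the FP8 scaling factors and amax history are serialized as part of each TE module's `state_dict()` (via the module's extra state); if a checkpoint only carries the weights, the second stage would restart with fresh FP8 scales, which is one plausible cause of a diverging loss. Module sizes, step count, and the recipe settings are illustrative only.

```python
# Sketch of saving/restoring model + optimizer state across two FP8 training
# stages. Assumes the FP8 amax/scale metadata of TE modules travels inside
# model.state_dict(); verify this for the TE version you are running.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(
    fp8_format=recipe.Format.HYBRID,   # E4M3 forward, E5M2 backward
    amax_history_len=16,
    amax_compute_algo="max",
)

model = te.Linear(1024, 1024, bias=True).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# --- stage 1: train a few steps under fp8_autocast ---
for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        loss = model(x).float().pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Save everything needed to resume: the model state_dict should carry the
# FP8 amax/scale buffers alongside the weights, plus the optimizer state
# (and the grad scaler state, if one is used).
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    "stage1.pt",
)

# --- stage 2: rebuild and restore before continuing FP8 training ---
model2 = te.Linear(1024, 1024, bias=True).cuda()
optimizer2 = torch.optim.AdamW(model2.parameters(), lr=1e-4)
ckpt = torch.load("stage1.pt")
model2.load_state_dict(ckpt["model"])          # restores FP8 metadata too
optimizer2.load_state_dict(ckpt["optimizer"])
```

If loading a stage-1 checkpoint like this still makes the loss blow up, checking whether the FP8 extra state actually appears in the saved `state_dict` keys is a quick way to narrow down whether the scaling history is being dropped on restore.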