A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
Hello!

I'm trying to implement multi-stage training with `fp8_autocast`. However, when I load the checkpoint from the first training stage using torch's `load_state_dict(...)`, the loss quickly explodes.

Are there any global FP8 states that also need to be saved/restored? The optimizer state and grad scaler state are restored, and the same behavior is not reproduced with AMP FP16 training.
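For reference, here is a minimal sketch (not the issue author's code) of a two-stage FP8 checkpoint round-trip with TransformerEngine. It assumes a TE version in which the FP8 scaling factors and amax history are serialized as part of each TE module's `state_dict()` (via the module's extra state); if a checkpoint only carries the weights, the second stage would restart with fresh FP8 scales, which is one plausible cause of a diverging loss. Module sizes, step count, and the recipe settings are illustrative only.

```python
# Sketch of saving/restoring model + optimizer state across two FP8 training
# stages. Assumes the FP8 amax/scale metadata of TE modules travels inside
# model.state_dict(); verify this for the TE version you are running.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(
    fp8_format=recipe.Format.HYBRID,   # E4M3 forward, E5M2 backward
    amax_history_len=16,
    amax_compute_algo="max",
)

model = te.Linear(1024, 1024, bias=True).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# --- stage 1: train a few steps under fp8_autocast ---
for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        loss = model(x).float().pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Save everything needed to resume: the model state_dict should carry the
# FP8 amax/scale buffers alongside the weights, plus the optimizer state
# (and the grad scaler state, if one is used).
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    "stage1.pt",
)

# --- stage 2: rebuild and restore before continuing FP8 training ---
model2 = te.Linear(1024, 1024, bias=True).cuda()
optimizer2 = torch.optim.AdamW(model2.parameters(), lr=1e-4)
ckpt = torch.load("stage1.pt")
model2.load_state_dict(ckpt["model"])          # restores FP8 metadata too
optimizer2.load_state_dict(ckpt["optimizer"])
```

If loading a stage-1 checkpoint like this still makes the loss blow up, checking whether the FP8 extra state actually appears in the saved `state_dict` keys is a quick way to narrow down whether the scaling history is being dropped on restore.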