zigzagcai opened 2 weeks ago
Current Status:

- FP8 dynamic scaling + torch.compile, delivering a 10%~15% throughput speedup in the test case of a 7B InternLM1 model (layers reduced from 32 to 16 so it is runnable on 1 node with 2x GPUs) on 2x H100 GPUs.
- FP8 delayed scaling (input, weight) + torch.compile, also delivering a 10%~15% throughput speedup.
- FP8 delayed scaling (input, weight, grad) + torch.compile: there seems to be an error from the torch compiler in the backward pass of the all_reduce of amax values. Currently waiting for a response from the torchao developers, or trying to solve this issue on my own.
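The two recipes above can be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: the helper name `enable_fp8` is invented here, while `convert_to_float8_training`, `Float8LinearConfig`, `CastConfig`, and `ScalingType` follow the torchao float8 API. FP8 training needs torchao and an H100-class GPU, so everything is guarded and falls back to the unmodified model otherwise.

```python
def enable_fp8(model, delayed=False):
    """Return (model, enabled); swaps nn.Linear layers to float8 when possible.

    A sketch assuming the torchao.float8 API; `enable_fp8` itself is a
    hypothetical wrapper, not part of torchao or InternLM.
    """
    try:
        import torch
        from torchao.float8 import (
            CastConfig,
            Float8LinearConfig,
            ScalingType,
            convert_to_float8_training,
        )
    except ImportError:
        # torchao (or torch) not installed: leave the model as-is.
        return model, False
    if not torch.cuda.is_available():
        # FP8 matmuls need Hopper-class hardware.
        return model, False
    if delayed:
        # Delayed scaling keeps an amax history per tensor; that history is
        # synchronized across ranks (all_reduce of amax values) once per
        # step, which is where the torch.compile issue above shows up.
        cfg = Float8LinearConfig(
            cast_config_input=CastConfig(scaling_type=ScalingType.DELAYED),
            cast_config_weight=CastConfig(scaling_type=ScalingType.DELAYED),
        )
    else:
        cfg = Float8LinearConfig()  # dynamic scaling is the default
    model = convert_to_float8_training(model, config=cfg)
    return torch.compile(model), True
```

With delayed scaling, the training loop additionally has to call torchao's `sync_float8_amax_and_scale_history(model)` once per step before the optimizer step; that is the amax all_reduce referenced in the third status bullet.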
Motivation
Try to enable FP8 to speed up training on the Hopper platform, via torchao.

Modification

- internlm/core
- internlm/quantization
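One decision a quantization module like this typically has to make is which Linear layers to convert. The sketch below shows a filter in the shape torchao expects for its `module_filter_fn` argument (a callable taking the module and its fully-qualified name); the function name and the choice to skip the output head are assumptions for illustration, not taken from the actual InternLM code.

```python
def fp8_module_filter_fn(module, fqn):
    """Return True if `module` should be converted to Float8Linear.

    Hypothetical filter: the "head" check and the name of this function are
    assumptions, not part of torchao or this PR.
    """
    # Keep the final projection in high precision (common practice).
    if "head" in fqn:
        return False
    # Only Linear-like modules qualify; FP8 kernels also require the
    # in/out feature dimensions to be divisible by 16.
    in_f = getattr(module, "in_features", None)
    out_f = getattr(module, "out_features", None)
    if in_f is None or out_f is None:
        return False
    return in_f % 16 == 0 and out_f % 16 == 0
```

Such a filter would be passed as `convert_to_float8_training(model, module_filter_fn=fp8_module_filter_fn)`.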
BC-breaking (Optional)
None
Use cases (Optional)
None
Checklist
Before PR:
After PR: