zigzagcai opened 2 weeks ago
Current Status:

- FP8 dynamic scaling + torch.compile, delivering a 10%~15% throughput speedup in the test case of a 7B InternLM1 model (layers reduced from 32 to 16 so it is runnable on 1 node with 2x GPUs) on 2x H100 GPUs.
- FP8 delayed scaling (input, weight) + torch.compile, also delivering a 10%~15% throughput speedup.
- FP8 delayed scaling (input, weight, grad) + torch.compile: there seems to be an error from the torch compiler in the backward pass of the all_reduce of amax values. Currently waiting for a response from the torchao developers, or trying to solve this issue on my own.
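The two recipes above can be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: the helper name `enable_fp8` is invented here, while `convert_to_float8_training`, `Float8LinearConfig`, `CastConfig`, and `ScalingType` follow the torchao float8 API. FP8 training needs torchao and an H100-class GPU, so everything is guarded and falls back to the unmodified model otherwise.

```python
def enable_fp8(model, delayed=False):
    """Return (model, enabled); swaps nn.Linear layers to float8 when possible.

    A sketch assuming the torchao.float8 API; `enable_fp8` itself is a
    hypothetical wrapper, not part of torchao or InternLM.
    """
    try:
        import torch
        from torchao.float8 import (
            CastConfig,
            Float8LinearConfig,
            ScalingType,
            convert_to_float8_training,
        )
    except ImportError:
        # torchao (or torch) not installed: leave the model as-is.
        return model, False
    if not torch.cuda.is_available():
        # FP8 matmuls need Hopper-class hardware.
        return model, False
    if delayed:
        # Delayed scaling keeps an amax history per tensor; that history is
        # synchronized across ranks (all_reduce of amax values) once per
        # step, which is where the torch.compile issue above shows up.
        cfg = Float8LinearConfig(
            cast_config_input=CastConfig(scaling_type=ScalingType.DELAYED),
            cast_config_weight=CastConfig(scaling_type=ScalingType.DELAYED),
        )
    else:
        cfg = Float8LinearConfig()  # dynamic scaling is the default
    model = convert_to_float8_training(model, config=cfg)
    return torch.compile(model), True
```

With delayed scaling, the training loop additionally has to call torchao's `sync_float8_amax_and_scale_history(model)` once per step before the optimizer step; that is the amax all_reduce referenced in the third status bullet.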
Motivation
Try to enable FP8 to speed up training on the Hopper platform, via torchao.

Modification

- internlm/core
- internlm/quantization
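One decision a quantization module like this typically has to make is which Linear layers to convert. The sketch below shows a filter in the shape torchao expects for its `module_filter_fn` argument (a callable taking the module and its fully-qualified name); the function name and the choice to skip the output head are assumptions for illustration, not taken from the actual InternLM code.

```python
def fp8_module_filter_fn(module, fqn):
    """Return True if `module` should be converted to Float8Linear.

    Hypothetical filter: the "head" check and the name of this function are
    assumptions, not part of torchao or this PR.
    """
    # Keep the final projection in high precision (common practice).
    if "head" in fqn:
        return False
    # Only Linear-like modules qualify; FP8 kernels also require the
    # in/out feature dimensions to be divisible by 16.
    in_f = getattr(module, "in_features", None)
    out_f = getattr(module, "out_features", None)
    if in_f is None or out_f is None:
        return False
    return in_f % 16 == 0 and out_f % 16 == 0
```

Such a filter would be passed as `convert_to_float8_training(model, module_filter_fn=fp8_module_filter_fn)`.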
BC-breaking (Optional)
None
Use cases (Optional)
None
Checklist
Before PR:
After PR: