OrenLeung opened this issue 1 month ago
@OrenLeung This issue was due to our dev branch not yet having all the recent DDP and FSDP optimizations from NVTE. We have a PR in review (https://github.com/ROCm/TransformerEngine/pull/66) that should be merged soon and should resolve this issue. Here are the numbers that I got with this PR:
8xMI300X DDP FP8 TE (batch size 28): 315 TFLOPs
8xMI300X DDP FP8 TE (batch size 32): 318 TFLOPs
8xMI300X DDP FP8 TE (batch size 42): 324 TFLOPs
Hi @wenchenvincent,
Thanks for looking into this. These results look much better. Can you provide the Dockerfile instructions for me to reproduce these results?
To be competitive with H100 on a perf-per-TCO basis, MI300X needs to hit 398 TFLOP/s/GPU (rough break-even math sketched below). Are there any other PRs or optimizations in the pipeline?
cc: @hliuca
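For context on where a number like 398 TFLOP/s/GPU comes from, here is a rough sketch of the perf-per-TCO break-even math; the H100 throughput and TCO ratio below are illustrative placeholders, not measured or quoted values:

```python
# Rough perf-per-TCO break-even math (all inputs are illustrative placeholders).
h100_tflops_per_gpu = 450.0  # placeholder: H100 FP8 throughput on this model
tco_ratio = 0.85             # placeholder: MI300X TCO / H100 TCO, per GPU

# MI300X is TCO-competitive when its throughput per TCO dollar matches H100's:
#   mi300x_tflops / mi300x_tco >= h100_tflops / h100_tco
required_mi300x_tflops = h100_tflops_per_gpu * tco_ratio
print(f"MI300X break-even throughput: {required_mi300x_tflops:.0f} TFLOP/s/GPU")
```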
Here are my preliminary Nvidia results for gpt2-1.5B FP8 full training:
Full response is in the llama3 70B proxy GH issue: https://github.com/ROCm/TransformerEngine/issues/78#issuecomment-2418538437
After #66 was merged to main, I now get 322 TFLOP/s/GPU on this model in our internal codebase:
After 32 Warmup: Mean TFLOP/s: 322.79, Mean MFU: 12.37%
This is similar to @wenchenvincent's TFLOP/s numbers.
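As a sanity check, a minimal sketch of the MFU arithmetic, assuming an MI300X dense FP8 peak of roughly 2.6 PFLOP/s (the peak value is my assumption, not taken from the logs above):

```python
# MFU (Model FLOPs Utilization) = achieved throughput / hardware peak throughput.
achieved_tflops = 322.79   # mean TFLOP/s/GPU after 32 warmup iterations
peak_fp8_tflops = 2610.0   # assumption: MI300X dense FP8 peak, ~2.6 PFLOP/s

mfu = achieved_tflops / peak_fp8_tflops
print(f"MFU ~= {mfu:.2%}")  # ~12.4%, in line with the reported 12.37%
```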
Problem Description
Even with NVTE_USE_HIPBLASLT=1 and installing TE inside the container instead of through the Dockerfile, as suggested in https://github.com/ROCm/TransformerEngine/issues/74#issuecomment-2414845971, FP8 is 25% slower than BF16. Furthermore, it even OOMs at the same batch size that BF16 can fit. On Nvidia H100 with Transformer Engine, I can usually fit an even larger batch size than BF16, and it never OOMs at the same batch size.
The command to run this is python ./train_gpt_ddp_reprod.py, using the reprod script and TE install instructions below; a minimal sketch of the FP8 training setup follows.
cc: @hliuca
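For readers unfamiliar with the TE FP8 API, here is a minimal sketch of the kind of setup the reprod script uses; this is not the actual script, and the model, sizes, batch shape, and optimizer below are placeholders:

```python
# Minimal sketch of TE FP8 training with DDP (illustrative; not the reprod script).
import os

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

torch.distributed.init_process_group(backend="nccl")  # backed by RCCL on ROCm
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder model: a single TE layer standing in for the GPT-2 1.5B blocks.
model = te.Linear(4096, 4096, bias=True).cuda()
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Delayed-scaling FP8 recipe (HYBRID = E4M3 forward, E5M2 backward).
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

inp = torch.randn(28, 4096, device="cuda")  # batch size 28, as in the runs above
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)
loss = out.float().pow(2).mean()  # placeholder loss
loss.backward()
optimizer.step()
```

This would be launched with something like torchrun --nproc_per_node=8 so that LOCAL_RANK is set; note that fp8_autocast only affects TE modules, while plain PyTorch layers stay in BF16/FP32.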
Preliminary results (TFLOP/s/GPU):
On this model with DDP, H100 saw a 16% increase in TFLOP/s/GPU from using FP8.
Operating System
Ubuntu
CPU
AMD CPU
GPU
MI300X
ROCm Version
ROCm 6.2.0
ROCm Component
No response
Steps to Reproduce
Docker Image
TE Install Instructions (done inside the Docker container)
Reprod Script
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response