Closed: eee4017 closed this 14 hours ago
We have two operations that have a branching issue:

1. FP8 Weight Casting in Linear Layers
2. Update Scale Inverse in Amax and Scale Update

The `amax_and_scale_update_inplace` kernel has a data dependency on previous kernels, making it unsuitable for reordering: reordering would move many kernels out of the CUDAGraph, reducing the graph's coverage.

`update_weight_scale_inv` is a boolean parameter, and its value can be determined from the step ID. We only need to update `update_weight_scale_inv`; this can be addressed with the set parameter mechanism, where we reset this boolean each time the graph is launched.

/te-ci paddle
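As a rough illustration of the set parameter idea, here is a plain-Python simulation (not the actual Paddle API; `CapturedGraph` and the state names are hypothetical):

```python
class CapturedGraph:
    """Toy stand-in for a captured CUDA graph: a fixed list of kernels
    replayed without any Python-level branching."""

    def __init__(self):
        self.kernels = []

    def capture(self, kernel):
        self.kernels.append(kernel)

    def replay(self, state):
        for kernel in self.kernels:
            kernel(state)


def fp8_weight_cast(state):
    # The branch lives *inside* the kernel: it reads a flag buffer at
    # replay time, so the captured kernel sequence itself never branches.
    if state["update_weight_scale_inv"]:
        state["weight_scale_inv"] = 1.0 / state["amax"]


graph = CapturedGraph()
graph.capture(fp8_weight_cast)

state = {"amax": 4.0, "weight_scale_inv": None,
         "update_weight_scale_inv": True}

for step in range(3):
    # Set parameter mechanism: the host resets the boolean before every
    # launch, based on the step ID (here: first microbatch only).
    state["update_weight_scale_inv"] = (step == 0)
    graph.replay(state)

print(state["weight_scale_inv"])  # 0.25, written only on step 0
```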
The CI is failing during the build because it can't find glog:
In file included from /opt/transformerengine/transformer_engine/paddle/csrc/custom_ops.cu:13:
/usr/local/lib/python3.10/dist-packages/paddle/include/paddle/phi/backends/gpu/cuda/cuda_graph.h:30:10: fatal error: glog/logging.h: No such file or directory
30 | #include "glog/logging.h"
/te-ci paddle
/te-ci paddle
/te-ci paddle
Description
In this PR, we introduce support for CUDAGraph in TE-PaddlePaddle. The primary issue with CUDAGraph is managing branching: for example, when `weight_cache` is enabled, specific operations are required only in the first microbatch, yet branching within a CUDAGraph is undesirable.

Solutions to the Branching Problem in CUDAGraph
Solution 1: Utilizing Multiple Graphs (TE-PyTorch Solution)
TE-PyTorch addresses branching by recording separate graphs: one for the true branch and another for the false branch. This requires maintaining distinct CUDA graphs for each microbatch. However, it becomes complex, especially with the Pipeline Parallelism mechanism, since the number of required graphs doubles at each branching point: managing 2^N graphs for N branchings can lead to significant challenges.
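The graph-count growth is easy to enumerate (illustrative only; `graphs_needed` is a hypothetical helper, not part of TE):

```python
from itertools import product

def graphs_needed(num_branch_points):
    # Solution 1 records one graph per distinct combination of branch
    # outcomes, so N independent branch points require 2**N graphs.
    return len(list(product([False, True], repeat=num_branch_points)))

for n in range(1, 5):
    print(n, graphs_needed(n))  # 1 2, 2 4, 3 8, 4 16
```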
Solution 2: Reordering Kernel Sequences
To simplify the process, we reorder the kernel sequences outside the computational graph, keeping the branching outside the graph's scope. This approach avoids the complexity of managing multiple graphs and ensures efficient execution within CUDAGraph.
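A minimal sketch of this reordering, under assumed names (`run_step` and `first_microbatch_ops` are illustrative, not the actual TE-Paddle code):

```python
def run_step(step, replay_graph, first_microbatch_ops):
    # Branch-dependent kernels are hoisted outside the captured region,
    # so the graph body stays branch-free and one graph covers all steps.
    if step == 0:
        for op in first_microbatch_ops:
            op()        # issued eagerly, outside the CUDA graph
    replay_graph()      # replay the single branch-free graph

calls = []
for step in range(3):
    run_step(step,
             replay_graph=lambda: calls.append("graph"),
             first_microbatch_ops=[lambda: calls.append("fp8_weight_cast")])

print(calls)  # ['fp8_weight_cast', 'graph', 'graph', 'graph']
```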
Changes Introduced in this PR

- Handle the RNG state update (`set_rng_state`) using the set parameter mechanism in Paddle.
- Handle `amax_and_scale_update_inplace` with the set parameter mechanism in Paddle. This kernel is manually issued, preserving the legacy kernel for now while we explore other solutions.

Type of Change
Checklist
Testing
Although we have not yet implemented unit tests, integration tests have been validated. The following parallelism configurations are supported:
Performance Testing on GPT-3 1.3B on 4 H100 GPUs
Performance Testing on GPT-3 175B with 64 H100 GPUs
We achieved convergence within the first 1024 steps on an 8-node cluster.