NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

[Paddle][CUDAGraph] 175B GPT-3 Hybrid-Parallel Training with CUDAGraph #957

Closed eee4017 closed 14 hours ago

eee4017 commented 1 week ago

Description

In this PR, we introduce support for CUDAGraph in TE-PaddlePaddle. The primary issue with CUDAGraph is managing branching: for example, when weight_cache is enabled, certain operations are required only in the first microbatch, yet branching inside a captured CUDAGraph is undesirable.

Solutions to the Branching Problem in CUDAGraph

Solution 1: Utilizing Multiple Graphs (TE-PyTorch Solution)

TE-PyTorch addresses branching by recording separate graphs: one for the true branch and another for the false branch. This requires maintaining distinct CUDA graphs for each microbatch, and it becomes complex under the Pipeline Parallelism mechanism because the number of required graphs doubles at every branching point; N branching points can require up to 2^N graphs, which is difficult to manage. A sketch of this approach follows.
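As a minimal illustration of this approach (a CUDA runtime sketch of the general idea, not TE-PyTorch's actual implementation, which builds on PyTorch's CUDA graph capture), each branch outcome is captured into its own executable graph and the branch is taken on the host by choosing which graph to replay. The kernel names here are placeholders, and the CUDA 12-style `cudaGraphInstantiate` signature is assumed.

```cpp
#include <cuda_runtime.h>

__global__ void cast_weights_to_fp8() { /* placeholder: first-microbatch-only work */ }
__global__ void transformer_layer()   { /* placeholder: branch-free layer body     */ }

// Capture one executable graph per branch outcome; the branch is resolved at capture time.
static cudaGraphExec_t capture_branch(cudaStream_t stream, bool first_microbatch) {
  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  if (first_microbatch) {
    cast_weights_to_fp8<<<1, 32, 0, stream>>>();
  }
  transformer_layer<<<1, 32, 0, stream>>>();
  cudaStreamEndCapture(stream, &graph);

  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12-style signature assumed
  cudaGraphDestroy(graph);
  return exec;
}

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // One graph per branch outcome; N independent branch points would need up to 2^N graphs.
  cudaGraphExec_t graph_first = capture_branch(stream, true);
  cudaGraphExec_t graph_rest  = capture_branch(stream, false);

  for (int microbatch = 0; microbatch < 4; ++microbatch) {
    cudaGraphLaunch(microbatch == 0 ? graph_first : graph_rest, stream);
  }
  cudaStreamSynchronize(stream);
  return 0;
}
```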

Solution 2: Reordering Kernel Sequences

To simplify the process, we reorder the kernel sequence so that the branch-dependent kernels are issued outside the captured graph, keeping all branching out of the graph's scope. This avoids managing multiple graphs while still executing the bulk of the work efficiently within a single CUDAGraph, as sketched below.
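Below is a minimal sketch of the reordering idea, again using placeholder kernel names rather than TE-Paddle symbols and assuming the CUDA 12 runtime API. The branch-dependent cast runs eagerly, outside the captured region, so a single branch-free graph is replayed for every microbatch.

```cpp
#include <cuda_runtime.h>

__global__ void cast_weights_to_fp8() { /* placeholder: first-microbatch-only work */ }
__global__ void transformer_layer()   { /* placeholder: branch-free layer body     */ }

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Capture only the branch-free portion, once.
  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  transformer_layer<<<1, 32, 0, stream>>>();
  cudaStreamEndCapture(stream, &graph);

  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12-style signature assumed

  for (int microbatch = 0; microbatch < 4; ++microbatch) {
    if (microbatch == 0) {
      // The branch lives outside the graph, so no per-branch graphs are needed.
      cast_weights_to_fp8<<<1, 32, 0, stream>>>();
    }
    cudaGraphLaunch(exec, stream);  // same graph replayed for every microbatch
  }
  cudaStreamSynchronize(stream);
  return 0;
}
```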

Changes Introduced in this PR

Testing

Although we have not yet implemented unit tests, integration tests have been validated. The following parallelism configurations are supported:

  1. BF16: PP / TP / PP+TP / PP+TP+SP
  2. FP8: PP / TP / PP+TP / PP+TP+SP

Performance Testing on GPT-3 1.3B with 4 H100 GPUs

| Model Configuration | CUDAGraph | Loss | Samples/sec | Speedup |
|---|---|---|---|---|
| 1.3B, BF16, PP2+TP2 | N | 9.44 | 28.43 | 0.00% |
| 1.3B, BF16, PP2+TP2 | Y | 9.44 | 55.32 | 94.54% |
| 1.3B, BF16, PP2+TP2+SP | N | 9.44 | 24.38 | 0.00% |
| 1.3B, BF16, PP2+TP2+SP | Y | 9.44 | 50.12 | 105.54% |
| 1.3B, BF16, PP4 | N | 9.59 | 53.81 | 0.00% |
| 1.3B, BF16, PP4 | Y | 9.59 | 68.80 | 27.87% |
| 1.3B, BF16, TP4 | N | 9.69 | 14.54 | 0.00% |
| 1.3B, BF16, TP4 | Y | 9.69 | 37.76 | 159.65% |
| 1.3B, FP8, PP2+TP2 | N | 9.44 | 18.42 | 0.00% |
| 1.3B, FP8, PP2+TP2 | Y | 9.44 | 55.84 | 203.06% |
| 1.3B, FP8, PP2+TP2+SP | N | 9.47 | 16.26 | 0.00% |
| 1.3B, FP8, PP2+TP2+SP | Y | 9.47 | 51.83 | 218.71% |
| 1.3B, FP8, PP4 | N | 9.59 | 40.73 | 0.00% |
| 1.3B, FP8, PP4 | Y | 9.59 | 76.31 | 87.37% |

Performance Testing on GPT-3 175B with 64 H100 GPUs

We achieved convergence within the first 1024 steps on an 8-node cluster.

[image: training loss curve]

eee4017 commented 1 week ago

[diagram: fp8 drawio]

We have two operations that have a branching issue:

  1. FP8 Weight Casting in Linear Layers

    • This kernel has no data dependency on other operations, so the branch can be resolved by reordering the cast outside the CUDAGraph.
  2. Update Scale Inverse in Amax and Scale Update

    • The amax_and_scale_update_inplace kernel has data dependencies on preceding kernels, so it is unsuitable for reordering: moving it out would pull many other kernels out of the CUDAGraph and shrink the graph's coverage.
    • The only input that changes is the update_weight_scale_inv boolean, which can be derived from the step ID. We handle it with the set-parameter mechanism, resetting this boolean each time the graph is launched (see the sketch below).
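As a rough illustration of how a set-parameter mechanism can be built on the CUDA 12 runtime API (a sketch of the general technique, not necessarily how TE-Paddle implements it), the captured amax/scale-update kernel stays inside the graph and only its update_weight_scale_inv argument is patched in the instantiated graph before each replay, based on the step ID. The kernel body and variable names here are placeholders.

```cpp
#include <cuda_runtime.h>
#include <vector>

__global__ void amax_and_scale_update(float* scale_inv, bool update_weight_scale_inv) {
  if (update_weight_scale_inv) scale_inv[0] = 1.0f / scale_inv[0];  // placeholder math
}

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  float* scale_inv;
  cudaMalloc(&scale_inv, sizeof(float));

  // Capture the kernel once, with a dummy value for the boolean argument.
  bool update_flag = false;
  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  amax_and_scale_update<<<1, 1, 0, stream>>>(scale_inv, update_flag);
  cudaStreamEndCapture(stream, &graph);

  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12-style signature assumed

  // Locate the single kernel node captured above and read back its launch params.
  size_t num_nodes = 0;
  cudaGraphGetNodes(graph, nullptr, &num_nodes);
  std::vector<cudaGraphNode_t> nodes(num_nodes);
  cudaGraphGetNodes(graph, nodes.data(), &num_nodes);
  cudaKernelNodeParams params;
  cudaGraphKernelNodeGetParams(nodes[0], &params);

  for (int step = 0; step < 4; ++step) {
    update_flag = (step == 0);                 // boolean derived from the step ID
    void* args[] = { &scale_inv, &update_flag };
    params.kernelParams = args;
    // Patch the argument in the instantiated graph instead of re-capturing it.
    cudaGraphExecKernelNodeSetParams(exec, nodes[0], &params);
    cudaGraphLaunch(exec, stream);
  }
  cudaStreamSynchronize(stream);
  return 0;
}
```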
timmoon10 commented 1 week ago

/te-ci paddle

timmoon10 commented 6 days ago

The CI is failing while building since it can't find glog:

In file included from /opt/transformerengine/transformer_engine/paddle/csrc/custom_ops.cu:13:
/usr/local/lib/python3.10/dist-packages/paddle/include/paddle/phi/backends/gpu/cuda/cuda_graph.h:30:10: fatal error: glog/logging.h: No such file or directory
   30 | #include "glog/logging.h"
zlsh80826 commented 2 days ago

/te-ci paddle

timmoon10 commented 1 day ago

/te-ci paddle

zlsh80826 commented 1 day ago

/te-ci paddle