NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

initialize_ub failed: transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:208 in function create_communicator_grouped2: CUDA Error: operation not supported #991

Closed: liuhatry closed this issue 16 hours ago

liuhatry commented 3 days ago

Machine

NVIDIA H800, NVIDIA-SMI 525.105.17, Driver Version: 525.105.17, CUDA Version: 12.2

Software

torch 2.1.1, transformer-engine 1.9.0.dev0+56e0b35

Reproduce:

python3 -m torch.distributed.run --nproc-per-node=8 examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py --num-iters=1000

LOG

```
Traceback (most recent call last):
  File "examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py", line 166, in <module>
    train(args)
  File "examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py", line 98, in train
    te.initialize_ub(
  File "/usr/local/python/lib/python3.8/site-packages/transformer_engine/pytorch/module/base.py", line 269, in initialize_ub
    add_ub(
  File "/usr/local/python/lib/python3.8/site-packages/transformer_engine/pytorch/module/base.py", line 181, in add_ub
    ub_obj = tex.UbufP2PCommOverlap(
RuntimeError: /jizhicfs/macroliu/ptm_code/tpoverlap_pai/TransformerEngine_official/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:208 in function create_communicator_grouped2: CUDA Error: operation not supported
```

denera commented 3 days ago

Hi @liuhatry — the error message you’re seeing is accurate. Comm+GEMM overlap uses CUDA Multicast operations that are not supported on driver version 525. I believe you need at least 535 for these (this is also the driver version officially paired with CUDA 12.2).
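For a quick sanity check before launching the example, something like the pynvml snippet below can catch the old driver up front. This is a rough sketch, not part of TE; pynvml and the exact 535 cutoff are assumptions based on the explanation above.

```python
# Rough sketch (not TE code): verify the installed driver before enabling
# comm+GEMM overlap. The 535 threshold comes from the discussion above.
import pynvml

pynvml.nvmlInit()
driver = pynvml.nvmlSystemGetDriverVersion()
pynvml.nvmlShutdown()

# Older pynvml releases return bytes, newer ones return str.
if isinstance(driver, bytes):
    driver = driver.decode()

if int(driver.split(".")[0]) < 535:
    raise RuntimeError(
        f"Driver {driver} lacks the CUDA Multicast support needed for "
        "comm+GEMM overlap; please install r535 or newer."
    )
```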

The underlying Userbuffers communicator has a CUDA IPC code path for older platforms but it’s not supported in TE/main at the moment. If you can’t update your system to a newer driver, I can try to provide a test branch tomorrow or early next week that enables support for this fallback option.

Also important: comm overlap in TE requires participating devices to be on the same NVLink interconnect. This is a pure hardware prerequisite independent of any driver or CUDA Toolkit version.
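The simplest way to confirm this is `nvidia-smi topo -m`, which prints the interconnect between every GPU pair. The sketch below does roughly the same check programmatically; it is not part of TE and assumes your pynvml build exposes nvmlDeviceGetP2PStatus and the NVLink caps index.

```python
# Rough sketch (not TE code): report GPU pairs that do not have an NVLink
# P2P path, since comm+GEMM overlap requires all participants on NVLink.
import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]

for i in range(count):
    for j in range(count):
        if i == j:
            continue
        status = pynvml.nvmlDeviceGetP2PStatus(
            handles[i], handles[j], pynvml.NVML_P2P_CAPS_INDEX_NVLINK
        )
        if status != pynvml.NVML_P2P_STATUS_OK:
            print(f"GPU {i} <-> GPU {j}: no NVLink path (status {status})")

pynvml.nvmlShutdown()
```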

liuhatry commented 16 hours ago

Upgrading to driver version 535 solved the problem.