NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

initialize_ub failed: transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:208 in function create_communicator_grouped2: CUDA Error: operation not supported #991

Closed: liuhatry closed this issue 16 hours ago

liuhatry commented 3 days ago

Machine

NVIDIA H800, NVIDIA-SMI 525.105.17, Driver Version: 525.105.17, CUDA Version: 12.2

Software

torch 2.1.1, transformer-engine 1.9.0.dev0+56e0b35

Reproduce:

python3 -m torch.distributed.run --nproc-per-node=8 examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py --num-iters=1000

LOG

```
Traceback (most recent call last):
  File "examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py", line 166, in <module>
    train(args)
  File "examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py", line 98, in train
    te.initialize_ub(
  File "/usr/local/python/lib/python3.8/site-packages/transformer_engine/pytorch/module/base.py", line 269, in initialize_ub
    add_ub(
  File "/usr/local/python/lib/python3.8/site-packages/transformer_engine/pytorch/module/base.py", line 181, in add_ub
    ub_obj = tex.UbufP2PCommOverlap(
RuntimeError: /jizhicfs/macroliu/ptm_code/tpoverlap_pai/TransformerEngine_official/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:208 in function create_communicator_grouped2: CUDA Error: operation not supported
```

denera commented 3 days ago

Hi @liuhatry — the error message you’re seeing is accurate. Comm+GEMM overlap uses CUDA Multicast operations that are not supported on driver version 525. I believe you need at least 535 for these (this is also the driver version officially paired with CUDA 12.2).
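For a quick sanity check before launching the example, something like the pynvml snippet below can catch the old driver up front. This is a rough sketch, not part of TE; pynvml and the exact 535 cutoff are assumptions based on the explanation above.

```python
# Rough sketch (not TE code): verify the installed driver before enabling
# comm+GEMM overlap. The 535 threshold comes from the discussion above.
import pynvml

pynvml.nvmlInit()
driver = pynvml.nvmlSystemGetDriverVersion()
pynvml.nvmlShutdown()

# Older pynvml releases return bytes, newer ones return str.
if isinstance(driver, bytes):
    driver = driver.decode()

if int(driver.split(".")[0]) < 535:
    raise RuntimeError(
        f"Driver {driver} lacks the CUDA Multicast support needed for "
        "comm+GEMM overlap; please install r535 or newer."
    )
```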

The underlying Userbuffers communicator has a CUDA IPC code path for older platforms but it’s not supported in TE/main at the moment. If you can’t update your system to a newer driver, I can try to provide a test branch tomorrow or early next week that enables support for this fallback option.

Also important: comm overlap in TE requires participating devices to be on the same NVLink interconnect. This is a pure hardware prerequisite independent of any driver or CUDA Toolkit version.
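The simplest way to confirm this is `nvidia-smi topo -m`, which prints the interconnect between every GPU pair. The sketch below does roughly the same check programmatically; it is not part of TE and assumes your pynvml build exposes nvmlDeviceGetP2PStatus and the NVLink caps index.

```python
# Rough sketch (not TE code): report GPU pairs that do not have an NVLink
# P2P path, since comm+GEMM overlap requires all participants on NVLink.
import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]

for i in range(count):
    for j in range(count):
        if i == j:
            continue
        status = pynvml.nvmlDeviceGetP2PStatus(
            handles[i], handles[j], pynvml.NVML_P2P_CAPS_INDEX_NVLINK
        )
        if status != pynvml.NVML_P2P_STATUS_OK:
            print(f"GPU {i} <-> GPU {j}: no NVLink path (status {status})")

pynvml.nvmlShutdown()
```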

liuhatry commented 16 hours ago

Upgrading to driver version 535 solved the problem.