Hi @liuhatry — the error message you’re seeing is accurate. Comm+GEMM overlap uses CUDA Multicast operations that are not supported on driver version 525. I believe you need at least 535 for these (this is also the driver version officially paired with CUDA 12.2).
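You can query the exact capability the communicator needs at runtime. A minimal sketch using the cuda-python driver bindings (the `CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED` attribute is defined from CUDA 12.1 on; error handling is simplified, and on much older drivers the query itself may fail rather than return 0):

```python
from cuda import cuda  # pip install cuda-python (CUDA 12.x bindings assumed)

(err,) = cuda.cuInit(0)
assert err == cuda.CUresult.CUDA_SUCCESS

err, dev = cuda.cuDeviceGet(0)
# Returns 1 only when both the device and the driver support CUDA
# Multicast; with a 525 driver this should come back 0 even on H800.
err, supported = cuda.cuDeviceGetAttribute(
    cuda.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED, dev
)
print(f"CUDA Multicast supported: {bool(supported)}")
```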
The underlying Userbuffers communicator has a CUDA IPC code path for older platforms but it’s not supported in TE/main at the moment. If you can’t update your system to a newer driver, I can try to provide a test branch tomorrow or early next week that enables support for this fallback option.
Also important: comm overlap in TE requires participating devices to be on the same NVLink interconnect. This is a pure hardware prerequisite independent of any driver or CUDA Toolkit version.
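If you want to sanity-check the interconnect before digging further, here is a rough sketch (uses torch's peer-access query plus NVML link state; pynvml/nvidia-ml-py being installed is an assumption, and this is a heuristic rather than TE's actual check):

```python
import torch
import pynvml  # nvidia-ml-py; assumed available on the node

# Peer access is necessary for comm overlap, but P2P can also run over
# PCIe, so this alone does not prove an NVLink connection.
for peer in range(1, torch.cuda.device_count()):
    ok = torch.cuda.can_device_access_peer(0, peer)
    print(f"GPU0 <-> GPU{peer} peer access: {ok}")

# Count active NVLink links on GPU 0 via NVML.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
active = 0
for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
    try:
        if pynvml.nvmlDeviceGetNvLinkState(handle, link):
            active += 1
    except pynvml.NVMLError:
        break  # ran past the links this device actually has
print(f"GPU0 active NVLink links: {active}")
pynvml.nvmlShutdown()
```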
Upgrading to driver 535 solved the problem.
Machine
NVIDIA H800
NVIDIA-SMI 525.105.17, Driver Version: 525.105.17, CUDA Version: 12.2
Software
torch 2.1.1
transformer-engine 1.9.0.dev0+56e0b35
Reproduce:
python3 -m torch.distributed.run --nproc-per-node=8 examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py --num-iters=1000
Log
Traceback (most recent call last):
  File "examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py", line 166, in <module>
    train(args)
  File "examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py", line 98, in train
    te.initialize_ub(
  File "/usr/local/python/lib/python3.8/site-packages/transformer_engine/pytorch/module/base.py", line 269, in initialize_ub
    add_ub(
  File "/usr/local/python/lib/python3.8/site-packages/transformer_engine/pytorch/module/base.py", line 181, in add_ub
    ub_obj = tex.UbufP2PCommOverlap(
RuntimeError: /jizhicfs/macroliu/ptm_code/tpoverlap_pai/TransformerEngine_official/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:208 in function create_communicator_grouped2: CUDA Error: operation not supported
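For reference, the call that fails is the example's Userbuffers setup. A minimal sketch of that step, assuming a torchrun-launched NCCL process group; the buffer shape below is illustrative only, and the initialize_ub keyword arguments follow TE 1.x and may differ in other versions:

```python
import torch
import transformer_engine.pytorch as te

# Assumes launch via torch.distributed.run, as in the reproduce command.
torch.distributed.init_process_group(backend="nccl")
torch.cuda.set_device(torch.distributed.get_rank() % torch.cuda.device_count())

# initialize_ub() allocates the Userbuffers communicator; on driver 525
# this is the step that raises "operation not supported". The shape is
# [sequence length * micro-batch, hidden size] -- values here are made up.
te.initialize_ub(
    [2048 * 2, 12288],
    torch.distributed.get_world_size(),  # tensor-parallel group size
    use_fp8=False,
    dtype=torch.bfloat16,
)
```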