Hi @wpybtw, you can use the following patch to fix it temporarily.
From 06bc52864061da8a9583fe2ddcba38c1e58ee8ca Mon Sep 17 00:00:00 2001
From: Xiwen Yu <xiweny@nvidia.com>
Date: Wed, 30 Oct 2024 09:17:06 +0000
Subject: [PATCH] fix all_reduce benchmark
---
benchmarks/python/all_reduce.py | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/benchmarks/python/all_reduce.py b/benchmarks/python/all_reduce.py
index d91cdd0d4..92f332762 100644
--- a/benchmarks/python/all_reduce.py
+++ b/benchmarks/python/all_reduce.py
@@ -25,7 +25,8 @@ import tensorrt_llm as tllm
from tensorrt_llm import Mapping, Tensor
from tensorrt_llm._utils import OMPI_COMM_TYPE_HOST, mpi_comm
from tensorrt_llm.functional import AllReduceStrategy, allreduce
-from tensorrt_llm.plugin.plugin import current_all_reduce_helper
+from tensorrt_llm.plugin.plugin import (current_all_reduce_helper,
+ init_all_reduce_helper)
def allreduce_benchmark(dtype: str,
@@ -41,7 +42,7 @@ def allreduce_benchmark(dtype: str,
torch.cuda.set_device(local_rank)
cudart.cudaSetDevice(local_rank)
- mapping = Mapping(world_size, rank, gpus_per_node, world_size)
+ mapping = Mapping(world_size, rank, gpus_per_node, tp_size=world_size)
if world_size == 1:
raise RuntimeError("Benchmark must run with mpi_world_size > 1")
@@ -50,6 +51,7 @@ def allreduce_benchmark(dtype: str,
min_size, max_size, ratio = [int(i) for i in test_range.split(",")]
inner_loop = 1000
+ init_all_reduce_helper()
size = min_size
dtype_size = torch.finfo(torch_dtype).bits // 8
if mapping.rank == 0 and not no_header:
@@ -89,12 +91,7 @@ def allreduce_benchmark(dtype: str,
output.dtype = tllm.str_dtype_to_trt(dtype)
build_engine = EngineFromNetwork(
- (builder.trt_builder, net.trt_network),
- config=CreateConfig(
- fp16=(dtype == 'float16'),
- bf16=(dtype == 'bfloat16'),
- precision_constraints='obey',
- ))
+ (builder.trt_builder, net.trt_network), config=CreateConfig())
output = torch.zeros_like(input)
--
2.34.1
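For anyone skimming the patch, the two substantive changes are calling init_all_reduce_helper() before the network is built and passing tp_size to Mapping by keyword (the CreateConfig change only drops the explicit precision constraints). Below is a minimal sketch of that pattern, assuming tensorrt_llm 0.15.0.dev2024102200 and an MPI launch with one rank per GPU; setup_benchmark_mapping is an illustrative name, not part of the benchmark script.

import torch
import tensorrt_llm as tllm
from tensorrt_llm import Mapping
from tensorrt_llm.plugin.plugin import init_all_reduce_helper


def setup_benchmark_mapping():
    # One MPI rank per GPU, as in benchmarks/python/all_reduce.py.
    world_size = tllm.mpi_world_size()
    rank = tllm.mpi_rank()
    gpus_per_node = torch.cuda.device_count()
    torch.cuda.set_device(rank % gpus_per_node)

    # Pass tp_size by keyword: the patch suggests the positional argument
    # after gpus_per_node no longer maps to the tensor-parallel size in
    # this version of Mapping.
    mapping = Mapping(world_size, rank, gpus_per_node, tp_size=world_size)

    # Create the all-reduce helper before the network is traced, so that
    # current_all_reduce_helper() (used inside allreduce) has an instance
    # to attach its workspace to.
    init_all_reduce_helper()
    return mapping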
Great, it works. Thanks
System Info
CPU: Intel(R) Xeon(R) Platinum 8458P
GPU: 8x NVIDIA H20
Container: NVIDIA PyTorch docker Release 24.08 (build 107063150)
tensorrt 10.4.0
tensorrt-cu12 10.4.0
tensorrt-cu12-bindings 10.4.0
tensorrt-cu12-libs 10.4.0
tensorrt_llm 0.15.0.dev2024102200
typing_extensions 4.8.0
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Add init_all_reduce_helper() at line 70 of all_reduce.py, then run:
mpirun -n 8 --allow-run-as-root python all_reduce.py
Expected behavior
The all-reduce benchmark runs successfully.
Actual behavior
Additional notes
None