bytedance / flux

A fast communication-overlapping library for tensor parallelism on GPUs.
Apache License 2.0

[BUG] Illegal memory access when fuse_reduction=False #10

Closed: tlrmchlsmth closed this issue 3 months ago

tlrmchlsmth commented 4 months ago

Describe the bug

I'm hitting an illegal memory access in https://github.com/vllm-project/vllm/pull/5917 when setting fuse_reduction=False in the fused GEMM+ReduceScatter kernel.
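
For reference, the unfused equivalent of what the fused GEMM+ReduceScatter op computes is just a linear followed by a sum-reduce-scatter across the tensor-parallel group. A minimal PyTorch sketch of that baseline is below (hypothetical shapes, torchrun-style launch assumed; this is a reference for the math, not flux's API):

# Unfused reference for GEMM+ReduceScatter: per-rank partial GEMM, then a
# sum-reduce-scatter across the tensor-parallel group. Sketch only; the shapes
# and the launch method (torchrun --nproc-per-node=2 this_file.py) are assumptions.
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F

dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
world = dist.get_world_size()

m, k, n = 512, 2048, 4096  # hypothetical per-rank GEMM shape (m must be divisible by world)
x = torch.randn(m, k, dtype=torch.float16, device="cuda")
w = torch.randn(n, k, dtype=torch.float16, device="cuda")

partial = F.linear(x, w)  # per-rank partial result, shape (m, n)
out = torch.empty(m // world, n, dtype=torch.float16, device="cuda")
dist.reduce_scatter_tensor(out, partial)  # sum the partials, scatter rows across ranks

dist.destroy_process_group()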

To Reproduce

Clone https://github.com/vllm-project/vllm/pull/5917 and then apply this patch:

diff --git a/vllm/model_executor/layers/linear.py b/vllm/model_executor/layers/linear.py
index aa45cf98..adad2df6 100644
--- a/vllm/model_executor/layers/linear.py
+++ b/vllm/model_executor/layers/linear.py
@@ -852,7 +852,7 @@ class FluxRowParallelLinear(LinearBase):
             # Note: bfloat16 requires fuse_reduction=False.
             # When fuse_reduction=False, I encounter illegal memory accesses in
             # the kernel, which are hard to track down.
-            fuse_reduction=True,
+            fuse_reduction=False,
         )

         # Divide the weight matrix along the last dimension.

Then run:

python3 benchmarks/benchmark_latency.py --model meta-llama/Meta-Llama-3-8B-Instruct --num-iters 100 --batch-size 1 --input-len 512 --output-len 1 --enforce-eager --tensor-parallel-size 2 --dtype float16

Unfortunately, I haven't been able to reproduce this with a minimal example, and I also haven't been able to reproduce the problem when running under compute-sanitizer. Some problem sizes work and some don't (for instance, --input-len 1024 seems to work fine, but --input-len 512 does not).
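
For anyone trying to chase this down, the sanitizer and synchronous-launch runs look roughly like this (a sketch of the invocations, not exact transcripts; --target-processes all makes the sanitizer follow the spawned worker processes, and CUDA_LAUNCH_BLOCKING=1 makes the failing launch report synchronously):

compute-sanitizer --tool memcheck --target-processes all python3 benchmarks/benchmark_latency.py --model meta-llama/Meta-Llama-3-8B-Instruct --num-iters 100 --batch-size 1 --input-len 512 --output-len 1 --enforce-eager --tensor-parallel-size 2 --dtype float16

CUDA_LAUNCH_BLOCKING=1 python3 benchmarks/benchmark_latency.py --model meta-llama/Meta-Llama-3-8B-Instruct --num-iters 100 --batch-size 1 --input-len 512 --output-len 1 --enforce-eager --tensor-parallel-size 2 --dtype float16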

Stack trace/logs

(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`, Traceback (most recent call last):
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/worker/worker_base.py", line 63, in start_worker_execution_loop
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     output = self.execute_model(execute_model_req=None)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/worker/worker_base.py", line 255, in execute_model
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     output = self.model_runner.execute_model(model_input, self.kv_cache)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/worker/model_runner.py", line 994, in execute_model
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     hidden_states = model_executable(
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 378, in forward
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 292, in forward
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     hidden_states, residual = layer(
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 241, in forward
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 82, in forward
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     gate_up, _ = self.gate_up_proj(x)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/model_executor/layers/linear.py", line 300, in forward
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]   File "/home/tms/nm-vllm/vllm/model_executor/layers/linear.py", line 113, in apply
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]     return F.linear(x, weight, bias)
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
(VllmWorkerProcess pid=1869017) ERROR 06-27 18:05:15 multiproc_worker_utils.py:226]
Warmup iterations:   0%|                                                                                                                                                                                                                                                       | 0/10 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/tms/nm-vllm/benchmarks/benchmark_latency.py", line 280, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/tms/nm-vllm/benchmarks/benchmark_latency.py", line 91, in main
[rank0]:     run_to_completion(profile_dir=None)
[rank0]:   File "/home/tms/nm-vllm/benchmarks/benchmark_latency.py", line 82, in run_to_completion
[rank0]:     llm.generate(dummy_inputs,
[rank0]:   File "/home/tms/nm-vllm/vllm/utils.py", line 764, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/vllm/entrypoints/llm.py", line 304, in generate
[rank0]:     outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]:   File "/home/tms/nm-vllm/vllm/entrypoints/llm.py", line 556, in _run_engine
[rank0]:     step_outputs = self.llm_engine.step()
[rank0]:   File "/home/tms/nm-vllm/vllm/engine/llm_engine.py", line 806, in step
[rank0]:     output = self.model_executor.execute_model(
[rank0]:   File "/home/tms/nm-vllm/vllm/executor/distributed_gpu_executor.py", line 76, in execute_model
[rank0]:     return self._driver_execute_model(execute_model_req)
[rank0]:   File "/home/tms/nm-vllm/vllm/executor/multiproc_gpu_executor.py", line 88, in _driver_execute_model
[rank0]:     return self.driver_worker.execute_model(execute_model_req)
[rank0]:   File "/home/tms/nm-vllm/vllm/worker/worker_base.py", line 255, in execute_model
[rank0]:     output = self.model_runner.execute_model(model_input, self.kv_cache)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/vllm/worker/model_runner.py", line 994, in execute_model
[rank0]:     hidden_states = model_executable(
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 378, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 292, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 241, in forward
[rank0]:     hidden_states = self.mlp(hidden_states)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/vllm/model_executor/models/llama.py", line 82, in forward
[rank0]:     gate_up, _ = self.gate_up_proj(x)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tms/nm-vllm/vllm/model_executor/layers/linear.py", line 300, in forward
[rank0]:     output_parallel = self.quant_method.apply(self, input_, bias)
[rank0]:   File "/home/tms/nm-vllm/vllm/model_executor/layers/linear.py", line 113, in apply
[rank0]:     return F.linear(x, weight, bias)
[rank0]: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7d6ee92897 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f7d6ee42b25 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f7d6ef6a718 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f7d22c4ae36 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f7d22c4ef38 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f7d22c545ac in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7d22c5531c in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f7d6e6b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f7e10fddac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f7e1106f850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7d6ee92897 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f7d6ee42b25 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f7d6ef6a718 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f7d22c4ae36 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f7d22c4ef38 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f7d22c545ac in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7d22c5531c in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f7d6e6b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f7e10fddac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f7e1106f850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7d6ee92897 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32e33 (0x7f7d228d7e33 in /home/tms/nm-vllm/flux_experiment/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdc253 (0x7f7d6e6b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7f7e10fddac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f7e1106f850 in /lib/x86_64-linux-gnu/libc.so.6)

/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[1]    1868949 IOT instruction (core dumped)  python3 benchmarks/benchmark_latency.py --model  --num-iters 100 --batch-size
zheng-ningxin commented 4 months ago

Thank you very much for your feedback, @tlrmchlsmth. I was unable to reproduce this bug using the latest commits (nm-vllm: e556f59, flux: c866c438). The command I ran is:

python3 benchmarks/benchmark_latency.py --model /opt/tiger/Meta-Llama-3-8B-Instruct --num-iters 100 --batch-size 1 --input-len 2048 --output-len 1 --enforce-eager --tensor-parallel-size 4 --dtype float16

Could it be an environment-related issue?
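
For comparing setups, a quick first check would be to diff the environment dumps and driver info on both machines (plus the exact flux and nm-vllm commits), e.g.:

python3 -m torch.utils.collect_env
nvidia-smi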

zheng-ningxin commented 4 months ago

I changed the sequence length to 512 and am still not able to reproduce the bug:

python3 benchmarks/benchmark_latency.py --model /home/tiger/Meta-Llama-3-8B-Instruct --num-iters 100 --batch-size 1 --input-len 512 --output-len 1 --enforce-eager --tensor-parallel-size 2 --dtype float16

wenlei-bao commented 4 months ago

@zheng-ningxin Let's wait for @tlrmchlsmth to provide the Docker image to reproduce this, as mentioned in the other thread.

tlrmchlsmth commented 4 months ago

I made a docker to repro the issue, but all tests pass there. I’ll keep you posted.


tlrmchlsmth commented 4 months ago

I am no longer able to reproduce the issue at all on Flux's main branch. I've updated my vLLM PR and am now seeing a speedup vs. main :boom: