HabanaAI / vllm-fork

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: collective nonSFG is not supported during hpu graph capturing #192

Closed. xinsu626 closed this issue 2 weeks ago.

xinsu626 commented 1 month ago

Your current environment

I am using the following Docker image: vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest.

🐛 Describe the bug

On the main branch of the vllm-fork repository, I attempted to run the "meta-llama/Meta-Llama-3-70B" model using the following code:

from vllm import LLM, SamplingParams
import sys
import os
os.environ['PT_HPU_LAZY_MODE'] = '1'

prompts = [
    "The president of the United States is",
    "The capital of France is",
]

sampling_params = SamplingParams(n=1, temperature=0, max_tokens=30)
llm = LLM(model="meta-llama/Meta-Llama-3-70B", max_num_seqs=32, tensor_parallel_size=8)
outputs = llm.generate(prompts, sampling_params)

However, I encountered the following error:

(RayWorkerWrapper pid=26165) ERROR 08-16 05:39:57 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [repeated 6x across cluster]
(RayWorkerWrapper pid=26165) ERROR 08-16 05:39:57 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2220, in all_reduce [repeated 6x across cluster]
(RayWorkerWrapper pid=26165) ERROR 08-16 05:39:57 worker_base.py:382]     work = group.allreduce([tensor], opts) [repeated 6x across cluster]
(RayWorkerWrapper pid=26165) ERROR 08-16 05:39:57 worker_base.py:382] RuntimeError: collective nonSFG is not supported during hpu graph capturing [repeated 6x across cluster]
kdamaszk commented 4 weeks ago

Hi @xinsu626, please set this variable: PT_HPU_ENABLE_LAZY_COLLECTIVES=true. It is required to make HPU graphs work with tensor parallelism. Please check: Environment variables
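
For completeness, a minimal sketch of the suggested setup, assuming the variable can be set from Python before vLLM and the HPU bridge initialize (exporting it in the shell or in the docker run command should work as well):

import os

# Assumption: both variables must be set before importing vLLM, mirroring how
# PT_HPU_LAZY_MODE is handled in the original repro script above.
os.environ['PT_HPU_ENABLE_LAZY_COLLECTIVES'] = 'true'  # allow collectives during HPU graph capture
os.environ['PT_HPU_LAZY_MODE'] = '1'

from vllm import LLM, SamplingParams

# Same model and parallelism settings as in the bug report.
llm = LLM(model="meta-llama/Meta-Llama-3-70B", max_num_seqs=32, tensor_parallel_size=8)
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(n=1, temperature=0, max_tokens=30),
)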

xinsu626 commented 2 weeks ago

Hi @xinsu626, please set this variable: PT_HPU_ENABLE_LAZY_COLLECTIVES=true. It is required to make HPU graphs work with tensor parallelism. Please check: Environment variables

@kdamaszk Got it. Thank you for your help!

m9e commented 6 days ago

Is this functionally the same as having PT_HPU_LAZY_MODE set? (e.g., per the README warning, should it only be set with eager mode?)