Your current environment

docker: vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest
branch: habana_main

🐛 Describe the bug

I attempted to run the meta-llama/Meta-Llama-3.1-70B-Instruct model with offline inference. After the program starts, it hangs for a long time and then aborts with the error below. The same model runs without issues as an OpenAI-compatible server with tensor parallelism.
(RayWorkerWrapper pid=18930) *** SIGABRT received at time=1724981511 on cpu 144 ***
(RayWorkerWrapper pid=18930) PC: @ 0x7fa8c2bd89fc (unknown) pthread_kill
(RayWorkerWrapper pid=18930) @ 0x7fa8c2b84520 (unknown) (unknown)
(RayWorkerWrapper pid=18930) [2024-08-30 01:31:51,420 E 18930 20862] logging.cc:440: *** SIGABRT received at time=1724981511 on cpu 144 ***
(RayWorkerWrapper pid=18930) [2024-08-30 01:31:51,420 E 18930 20862] logging.cc:440: PC: @ 0x7fa8c2bd89fc (unknown) pthread_kill
(RayWorkerWrapper pid=18930) [2024-08-30 01:31:51,420 E 18930 20862] logging.cc:440: @ 0x7fa8c2b84520 (unknown) (unknown)
(RayWorkerWrapper pid=18930) Fatal Python error: Aborted
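For comparison, the server mode that works is launched along these lines (a sketch assuming the standard vLLM OpenAI entrypoint with the same engine settings; the exact flags may differ between versions):

PT_HPU_ENABLE_LAZY_COLLECTIVES=true python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 8 \
    --dtype bfloat16 \
    --block-size 128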
Here is my offline-inference code:
import os

# Set before vLLM/Ray start so the worker processes can inherit it.
os.environ["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "true"

from vllm import LLM, SamplingParams

prompts = [
    "The president of the United States is",
    "The capital of France is",
]
sampling_params = SamplingParams(n=1, temperature=0, max_tokens=2000)
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    block_size=128,
    dtype="bfloat16",
    tensor_parallel_size=8,
)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
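One detail I have not verified (an assumption on my side): whether the PT_HPU_ENABLE_LAZY_COLLECTIVES value set via os.environ in the driver actually reaches the RayWorkerWrapper processes. A minimal sketch of forcing it through Ray's runtime_env before constructing the LLM, in case that matters:

import os

os.environ["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "true"

import ray

# Assumption: passing the variable explicitly via runtime_env makes every
# Ray worker see it, regardless of how vLLM spawns them. Untested here.
ray.init(runtime_env={"env_vars": {"PT_HPU_ENABLE_LAZY_COLLECTIVES": "true"}})

from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    block_size=128,
    dtype="bfloat16",
    tensor_parallel_size=8,
)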