Your current environment

docker: vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest
branch: habana_main

🐛 Describe the bug

I attempted to run the meta-llama/Meta-Llama-3.1-70B-Instruct model with offline inference. After the program starts, it hangs for a long time and then aborts with the error below. The same model runs without issues as an OpenAI-compatible server with tensor parallelism.
(RayWorkerWrapper pid=18930) *** SIGABRT received at time=1724981511 on cpu 144 ***
(RayWorkerWrapper pid=18930) PC: @ 0x7fa8c2bd89fc (unknown) pthread_kill
(RayWorkerWrapper pid=18930) @ 0x7fa8c2b84520 (unknown) (unknown)
(RayWorkerWrapper pid=18930) [2024-08-30 01:31:51,420 E 18930 20862] logging.cc:440: *** SIGABRT received at time=1724981511 on cpu 144 ***
(RayWorkerWrapper pid=18930) [2024-08-30 01:31:51,420 E 18930 20862] logging.cc:440: PC: @ 0x7fa8c2bd89fc (unknown) pthread_kill
(RayWorkerWrapper pid=18930) [2024-08-30 01:31:51,420 E 18930 20862] logging.cc:440: @ 0x7fa8c2b84520 (unknown) (unknown)
(RayWorkerWrapper pid=18930) Fatal Python error: Aborted
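For comparison, the server mode that works is launched along these lines (a sketch assuming the standard vLLM OpenAI entrypoint with the same engine settings; the exact flags may differ between versions):

PT_HPU_ENABLE_LAZY_COLLECTIVES=true python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 8 \
    --dtype bfloat16 \
    --block-size 128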
Here is my offline-inference code:
import os

# Set before vLLM/Ray start so the worker processes can inherit it.
os.environ["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "true"

from vllm import LLM, SamplingParams

prompts = [
    "The president of the United States is",
    "The capital of France is",
]
sampling_params = SamplingParams(n=1, temperature=0, max_tokens=2000)
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    block_size=128,
    dtype="bfloat16",
    tensor_parallel_size=8,
)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
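One detail I have not verified (an assumption on my side): whether the PT_HPU_ENABLE_LAZY_COLLECTIVES value set via os.environ in the driver actually reaches the RayWorkerWrapper processes. A minimal sketch of forcing it through Ray's runtime_env before constructing the LLM, in case that matters:

import os

os.environ["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "true"

import ray

# Assumption: passing the variable explicitly via runtime_env makes every
# Ray worker see it, regardless of how vLLM spawns them. Untested here.
ray.init(runtime_env={"env_vars": {"PT_HPU_ENABLE_LAZY_COLLECTIVES": "true"}})

from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    block_size=128,
    dtype="bfloat16",
    tensor_parallel_size=8,
)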