MeetKai / functionary

Chat language model that can use tools and interpret the results

server_vllm.py doesn't run, multiprocessing errors #240

Closed (localmind-ai closed this issue 1 month ago)

localmind-ai commented 1 month ago

First of all, thanks for the great new release of Functionary Medium 3.1 based on Llama 3.1 70B! Looking forward to trying it out.

Unfortunately, we hit an issue when running server_vllm.py: we get CUDA multiprocessing errors that don't appear in regular vLLM (tested on the same server).

We launch with python3 server_vllm.py --model "meetkai/functionary-medium-v3.1" --host 0.0.0.0 --port 8080 --tensor-parallel-size 8, and this is the full error log from launching the script:

(VllmWorkerProcess pid=143186) Process VllmWorkerProcess:
(VllmWorkerProcess pid=143186) Traceback (most recent call last):
(VllmWorkerProcess pid=143186)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(VllmWorkerProcess pid=143186)     self.run()
(VllmWorkerProcess pid=143186)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=143186)     self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=143186)   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 210, in _run_worker_process
(VllmWorkerProcess pid=143186)     worker = worker_factory()
(VllmWorkerProcess pid=143186)   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 20, in create_worker
(VllmWorkerProcess pid=143186)     wrapper.init_worker(**kwargs)
(VllmWorkerProcess pid=143186)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 367, in init_worker
(VllmWorkerProcess pid=143186)     self.worker = worker_class(*args, **kwargs)
(VllmWorkerProcess pid=143186)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 90, in __init__
(VllmWorkerProcess pid=143186)     self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
(VllmWorkerProcess pid=143186)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 651, in __init__
(VllmWorkerProcess pid=143186)     self.attn_backend = get_attn_backend(
(VllmWorkerProcess pid=143186)   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 55, in get_attn_backend
(VllmWorkerProcess pid=143186)     from vllm.attention.backends.xformers import (  # noqa: F401
(VllmWorkerProcess pid=143186)   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/xformers.py", line 6, in <module>
(VllmWorkerProcess pid=143186)     from xformers import ops as xops
(VllmWorkerProcess pid=143186)   File "/usr/local/lib/python3.10/dist-packages/xformers/ops/__init__.py", line 8, in <module>
(VllmWorkerProcess pid=143186)     from .fmha import (
(VllmWorkerProcess pid=143186)   File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 12, in <module>
(VllmWorkerProcess pid=143186)     from . import ck, ck_decoder, ck_splitk, cutlass, decoder, flash, small_k, triton_splitk
(VllmWorkerProcess pid=143186)   File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/triton_splitk.py", line 89, in <module>
(VllmWorkerProcess pid=143186)     if TYPE_CHECKING or _is_triton_available():
(VllmWorkerProcess pid=143186)   File "/usr/local/lib/python3.10/dist-packages/xformers/__init__.py", line 38, in func_wrapper
(VllmWorkerProcess pid=143186)     value = func()
(VllmWorkerProcess pid=143186)   File "/usr/local/lib/python3.10/dist-packages/xformers/__init__.py", line 54, in _is_triton_available
(VllmWorkerProcess pid=143186)     if torch.cuda.get_device_capability("cuda") < (8, 0):
(VllmWorkerProcess pid=143186)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
(VllmWorkerProcess pid=143186)     prop = get_device_properties(device)
(VllmWorkerProcess pid=143186)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
(VllmWorkerProcess pid=143186)     _lazy_init()  # will define _get_device_properties
(VllmWorkerProcess pid=143186)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 279, in _lazy_init
(VllmWorkerProcess pid=143186)     raise RuntimeError(
(VllmWorkerProcess pid=143186) RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
ERROR 08-09 13:17:14 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 143186 died, exit code: 1
INFO 08-09 13:17:14 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 169, in _enqueue_task
    self._task_queue.put((task_id, method, args, kwargs))
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 88, in put
    raise ValueError(f"Queue {self!r} is closed")
ValueError: Queue <multiprocessing.queues.Queue object at 0x7f0fc9953fd0> is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/functionary/server_vllm.py", line 139, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 201, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 123, in _init_executor
    self._run_workers("init_device")
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 172, in _run_workers
    worker_outputs = [
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 173, in <listcomp>
    worker.execute_method(method, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 176, in execute_method
    self._enqueue_task(future, method, args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 172, in _enqueue_task
    raise ChildProcessError("worker died") from e
ChildProcessError: worker died
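
The key line is the RuntimeError raised by torch: CUDA state inherited through fork cannot be re-initialized in the child process, so any worker that touches the GPU has to be started with the "spawn" method. As a minimal sketch (a hypothetical standalone example, not from this repo, assuming Linux where fork is the default start method and a CUDA-capable GPU), the same failure can be reproduced with plain PyTorch:

import multiprocessing as mp

import torch

def use_gpu():
    # Under the default "fork" start method this raises
    # "RuntimeError: Cannot re-initialize CUDA in forked subprocess"
    # because the parent process already initialized CUDA before forking.
    return torch.zeros(1, device="cuda")

if __name__ == "__main__":
    # mp.set_start_method("spawn")  # uncommenting this line avoids the error
    torch.cuda.init()               # parent touches CUDA first
    child = mp.Process(target=use_gpu)
    child.start()
    child.join()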
khai-meetkai commented 1 month ago

Hi @localmind-ai, this issue comes from vLLM itself. To run on multiple GPUs, you need to run this command first: export VLLM_WORKER_MULTIPROC_METHOD=spawn

You can find more details here: https://github.com/vllm-project/vllm/issues/6152
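
For reference, combining the workaround with the launch command quoted in the report above gives the full sequence (a sketch; the flags are unchanged from the original command):

export VLLM_WORKER_MULTIPROC_METHOD=spawn
python3 server_vllm.py --model "meetkai/functionary-medium-v3.1" \
    --host 0.0.0.0 --port 8080 --tensor-parallel-size 8

The variable needs to be set in the same shell session that launches the server (or passed inline as an environment prefix) so that vLLM sees it when it creates the tensor-parallel worker processes.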

localmind-ai commented 1 month ago

Ah, thanks a lot @khai-meetkai - will try that out!