bentoml / OpenLLM

Run any open-source LLMs, such as Llama and Mistral, as OpenAI-compatible API endpoints in the cloud.
https://bentoml.com
Apache License 2.0

bug: Can openllm run on k8s clusters without GPUs? #1078

Open Lucas-16 opened 2 months ago

Lucas-16 commented 2 months ago

Describe the bug

I want to run Qwen2 0.5B on a k8s cluster without GPUs, but so far the service has failed to start. Is there any way to support CPU-only machines? (A screenshot, "Screenshot 2024-09-09 164657.jpg", failed to upload.)

To reproduce

No response

Logs

No response

Environment

only have CPU

System information (Optional)

No response

Lucas-16 commented 2 months ago

```
Traceback (most recent call last):
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/starlette/routing.py", line 732, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/root/anaconda3/envs/openllm/lib/python3.9/contextlib.py", line 181, in __aenter__
    return await self.gen.__anext__()
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/bentoml/_internal/server/base_app.py", line 74, in lifespan
    await on_startup()
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/_bentoml_impl/server/app.py", line 275, in create_instance
    self._service_instance = self.service()
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/_bentoml_sdk/service/factory.py", line 257, in __call__
    instance = self.inner()
  File "/root/.openllm/repos/github.com/bentoml/openllm-models/main/bentoml/bentos/qwen2/0.5b-instruct-fp16-33df/src/service.py", line 99, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(ENGINE_ARGS)
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
    engine = cls(
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
    return engine_class(*args, **kwargs)
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 251, in __init__
    self.model_executor = executor_class(
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 34, in _init_executor
    self.driver_worker = self._create_worker()
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 85, in _create_worker
    return create_worker(**self._get_create_worker_kwargs(
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 20, in create_worker
    wrapper.init_worker(**kwargs)
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/vllm/worker/worker_base.py", line 367, in init_worker
    self.worker = worker_class(*args, **kwargs)
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/vllm/worker/worker.py", line 90, in __init__
    self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 651, in __init__
    self.attn_backend = get_attn_backend(
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/vllm/attention/selector.py", line 46, in get_attn_backend
    backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/vllm/attention/selector.py", line 149, in which_attn_to_use
    if current_platform.get_device_capability()[0] < 8:
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/vllm/platforms/cuda.py", line 49, in get_device_capability
    return get_physical_device_capability(physical_device_id)
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/vllm/platforms/cuda.py", line 18, in wrapper
    pynvml.nvmlInit()
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/pynvml.py", line 1793, in nvmlInit
    nvmlInitWithFlags(0)
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/pynvml.py", line 1776, in nvmlInitWithFlags
    _LoadNvmlLibrary()
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/pynvml.py", line 1823, in _LoadNvmlLibrary
    _nvmlCheckReturn(NVML_ERROR_LIBRARY_NOT_FOUND)
  File "/root/.openllm/venv/397201824397438346/lib/python3.9/site-packages/pynvml.py", line 855, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_LibraryNotFound: NVML Shared Library Not Found
```

aarnphm commented 2 months ago

Maybe you can try the llamacpp models; by default, vLLM requires a GPU to be available.
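
For anyone who wants a CPU-only path right away, here is a minimal sketch using llama-cpp-python directly, outside of OpenLLM; the GGUF filename below is a placeholder, and any quantized Qwen2 0.5B Instruct GGUF build should work:

```python
# CPU-only chat completion with llama-cpp-python (pip install llama-cpp-python).
# The model file is a hypothetical local path to a quantized Qwen2 0.5B GGUF.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2-0_5b-instruct-q4_k_m.gguf",  # placeholder filename
    n_ctx=2048,    # context window size
    n_threads=4,   # CPU threads to use for inference
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(result["choices"][0]["message"]["content"])
```

llama-cpp-python also ships an OpenAI-compatible server (`pip install 'llama-cpp-python[server]'`, then `python -m llama_cpp.server --model <path-to-gguf>`), which is closer to the endpoint OpenLLM exposes.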

bojiang commented 1 month ago

All models supported by OpenLLM today require an Nvidia GPU or Apple silicon to run. We may add more options in the future, or you can contribute at https://github.com/bentoml/OpenLLM-models
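
Until then, a quick preflight check can confirm whether a node has a supported accelerator before deploying OpenLLM there. A minimal sketch using PyTorch, assuming `torch` is installed on the node:

```python
# Preflight check: today's OpenLLM models need an Nvidia GPU (CUDA)
# or Apple silicon (MPS). Exit non-zero on CPU-only nodes.
import torch

if torch.cuda.is_available():
    print(f"CUDA GPU found: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    print("Apple silicon (MPS) backend available")
else:
    raise SystemExit("No supported accelerator found: this node cannot run OpenLLM models")
```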