bentoml / OpenLLM

Run any open-source LLMs, such as Llama 3.1 and Gemma, as OpenAI-compatible API endpoints in the cloud.
https://bentoml.com
Apache License 2.0

/v1/chat/completions endpoint not responding - ValueError: The number of required GPUs exceeds the total number of available GPUs in the cluster. #758

Closed. cay89 closed this issue 2 months ago.

cay89 commented 9 months ago

I'm running zephyr-7b-alpha in a Docker container on Windows 10 with two RTX 3070 GPUs. However, when I try to make a call with the original example request on the /v1/chat/completions endpoint, it seems to run forever.

[screenshot: the request to /v1/chat/completions runs with no response]
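
For reference, the call that hangs is a standard OpenAI-style chat completion request, roughly like this (illustrative sketch; port 3000 is the server's default, and the model id may need to match whatever the server actually reports):

import requests

# Illustrative reproduction of the hanging request (not the exact payload from the screenshot).
resp = requests.post(
    'http://localhost:3000/v1/chat/completions',
    json={
        'model': 'HuggingFaceH4/zephyr-7b-alpha',
        'messages': [{'role': 'user', 'content': 'Hello, who are you?'}],
    },
    timeout=60,
)
print(resp.status_code, resp.text)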

What could be the problem?

aarnphm commented 9 months ago

What are the logs?

cay89 commented 9 months ago

I don't know... it just ran for minutes and didn't write anything. However, because it seemed to be using only one GPU, I started adjusting the setup. I set --workers-per-resource to 0.5, so my ENTRYPOINT now looks like this:

ENTRYPOINT openllm start HuggingFaceH4/zephyr-7b-alpha --backend vllm --gpu-memory-utilization 0.9 --workers-per-resource 0.5 --development --debug

Running with:

docker run --gpus all --publish 3000:3000 --shm-size 10gb cay89/aip-backend-llm:openllm-zephyr-7b-alpha

The result is an error, so the service doesn't start now:

ValueError: The number of required GPUs exceeds the total number of available GPUs in the cluster.

Logs:

config.json: 100%|██████████| 628/628 [00:00<00:00, 4.60MB/s]
tokenizer_config.json: 100%|██████████| 1.43k/1.43k [00:00<00:00, 2.19MB/s]
tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 1.74MB/s]
tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 13.3MB/s]
added_tokens.json: 100%|██████████| 42.0/42.0 [00:00<00:00, 263kB/s]
special_tokens_map.json: 100%|██████████| 168/168 [00:00<00:00, 647kB/s]
generation_config.json: 100%|██████████| 111/111 [00:00<00:00, 980kB/s]
eval_results.json: 100%|██████████| 557/557 [00:00<00:00, 3.51MB/s]
colab-demo.ipynb: 100%|██████████| 167k/167k [00:00<00:00, 8.10MB/s]
all_results.json: 100%|██████████| 732/732 [00:00<00:00, 2.55MB/s]
model-00008-of-00008.safetensors: 100%|██████████| 816M/816M [09:25<00:00, 1.44MB/s]
model.safetensors.index.json: 100%|██████████| 23.9k/23.9k [00:00<00:00, 33.6MB/s]
pytorch_model.bin.index.json: 100%|██████████| 23.9k/23.9k [00:00<00:00, 11.1MB/s]
thumbnail.png: 100%|██████████| 510k/510k [00:00<00:00, 2.38MB/s]
train_results.json: 100%|██████████| 195/195 [00:00<00:00, 1.10MB/s]
trainer_state.json: 100%|██████████| 104k/104k [00:00<00:00, 8.57MB/s]
model-00006-of-00008.safetensors: 100%|██████████| 1.95G/1.95G [17:38<00:00, 1.84MB/s]
model-00001-of-00008.safetensors: 100%|██████████| 1.89G/1.89G [19:09<00:00, 1.64MB/s]
model-00007-of-00008.safetensors: 100%|██████████| 1.98G/1.98G [19:22<00:00, 1.70MB/s]
model-00002-of-00008.safetensors: 100%|██████████| 1.95G/1.95G [19:45<00:00, 1.64MB/s]
model-00005-of-00008.safetensors: 100%|██████████| 1.98G/1.98G [19:50<00:00, 1.66MB/s]
model-00004-of-00008.safetensors: 100%|██████████| 1.95G/1.95G [20:01<00:00, 1.62MB/s]
model-00003-of-00008.safetensors: 100%|██████████| 1.98G/1.98G [20:01<00:00, 1.65MB/s]
Fetching 23 files: 100%|██████████| 23/23 [20:04<00:00, 52.39s/it]
🚀Tip: run 'openllm build HuggingFaceH4/zephyr-7b-alpha --backend vllm --serialization safetensors' to create a BentoLLM for 'HuggingFaceH4/zephyr-7b-alpha'
2023-12-11T20:22:57+0000 [DEBUG] [cli] Importing service "_service:svc" from working dir: "/usr/local/lib/python3.10/dist-packages/openllm"
2023-12-11T20:22:59+0000 [DEBUG] [cli] Default runner method set to 'generate_iterator', it can be accessed both via 'runner.run' and 'runner.generate_iterator.async_run'.
2023-12-11T20:22:59+0000 [DEBUG] [cli] 'llm-mistral-service' imported from source: bentoml.Service(name="llm-mistral-service", import_str="_service:svc", working_dir="/usr/local/lib/python3.10/dist-packages/openllm")
2023-12-11T20:22:59+0000 [DEBUG] [cli] Runner map: {}
2023-12-11T20:22:59+0000 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "_service:svc" can be accessed at http://localhost:3000/metrics.
2023-12-11T20:22:59+0000 [INFO] [cli] Installing handle_callback_exception to loop
2023-12-11T20:22:59+0000 [INFO] [cli] Registering signals...
2023-12-11T20:22:59+0000 [DEBUG] [cli] 'Arbiter.start' starts
2023-12-11T20:22:59+0000 [DEBUG] [cli] 'Arbiter.start' ends
2023-12-11T20:22:59+0000 [INFO] [cli] Starting master on pid 36
2023-12-11T20:22:59+0000 [DEBUG] [cli] 'Arbiter.initialize' starts
2023-12-11T20:22:59+0000 [DEBUG] [cli] Socket bound at 0.0.0.0:3000 - fd: 9
2023-12-11T20:22:59+0000 [INFO] [cli] sockets started
2023-12-11T20:22:59+0000 [DEBUG] [cli]     'Watcher.initialize' starts
2023-12-11T20:22:59+0000 [DEBUG] [cli]     'Watcher.initialize' ends
2023-12-11T20:22:59+0000 [DEBUG] [cli] 'Arbiter.initialize' ends
2023-12-11T20:22:59+0000 [DEBUG] [cli] Initializing watchers
2023-12-11T20:22:59+0000 [DEBUG] [cli] 'Watcher._start' starts
2023-12-11T20:22:59+0000 [DEBUG] [cli] 'Watcher._start' ends
2023-12-11T20:22:59+0000 [DEBUG] [cli] 'Watcher.reap_processes' starts
2023-12-11T20:22:59+0000 [DEBUG] [cli] 'Watcher.reap_processes' ends
2023-12-11T20:22:59+0000 [DEBUG] [cli] 'Watcher.spawn_processes' starts
2023-12-11T20:22:59+0000 [DEBUG] [cli] 'Watcher.spawn_processes' ends
2023-12-11T20:22:59+0000 [DEBUG] [cli] cmd: /usr/bin/python3
2023-12-11T20:22:59+0000 [DEBUG] [cli] args: ['-m', 'bentoml_cli.worker.http_api_server', '_service:svc', '--fd', '$(circus.sockets._bento_api_server)', '--runner-map', '{}', '--working-dir', '/usr/local/lib/python3.10/dist-packages/openllm', '--backlog', '2048', '--worker-id', '$(CIRCUS.WID)', '--prometheus-dir', '/root/bentoml/prometheus_multiproc_dir', '--ssl-version', '17', '--ssl-ciphers', 'TLSv1', '--development-mode']
2023-12-11T20:22:59+0000 [DEBUG] [cli] process args: ['/usr/bin/python3', '-m', 'bentoml_cli.worker.http_api_server', '_service:svc', '--fd', '9', '--runner-map', '{}', '--working-dir', '/usr/local/lib/python3.10/dist-packages/openllm', '--backlog', '2048', '--worker-id', '1', '--prometheus-dir', '/root/bentoml/prometheus_multiproc_dir', '--ssl-version', '17', '--ssl-ciphers', 'TLSv1', '--development-mode']
2023-12-11T20:22:59+0000 [DEBUG] [cli] running api_server process [pid 60]
2023-12-11T20:22:59+0000 [INFO] [cli] Arbiter now waiting for commands
2023-12-11T20:22:59+0000 [INFO] [cli] api_server started
2023-12-11T20:22:59+0000 [INFO] [cli] Starting production HTTP BentoServer from "_service:svc" listening on http://0.0.0.0:3000 (Press CTRL+C to quit)
2023-12-11T20:23:00+0000 [DEBUG] [api_server:1] Importing service "_service:svc" from working dir: "/usr/local/lib/python3.10/dist-packages/openllm"
2023-12-11T20:23:01+0000 [DEBUG] [api_server:1] Default runner method set to 'generate_iterator', it can be accessed both via 'runner.run' and 'runner.generate_iterator.async_run'.
2023-12-11T20:23:01+0000 [DEBUG] [api_server:1] 'llm-mistral-service' imported from source: bentoml.Service(name="llm-mistral-service", import_str="_service:svc", working_dir="/usr/local/lib/python3.10/dist-packages/openllm")
2023-12-11T20:23:01+0000 [INFO] [api_server:1] Started server process [60]
2023-12-11T20:23:01+0000 [INFO] [api_server:1] Waiting for application startup.
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Attempting to acquire lock 139710685069072 on /tmp/ray/session_2023-12-11_20-23-02_476771_60/node_ip_address.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Lock 139710685069072 acquired on /tmp/ray/session_2023-12-11_20-23-02_476771_60/node_ip_address.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Attempting to release lock 139710685069072 on /tmp/ray/session_2023-12-11_20-23-02_476771_60/node_ip_address.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Lock 139710685069072 released on /tmp/ray/session_2023-12-11_20-23-02_476771_60/node_ip_address.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Attempting to acquire lock 139710685069072 on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Lock 139710685069072 acquired on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Attempting to release lock 139710685069072 on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Lock 139710685069072 released on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Attempting to acquire lock 139710685069072 on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Lock 139710685069072 acquired on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Attempting to release lock 139710685069072 on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Lock 139710685069072 released on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Attempting to acquire lock 139710685069072 on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Lock 139710685069072 acquired on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Attempting to release lock 139710685069072 on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Lock 139710685069072 released on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Attempting to acquire lock 139710685069072 on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Lock 139710685069072 acquired on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Attempting to release lock 139710685069072 on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Lock 139710685069072 released on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Attempting to acquire lock 139710685069072 on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Lock 139710685069072 acquired on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Attempting to release lock 139710685069072 on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
2023-12-11T20:23:02+0000 [DEBUG] [api_server:1] Lock 139710685069072 released on /tmp/ray/session_2023-12-11_20-23-02_476771_60/ports_by_node.json.lock
D1211 20:23:02.492537879      60 config.cc:113]              gRPC EXPERIMENT tcp_frame_size_tuning               OFF (default:OFF)
D1211 20:23:02.492590180      60 config.cc:113]              gRPC EXPERIMENT tcp_read_chunks                     OFF (default:OFF)
D1211 20:23:02.492595480      60 config.cc:113]              gRPC EXPERIMENT tcp_rcv_lowat                       OFF (default:OFF)
D1211 20:23:02.492597780      60 config.cc:113]              gRPC EXPERIMENT peer_state_based_framing            OFF (default:OFF)
D1211 20:23:02.492600380      60 config.cc:113]              gRPC EXPERIMENT flow_control_fixes                  OFF (default:OFF)
D1211 20:23:02.492602580      60 config.cc:113]              gRPC EXPERIMENT memory_pressure_controller          OFF (default:OFF)
D1211 20:23:02.492604180      60 config.cc:113]              gRPC EXPERIMENT periodic_resource_quota_reclamation OFF (default:OFF)
D1211 20:23:02.492605580      60 config.cc:113]              gRPC EXPERIMENT unconstrained_max_quota_buffer_size OFF (default:OFF)
D1211 20:23:02.492607180      60 config.cc:113]              gRPC EXPERIMENT new_hpack_huffman_decoder           OFF (default:OFF)
D1211 20:23:02.492609480      60 config.cc:113]              gRPC EXPERIMENT event_engine_client                 OFF (default:OFF)
I1211 20:23:02.492738583      60 ev_epoll1_linux.cc:120]     grpc epoll fd: 23
D1211 20:23:02.492753683      60 ev_posix.cc:141]            Using polling engine: epoll1
D1211 20:23:02.492768983      60 dns_resolver_ares.cc:831]   Using ares dns resolver
D1211 20:23:02.493294594      60 lb_policy_registry.cc:45]   registering LB policy factory for "priority_experimental"
D1211 20:23:02.493518598      60 lb_policy_registry.cc:45]   registering LB policy factory for "weighted_target_experimental"
D1211 20:23:02.493562099      60 lb_policy_registry.cc:45]   registering LB policy factory for "pick_first"
D1211 20:23:02.493568099      60 lb_policy_registry.cc:45]   registering LB policy factory for "round_robin"
D1211 20:23:02.493794704      60 lb_policy_registry.cc:45]   registering LB policy factory for "ring_hash_experimental"
D1211 20:23:02.494032709      60 lb_policy_registry.cc:45]   registering LB policy factory for "grpclb"
D1211 20:23:02.494267213      60 lb_policy_registry.cc:45]   registering LB policy factory for "rls_experimental"
D1211 20:23:02.494735023      60 lb_policy_registry.cc:45]   registering LB policy factory for "xds_cluster_manager_experimental"
D1211 20:23:02.494760823      60 lb_policy_registry.cc:45]   registering LB policy factory for "xds_cluster_impl_experimental"
D1211 20:23:02.494766523      60 lb_policy_registry.cc:45]   registering LB policy factory for "cds_experimental"
D1211 20:23:02.494969927      60 lb_policy_registry.cc:45]   registering LB policy factory for "xds_cluster_resolver_experimental"
D1211 20:23:02.494993028      60 certificate_provider_registry.cc:35] registering certificate provider factory for "file_watcher"
I1211 20:23:02.496360555      60 socket_utils_common_posix.cc:407] Disabling AF_INET6 sockets because ::1 is not available.
I1211 20:23:02.496390456      60 socket_utils_common_posix.cc:336] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
I1211 20:23:02.496984868      88 subchannel.cc:910]          subchannel 0x5565145352e0 {address=ipv4:172.17.0.2:57365, args={grpc.client_channel_factory=0x5565143b3970, grpc.default_authority=172.17.0.2:57365, grpc.enable_http_proxy=0, grpc.http2.write_buffer_size=524288, grpc.initial_reconnect_backoff_ms=100, grpc.internal.channel_credentials=0x5565143b3950, grpc.internal.security_connector=0x55651453d690, grpc.internal.subchannel_pool=0x5565144eaee0, grpc.keepalive_time_ms=300000, grpc.keepalive_timeout_ms=120000, grpc.max_receive_message_length=536870912, grpc.max_reconnect_backoff_ms=2000, grpc.max_send_message_length=536870912, grpc.min_reconnect_backoff_ms=1000, grpc.primary_user_agent=grpc-c++/1.50.2, grpc.resource_quota=0x5565141e9b40, grpc.server_uri=dns:///172.17.0.2:57365}}: connect failed (UNKNOWN:Failed to connect to remote host: Connection refused {target_address:"ipv4:172.17.0.2:57365", created_time:"2023-12-11T20:23:02.496443457+00:00", errno:111, os_error:"Connection refused", syscall:"connect"}), backing off for 99 ms
I1211 20:23:02.597501291      91 subchannel.cc:867]          subchannel 0x5565145352e0 {address=ipv4:172.17.0.2:57365, args={grpc.client_channel_factory=0x5565143b3970, grpc.default_authority=172.17.0.2:57365, grpc.enable_http_proxy=0, grpc.http2.write_buffer_size=524288, grpc.initial_reconnect_backoff_ms=100, grpc.internal.channel_credentials=0x5565143b3950, grpc.internal.security_connector=0x55651453d690, grpc.internal.subchannel_pool=0x5565144eaee0, grpc.keepalive_time_ms=300000, grpc.keepalive_timeout_ms=120000, grpc.max_receive_message_length=536870912, grpc.max_reconnect_backoff_ms=2000, grpc.max_send_message_length=536870912, grpc.min_reconnect_backoff_ms=1000, grpc.primary_user_agent=grpc-c++/1.50.2, grpc.resource_quota=0x5565141e9b40, grpc.server_uri=dns:///172.17.0.2:57365}}: backoff delay elapsed, reporting IDLE
2023-12-11 20:23:05,168 INFO worker.py:1673 -- Started a local Ray instance.
I1211 20:23:05.800402371      60 server_builder.cc:348]      Synchronous server. Num CQs: 1, Min pollers: 1, Max Pollers: 2, CQ timeout (msec): 10000
I1211 20:23:05.800520372      60 tcp_server_posix.cc:337]    Failed to add :: listener, the environment may not support IPv6: UNKNOWN:Address family not supported by protocol {created_time:"2023-12-11T20:23:05.800497167+00:00", errno:97, os_error:"Address family not supported by protocol", syscall:"socket", target_address:"[::]:0"}
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/openllm/_runners.py", line 106, in __init__
    self.model = vllm.AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 492, in from_engine_args
    distributed_init_method, placement_group = initialize_cluster(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py", line 106, in initialize_cluster
    raise ValueError(
ValueError: The number of required GPUs exceeds the total number of available GPUs in the cluster.
2023-12-11T20:23:05+0000 [ERROR] [api_server:1] An exception occurred while instantiating runner 'llm-mistral-runner', see details below:
2023-12-11T20:23:05+0000 [ERROR] [api_server:1] Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/openllm/_runners.py", line 106, in __init__
self.model = vllm.AsyncLLMEngine.from_engine_args(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 492, in from_engine_args
distributed_init_method, placement_group = initialize_cluster(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py", line 106, in initialize_cluster
raise ValueError(
ValueError: The number of required GPUs exceeds the total number of available GPUs in the cluster.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner.py", line 307, in init_local
self._set_handle(LocalRunnerRef)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner.py", line 150, in _set_handle
runner_handle = handle_class(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/local.py", line 27, in __init__
self._runnable = runner.runnable_class(**runner.runnable_init_params)  # type: ignore
File "/usr/local/lib/python3.10/dist-packages/openllm/_runners.py", line 118, in __init__
raise openllm.exceptions.OpenLLMException(f'Failed to initialise vLLMEngine due to the following error:\n{err}') from err
openllm_core.exceptions.OpenLLMException: Failed to initialise vLLMEngine due to the following error:
The number of required GPUs exceeds the total number of available GPUs in the cluster.

2023-12-11T20:23:05+0000 [ERROR] [api_server:1] Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/openllm/_runners.py", line 106, in __init__
self.model = vllm.AsyncLLMEngine.from_engine_args(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 492, in from_engine_args
distributed_init_method, placement_group = initialize_cluster(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py", line 106, in initialize_cluster
raise ValueError(
ValueError: The number of required GPUs exceeds the total number of available GPUs in the cluster.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 677, in lifespan
async with self.lifespan_context(app) as maybe_state:
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/base_app.py", line 75, in lifespan
on_startup()
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner.py", line 317, in init_local
raise e
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner.py", line 307, in init_local
self._set_handle(LocalRunnerRef)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner.py", line 150, in _set_handle
runner_handle = handle_class(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/local.py", line 27, in __init__
self._runnable = runner.runnable_class(**runner.runnable_init_params)  # type: ignore
File "/usr/local/lib/python3.10/dist-packages/openllm/_runners.py", line 118, in __init__
raise openllm.exceptions.OpenLLMException(f'Failed to initialise vLLMEngine due to the following error:\n{err}') from err
openllm_core.exceptions.OpenLLMException: Failed to initialise vLLMEngine due to the following error:
The number of required GPUs exceeds the total number of available GPUs in the cluster.

2023-12-11T20:23:05+0000 [ERROR] [api_server:1] Application startup failed. Exiting.
aarnphm commented 9 months ago

You need to pass --gpus all to enable GPU on the container.
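
A quick way to confirm the container actually sees both GPUs (sketch; assumes PyTorch is present in the image, which the vLLM backend requires):

# run inside the container, e.g. docker exec -it <container> python3
import torch

print(torch.cuda.is_available())    # expected: True
print(torch.cuda.device_count())    # expected: 2 for two RTX 3070s
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i))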

cay89 commented 9 months ago

> You need to pass --gpus all to enable GPU on the container.

I did pass it, here:

[screenshot: the docker run command shown above, including --gpus all]

cay89 commented 9 months ago

I don't know if:

  1. I'm doing something wrong,
  2. there's something not right with the environment I'm working in,
  3. or it's a bug.
cay89 commented 9 months ago

Maybe it's a vllm bug? I found a workaround for this here:

https://github.com/vllm-project/vllm/issues/1116

But I don't know how I can implement this with OpenLLM.
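
For context, the check that raises this error lives in vllm/engine/ray_utils.py (per the traceback above). Paraphrased, it compares the number of GPUs the engine wants against the GPUs Ray reports inside the container (simplified sketch, not the exact upstream code):

import ray

# Simplified paraphrase of vllm.engine.ray_utils.initialize_cluster (not the exact source).
ray.init(ignore_reinit_error=True)
available_gpus = ray.cluster_resources().get('GPU', 0)
required_gpus = 2  # the engine's world size, i.e. the requested tensor parallelism
if required_gpus > available_gpus:
    raise ValueError('The number of required GPUs exceeds the total number of available GPUs in the cluster.')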

aarnphm commented 9 months ago

I will take a look into this once I'm available next week.

We have logic to determine the number of GPUs here: https://github.com/bentoml/OpenLLM/blob/8d989767e838972fe10e02d78bf640904560b85e/openllm-python/src/openllm/_runners.py#L104
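
For anyone hitting the same error: per the traceback, the runner builds the engine with vllm.AsyncLLMEngine.from_engine_args, and the failure occurs when the resulting tensor_parallel_size exceeds the GPUs visible to vLLM/Ray inside the container. A simplified sketch of that construction (illustrative, not the exact OpenLLM source):

import vllm
from vllm.engine.arg_utils import AsyncEngineArgs

# Simplified sketch of what openllm/_runners.py does around line 106 (not the exact source).
engine_args = AsyncEngineArgs(
    model='HuggingFaceH4/zephyr-7b-alpha',
    tensor_parallel_size=2,        # derived from the GPUs assigned to the runner
    gpu_memory_utilization=0.9,    # matches the --gpu-memory-utilization flag used above
)
engine = vllm.AsyncLLMEngine.from_engine_args(engine_args)  # raises the ValueError when tensor_parallel_size exceeds visible GPUs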

bojiang commented 2 months ago

Closing for OpenLLM 0.6.