2024-03-19 19:52:14,449 WARNING utils.py:575 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
2024-03-19 19:52:14,450 WARNING utils.py:587 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 30.71999 to 30.
2024-03-19 19:52:14,649 INFO worker.py:1724 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO: Model = 'ParasiticRogue/Merged-Vicuna-RP-Stew-34B'
INFO: DataType = torch.bfloat16
INFO: Model Load Format = auto
INFO: Number of GPUs = 2
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = None
INFO: Context Length = 32768
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = int8
INFO: KV Cache Params Path = None
INFO: Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/root/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 599, in <module>
engine = AsyncAphrodite.from_engine_args(engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 113, in __init__
self._init_workers_ray(placement_group)
File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 268, in _init_workers_ray
self.driver_worker = Worker(
^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/worker.py", line 60, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 92, in __init__
self.kv_quant_params = (self.load_kv_quant_params(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 116, in load_kv_quant_params
kv_quant_params.append(kv_quant_param)
^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'kv_quant_param' where it is not associated with a value
2024-03-19 19:52:19,750 ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerAphrodite.init_worker() (pid=26429, ip=172.17.0.2, actor_id=537d7fe532ba3d411a06c1f001000000, repr=<aphrodite.engine.ray_tools.RayWorkerAphrodite object at 0x7f34058b5b50>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/ray_tools.py", line 22, in init_worker
self.worker = worker_init_fn()
^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 252, in <lambda>
lambda rank=rank, local_rank=local_rank: Worker(
^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/worker.py", line 60, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 92, in __init__
self.kv_quant_params = (self.load_kv_quant_params(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 116, in load_kv_quant_params
kv_quant_params.append(kv_quant_param)
^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'kv_quant_param' where it is not associated with a value
Well, it turns out that I didn't have enough VRAM to load the model in 16-bit, but I just tried it with --load-in-4bit, and the failure is the same. Without the int8 KV cache, the model loads fine:
(aphrodite-runtime) root@C.10151121:~/aphrodite-engine$ python -m aphrodite.endpoints.openai.api_server -tp 2 --model ParasiticRogue/Merged-Vicuna-RP-Stew-34B --load-in-4bit
WARNING: bnb quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-03-19 20:03:18,803 WARNING utils.py:575 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
2024-03-19 20:03:18,804 WARNING utils.py:587 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 30.71999 to 30.
2024-03-19 20:03:18,984 INFO worker.py:1724 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO: Model = 'ParasiticRogue/Merged-Vicuna-RP-Stew-34B'
INFO: DataType = torch.bfloat16
INFO: Model Load Format = auto
INFO: Number of GPUs = 2
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = bnb
INFO: Context Length = 32768
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING: Custom allreduce is disabled because your platform lacks GPU P2P capability. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerAphrodite pid=36344) WARNING: Custom allreduce is disabled because your platform lacks GPU P2P capability. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO: Downloading model weights ['*.safetensors']
(RayWorkerAphrodite pid=36344) INFO: Downloading model weights ['*.safetensors']
INFO: Memory allocated for converted model: 9.17 GiB
INFO: Memory reserved for converted model: 9.26 GiB
INFO: Model weights loaded. Memory usage: 9.17 GiB x 2 = 18.34 GiB
With --kv-cache-dtype fp8_e5m2 and --load-in-4bit, it also works.
Oops, never mind. I didn't read the documentation. Sorry, lol. You might want to put that in boldface or something on the main page where you mention it.
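For anyone else who lands on this traceback: here is a minimal sketch of how this kind of UnboundLocalError usually arises, and how a guard could turn it into a readable message. This is not Aphrodite's actual load_kv_quant_params code. The function names, the placeholder values, and the assumption that the int8 KV cache expects a params path (suggested only by the "KV Cache Params Path = None" line in the log) are all hypothetical.

```python
from typing import List, Optional


def load_params_unguarded(path: Optional[str], num_layers: int) -> List[List[float]]:
    """Hypothetical loader: `param` is bound only when a path is given."""
    params: List[List[float]] = []
    for _layer in range(num_layers):
        if path is not None:
            # Placeholder for values that would be read from `path`.
            param = [1.0, 0.0]
        # With path=None, `param` was never assigned, so this line raises
        # UnboundLocalError, the same class of error as in the traceback.
        params.append(param)
    return params


def load_params_guarded(
    kv_cache_dtype: str, path: Optional[str], num_layers: int
) -> List[List[float]]:
    """Hypothetical variant that validates the arguments before loading."""
    if kv_cache_dtype != "int8":
        # Assumption: other cache dtypes need no per-layer params.
        return []
    if path is None:
        raise ValueError(
            "int8 KV cache was requested but no params path was given"
        )
    return load_params_unguarded(path, num_layers)


if __name__ == "__main__":
    try:
        load_params_unguarded(None, 2)
    except UnboundLocalError as exc:
        print("unguarded:", type(exc).__name__)
    try:
        load_params_guarded("int8", None, 2)
    except ValueError as exc:
        print("guarded:", exc)
```

The guarded version fails early with a message that names the missing argument, rather than surfacing an unbound local from deep inside worker initialization.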
Your current environment
🐛 Describe the bug
(aphrodite-runtime) root@C.10151121:~/aphrodite-engine$ python -m aphrodite.endpoints.openai.api_server -tp 2 --model ParasiticRogue/Merged-Vicuna-RP-Stew-34B --kv-cache-dtype int8