I encountered an issue with vLLM when trying to run Llama-3 70B PPO; it is likely due to the increased context length of the Llama-3 model. I was wondering whether you have run into this error before, since you have tested with the Mistral model, which has a similar or longer context length.
Should we consider adding `max_model_len` and `gpu_memory_utilization` to the vLLM engine arguments?
```
(LLMRayActor pid=20770) INFO 06-28 06:11:45 distributed_gpu_executor.py:45] # GPU blocks: 378, # CPU blocks: 1638
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145] Error executing method initialize_cache. This might cause deadlock in distributed execution.
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145] Traceback (most recent call last):
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 137, in execute_method
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145]     return executor(*args, **kwargs)
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 172, in initialize_cache
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145]     raise_if_cache_size_invalid(num_gpu_blocks,
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 340, in raise_if_cache_size_invalid
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145]     raise ValueError(
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145] ValueError: The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (6048). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
```
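For context on the numbers: vLLM's default KV-cache block size is 16 tokens, so the 378 GPU blocks reported above hold 378 × 16 = 6048 tokens, which is less than Llama-3's 8192-token max seq len, hence the `ValueError`. Below is a minimal sketch of what I am suggesting, using the two knobs the error message itself names when constructing the engine; the model id, tensor-parallel size, and values are illustrative assumptions, not the repo's actual config:

```python
# A minimal sketch, not the repo's actual code: pass max_model_len and
# gpu_memory_utilization through to the vLLM engine at construction time.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # illustrative model id
    tensor_parallel_size=8,       # assumption: 70B sharded across 8 GPUs
    max_model_len=4096,           # cap context below Llama-3's 8192 default
    gpu_memory_utilization=0.95,  # raise from vLLM's 0.90 default to fit more KV-cache blocks
)
```

Either lowering `max_model_len` or raising `gpu_memory_utilization` should make the KV cache large enough to cover the model's max seq len, at the cost of shorter contexts or less headroom for activations, respectively.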