I encountered an issue with vLLM when trying to run Llama-3 70B PPO; it is likely due to the increased context length of the Llama-3 model. I was wondering whether you have run into this error before, since you have tested with the Mistral model, which has a similar or longer context length.
Should we consider adding `max_model_len` and `gpu_memory_utilization` to the vLLM engine arguments?
```
(LLMRayActor pid=20770) INFO 06-28 06:11:45 distributed_gpu_executor.py:45] # GPU blocks: 378, # CPU blocks: 1638
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145] Error executing method initialize_cache. This might cause deadlock in distributed execution.
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145] Traceback (most recent call last):
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 137, in execute_method
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145]     return executor(*args, **kwargs)
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 172, in initialize_cache
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145]     raise_if_cache_size_invalid(num_gpu_blocks,
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 340, in raise_if_cache_size_invalid
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145]     raise ValueError(
(LLMRayActor pid=20770) ERROR 06-28 06:11:45 worker_base.py:145] ValueError: The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (6048). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
```
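For context on the numbers: vLLM's default KV-cache block size is 16 tokens, so the 378 GPU blocks reported above hold 378 × 16 = 6048 tokens, which is less than Llama-3's 8192-token max seq len, hence the `ValueError`. Below is a minimal sketch of what I am suggesting, using the two knobs the error message itself names when constructing the engine; the model id, tensor-parallel size, and values are illustrative assumptions, not the repo's actual config:

```python
# A minimal sketch, not the repo's actual code: pass max_model_len and
# gpu_memory_utilization through to the vLLM engine at construction time.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # illustrative model id
    tensor_parallel_size=8,       # assumption: 70B sharded across 8 GPUs
    max_model_len=4096,           # cap context below Llama-3's 8192 default
    gpu_memory_utilization=0.95,  # raise from vLLM's 0.90 default to fit more KV-cache blocks
)
```

Either lowering `max_model_len` or raising `gpu_memory_utilization` should make the KV cache large enough to cover the model's max seq len, at the cost of shorter contexts or less headroom for activations, respectively.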