Then the following error appeared:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-05 21:11:03 config.py:676] Chunked prefill is enabled (EXPERIMENTAL).
2024-06-05 21:11:05,275 INFO worker.py:1749 -- Started a local Ray instance.
INFO 06-05 21:11:05 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/joyue/model/glm-4-9b-chat', speculative_config=None, tokenizer='/joyue/model/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/joyue/model/glm-4-9b-chat)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-05 21:11:05 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 06-05 21:11:08 model_runner.py:146] Loading model weights took 17.5635 GB
INFO 06-05 21:11:10 distributed_gpu_executor.py:56] # GPU blocks: 0, # CPU blocks: 6553
ERROR 06-05 21:11:10 worker_base.py:148] Error executing method initialize_cache. This might cause deadlock in distributed execution.
ERROR 06-05 21:11:10 worker_base.py:148] Traceback (most recent call last):
ERROR 06-05 21:11:10 worker_base.py:148] File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
ERROR 06-05 21:11:10 worker_base.py:148] return executor(*args, **kwargs)
ERROR 06-05 21:11:10 worker_base.py:148] File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker.py", line 187, in initialize_cache
ERROR 06-05 21:11:10 worker_base.py:148] raise_if_cache_size_invalid(num_gpu_blocks,
ERROR 06-05 21:11:10 worker_base.py:148] File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker.py", line 370, in raise_if_cache_size_invalid
ERROR 06-05 21:11:10 worker_base.py:148] raise ValueError("No available memory for the cache blocks. "
ERROR 06-05 21:11:10 worker_base.py:148] ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.
rank0: Traceback (most recent call last):
rank0: File "/joyue/work/pythonwork/GLM-4/basic_demo/vllm_cli_demo.py", line 46, in <module>
rank0: engine, tokenizer = load_model_and_tokenizer(MODEL_PATH)
rank0: File "/joyue/work/pythonwork/GLM-4/basic_demo/vllm_cli_demo.py", line 42, in load_model_and_tokenizer
rank0: engine = AsyncLLMEngine.from_engine_args(engine_args)
rank0: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
rank0: engine = cls(
rank0: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
rank0: self.engine = self._init_engine(*args, **kwargs)
rank0: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
rank0: return engine_class(*args, **kwargs)
rank0: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 235, in __init__
rank0: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in _initialize_kv_caches
rank0: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
rank0: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 62, in initialize_cache
rank0: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
rank0: driver_worker_output = self.driver_worker.execute_method(
rank0: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
rank0: raise e
rank0: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
rank0: return executor(*args, **kwargs)
rank0: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker.py", line 187, in initialize_cache
rank0: File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker.py", line 370, in raise_if_cache_size_invalid
rank0: raise ValueError("No available memory for the cache blocks. "
rank0: ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine
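For context, the failure follows directly from the numbers in the log: gpu_memory_utilization caps vLLM's total GPU usage, and 0.3 of a 24 GB card is less than the 17.56 GB weight footprint reported above, so nothing is left for KV-cache blocks (hence `# GPU blocks: 0`). A rough sanity check (vLLM's real accounting also profiles activation memory, so this is approximate):

```python
# Rough memory-budget check using the numbers reported in the log above.
total_vram_gb = 24.0          # 24 GB card (from the System Info section)
weights_gb = 17.5635          # "Loading model weights took 17.5635 GB"
gpu_memory_utilization = 0.3  # value passed to AsyncEngineArgs

budget_gb = total_vram_gb * gpu_memory_utilization  # all vLLM is allowed to use
kv_cache_gb = budget_gb - weights_gb                # what remains for KV-cache blocks

print(f"budget = {budget_gb:.1f} GB, left for KV cache = {kv_cache_gb:.1f} GB")
# budget = 7.2 GB, left for KV cache = -10.4 GB
```

A negative remainder is exactly the condition that makes `raise_if_cache_size_invalid` raise the ValueError above.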
System Info / 系統信息
Python 3.10, cuda2.2, 24 GB of GPU memory
Who can help? / 谁可以帮助到您?
No response
Information / 问题信息
Reproduction / 复现过程
In the file vllm_cli_demo.py:
engine_args = AsyncEngineArgs(
    model=model_dir,
    tokenizer=model_dir,
    tensor_parallel_size=1,
    dtype="bfloat16",
    trust_remote_code=True,
    gpu_memory_utilization=0.3,
    enforce_eager=True,
    worker_use_ray=True,
    engine_use_ray=False,
    disable_log_requests=True,
)
If you encounter OOM, it is recommended to enable the following parameters.
With the configuration above, it reported the error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB.
Following the official tip, I adjusted the parameters above to:

engine_args = AsyncEngineArgs(
    model=model_dir,
    tokenizer=model_dir,
    tensor_parallel_size=1,
    dtype="bfloat16",
    trust_remote_code=True,
    gpu_memory_utilization=0.3,
    enforce_eager=True,
    worker_use_ray=True,
    engine_use_ray=False,
    disable_log_requests=True,
If you encounter OOM, it is recommended to enable the following parameters.
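The ValueError itself points at gpu_memory_utilization. As one hedged sketch (not verified on this setup; the 0.9 and max_model_len=8192 values are illustrative assumptions, not official recommendations), raising the memory fraction above the ~0.73 that the weights alone require, and capping max_model_len so the KV cache for the default 131072-token context no longer has to fit, would look like:

```python
from vllm import AsyncEngineArgs

model_dir = "/joyue/model/glm-4-9b-chat"

engine_args = AsyncEngineArgs(
    model=model_dir,
    tokenizer=model_dir,
    tensor_parallel_size=1,
    dtype="bfloat16",
    trust_remote_code=True,
    gpu_memory_utilization=0.9,  # illustrative: must exceed weights/VRAM ≈ 17.6/24 ≈ 0.73
    max_model_len=8192,          # illustrative: the default 131072 needs far more KV-cache memory
    enforce_eager=True,
    worker_use_ray=True,
    engine_use_ray=False,
    disable_log_requests=True,
)
```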
Then the same error appeared again (identical to the log at the top of this issue).

Expected behavior / 期待表现
How can this be resolved?