THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs
Apache License 2.0

vllm_cli_demo error #47

Closed dannypei closed 3 months ago

dannypei commented 3 months ago

System Info / 系統信息

python3.10, cuda2.2, 24 GB VRAM

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

In vllm_cli_demo.py:

```python
engine_args = AsyncEngineArgs(
    model=model_dir,
    tokenizer=model_dir,
    tensor_parallel_size=1,
    dtype="bfloat16",
    trust_remote_code=True,
    gpu_memory_utilization=0.3,
    enforce_eager=True,
    worker_use_ray=True,
    engine_use_ray=False,
    disable_log_requests=True,
    # If you run into OOM, it is recommended to enable the parameters below
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
```

With the configuration above, it fails with `torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU`

Following the official hint, I adjusted the parameters above to:

```python
engine_args = AsyncEngineArgs(
    model=model_dir,
    tokenizer=model_dir,
    tensor_parallel_size=1,
    dtype="bfloat16",
    trust_remote_code=True,
    gpu_memory_utilization=0.3,
    enforce_eager=True,
    worker_use_ray=True,
    engine_use_ray=False,
    disable_log_requests=True,
    # If you run into OOM, it is recommended to enable the parameters below
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192
)
```

Then the following error occurred:

```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-05 21:11:03 config.py:676] Chunked prefill is enabled (EXPERIMENTAL).
2024-06-05 21:11:05,275 INFO worker.py:1749 -- Started a local Ray instance.
INFO 06-05 21:11:05 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/joyue/model/glm-4-9b-chat', speculative_config=None, tokenizer='/joyue/model/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/joyue/model/glm-4-9b-chat)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-05 21:11:05 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 06-05 21:11:08 model_runner.py:146] Loading model weights took 17.5635 GB
INFO 06-05 21:11:10 distributed_gpu_executor.py:56] # GPU blocks: 0, # CPU blocks: 6553
ERROR 06-05 21:11:10 worker_base.py:148] Error executing method initialize_cache. This might cause deadlock in distributed execution.
ERROR 06-05 21:11:10 worker_base.py:148] Traceback (most recent call last):
ERROR 06-05 21:11:10 worker_base.py:148]   File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
ERROR 06-05 21:11:10 worker_base.py:148]     return executor(*args, **kwargs)
ERROR 06-05 21:11:10 worker_base.py:148]   File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker.py", line 187, in initialize_cache
ERROR 06-05 21:11:10 worker_base.py:148]     raise_if_cache_size_invalid(num_gpu_blocks,
ERROR 06-05 21:11:10 worker_base.py:148]   File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker.py", line 370, in raise_if_cache_size_invalid
ERROR 06-05 21:11:10 worker_base.py:148]     raise ValueError("No available memory for the cache blocks. "
ERROR 06-05 21:11:10 worker_base.py:148] ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.
rank0: Traceback (most recent call last):
rank0:   File "/joyue/work/pythonwork/GLM-4/basic_demo/vllm_cli_demo.py", line 46, in <module>
rank0:     engine, tokenizer = load_model_and_tokenizer(MODEL_PATH)
rank0:   File "/joyue/work/pythonwork/GLM-4/basic_demo/vllm_cli_demo.py", line 42, in load_model_and_tokenizer
rank0:     engine = AsyncLLMEngine.from_engine_args(engine_args)
rank0:   File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
rank0:     engine = cls(
rank0:   File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
rank0:     self.engine = self._init_engine(*args, **kwargs)
rank0:   File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
rank0:     return engine_class(*args, **kwargs)
rank0:   File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 235, in __init__
rank0:   File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in _initialize_kv_caches
rank0:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
rank0:   File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 62, in initialize_cache
rank0:   File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
rank0:     driver_worker_output = self.driver_worker.execute_method(
rank0:   File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
rank0:     raise e
rank0:   File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
rank0:     return executor(*args, **kwargs)
rank0:   File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker.py", line 187, in initialize_cache
rank0:   File "/home/ps/miniconda3/envs/glm4/lib/python3.10/site-packages/vllm/worker/worker.py", line 370, in raise_if_cache_size_invalid
rank0:     raise ValueError("No available memory for the cache blocks. "
rank0: ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine
```
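A quick back-of-the-envelope check (not part of the original report, illustrative arithmetic only) shows why vLLM ends up with `# GPU blocks: 0` here: with `gpu_memory_utilization=0.3` on a 24 GB card, vLLM's budget is roughly 7.2 GB, while the log shows the weights alone take 17.56 GB, so nothing is left for KV-cache blocks.

```python
# Rough memory budget implied by the log above (illustrative numbers from this issue).
total_vram_gb = 24.0            # 24 GB card from the system info
gpu_memory_utilization = 0.3    # value used in the failing config
weights_gb = 17.5635            # "Loading model weights took 17.5635 GB"

budget_gb = total_vram_gb * gpu_memory_utilization   # ~7.2 GB usable by vLLM
kv_cache_gb = budget_gb - weights_gb                 # negative -> "# GPU blocks: 0"
print(f"vLLM budget: {budget_gb:.1f} GB, left for KV cache: {kv_cache_gb:.1f} GB")
```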

Expected behavior / 期待表现

How can this be resolved?

zRzRzRzRzRzRzR commented 3 months ago

Just set gpu_memory_utilization to 0.9. This value is the fraction of GPU memory that vLLM is allowed to use.
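A minimal sketch of that change, reusing the settings from the report above (the model path is the one from the log; the chunked-prefill parameters are the ones the demo suggests for OOM), not an official configuration:

```python
from vllm import AsyncEngineArgs

model_dir = "/joyue/model/glm-4-9b-chat"  # local path taken from the log above

engine_args = AsyncEngineArgs(
    model=model_dir,
    tokenizer=model_dir,
    tensor_parallel_size=1,
    dtype="bfloat16",
    trust_remote_code=True,
    gpu_memory_utilization=0.9,  # was 0.3; 0.9 of 24 GB leaves room for ~17.6 GB of weights plus KV cache
    enforce_eager=True,
    worker_use_ray=True,
    engine_use_ray=False,
    disable_log_requests=True,
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,
)
```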

yungongzi commented 3 months ago

4090 24 GB: hit the same problem when deploying with vLLM; solved it by setting --max-model-len 16384.
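For reference, the same cap can be applied in the Python demo through the max_model_len argument; the log shows the model otherwise defaults to a 131072-token context, whose KV cache may still not fit on a 24 GB card even at gpu_memory_utilization=0.9. A hedged sketch (my own assembly of the settings discussed in this thread, parameter names as in vLLM 0.4.x):

```python
from vllm import AsyncEngineArgs

# Python equivalent of the CLI flag --max-model-len 16384 when building the engine.
engine_args = AsyncEngineArgs(
    model="/joyue/model/glm-4-9b-chat",  # path taken from the log above
    trust_remote_code=True,
    dtype="bfloat16",
    gpu_memory_utilization=0.9,
    max_model_len=16384,  # cap the context so the KV cache fits alongside the weights
)
```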