OpenBMB / MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone

[BUG] OOM when running MiniCPM-Llama3-V-2_5 local inference with vLLM on an NVIDIA A10 GPU #369

Closed leeaction closed 4 months ago

leeaction commented 4 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

Running the MiniCPM-V example in vLLM fails with a CUDA OOM error:

```
(myenv) root@dsw-416183-5ff8594b99-h8jts:/mnt/workspace/vllm# python examples/offline_inference_vision_language.py --model-type minicpmv
INFO 07-29 17:41:49 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='openbmb/MiniCPM-Llama3-V-2_5', speculative_config=None, tokenizer='openbmb/MiniCPM-Llama3-V-2_5', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=openbmb/MiniCPM-Llama3-V-2_5, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-29 17:41:50 model_runner.py:720] Starting to load model openbmb/MiniCPM-Llama3-V-2_5...
INFO 07-29 17:41:51 weight_utils.py:224] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/7 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  14% Completed | 1/7 [00:00<00:02, 2.43it/s]
Loading safetensors checkpoint shards:  29% Completed | 2/7 [00:00<00:02, 2.41it/s]
Loading safetensors checkpoint shards:  43% Completed | 3/7 [00:01<00:01, 2.34it/s]
Loading safetensors checkpoint shards:  57% Completed | 4/7 [00:01<00:01, 2.38it/s]
Loading safetensors checkpoint shards:  71% Completed | 5/7 [00:02<00:00, 2.41it/s]
Loading safetensors checkpoint shards:  86% Completed | 6/7 [00:02<00:00, 2.44it/s]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:02<00:00, 2.44it/s]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:02<00:00, 2.42it/s]
INFO 07-29 17:41:54 model_runner.py:732] Loading model weights took 15.9524 GB
/opt/conda/envs/myenv/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
  warnings.warn(
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/workspace/vllm/examples/offline_inference_vision_language.py", line 188, in <module>
[rank0]:   File "/mnt/workspace/vllm/examples/offline_inference_vision_language.py", line 140, in main
[rank0]:     llm, prompt = model_example_map[model]
[rank0]:   File "/mnt/workspace/vllm/examples/offline_inference_vision_language.py", line 96, in run_minicpmv
[rank0]:     llm = LLM(
[rank0]:   File "/mnt/workspace/vllm/vllm/entrypoints/llm.py", line 155, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/mnt/workspace/vllm/vllm/engine/llm_engine.py", line 447, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/mnt/workspace/vllm/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]:   File "/mnt/workspace/vllm/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
[rank0]:   File "/mnt/workspace/vllm/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks()
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/mnt/workspace/vllm/vllm/worker/worker.py", line 179, in determine_num_available_blocks
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/mnt/workspace/vllm/vllm/worker/model_runner.py", line 935, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/mnt/workspace/vllm/vllm/worker/model_runner.py", line 1354, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/workspace/vllm/vllm/model_executor/models/minicpmv.py", line 619, in forward
[rank0]:     vlm_embeddings, vision_hidden_states = self.get_embedding(inputs)
[rank0]:   File "/mnt/workspace/vllm/vllm/model_executor/models/minicpmv.py", line 562, in get_embedding
[rank0]:     vision_hidden_states = self.get_vision_hidden_states(data)
[rank0]:   File "/mnt/workspace/vllm/vllm/model_executor/models/minicpmv.py", line 545, in get_vision_hidden_states
[rank0]:     vision_embedding = self.vpm(
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 628, in forward
[rank0]:     encoder_outputs = self.encoder(
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 553, in forward
[rank0]:     layer_outputs = encoder_layer(
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 465, in forward
[rank0]:     hidden_states, attn_weights = self.self_attn(
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 233, in forward
[rank0]:     attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) * self.scale
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.66 GiB. GPU
```

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          Off  | 00000000:00:08.0 Off |                    0 |
|  0%   37C    P8    15W / 150W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

According to the README, the minimum GPU memory requirement for MiniCPM-Llama3-V-2_5 is 19 GB.
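For reference, a back-of-envelope budget from the log above suggests why a 22 GiB A10 is still tight here. This is a rough sketch, assuming vLLM's default gpu_memory_utilization of 0.9; all other numbers come from this run:

```python
# Rough memory budget for this run; numbers taken from the log above.
total_gib = 22731 / 1024        # A10 reports 22731 MiB total
usable_gib = total_gib * 0.9    # vLLM's default gpu_memory_utilization
weights_gib = 15.9524           # "Loading model weights took 15.9524 GB"
headroom_gib = usable_gib - weights_gib
print(f"headroom after weights: {headroom_gib:.2f} GiB")  # ~4.0 GiB
# The profiling pass at max_seq_len=8192, plus the vision encoder's eager
# attention (a single matmul asked for 2.66 GiB), can exceed that headroom,
# so the run OOMs before the KV cache is even allocated.
```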

Expected Behavior

The example runs normally.

Steps To Reproduce

```
python examples/offline_inference_vision_language.py --model-type minicpmv
```

Environment

- OS: Ubuntu 20.04
- Python: 3.10
- Transformers: 4.43.3
- PyTorch: 2.3.1
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1

Anything else?

No response

HwwwwwwwH commented 4 months ago

You can try reducing max_model_len. If you are starting a server, add the --max_model_len 2048 argument; for local offline inference, pass it when initializing the LLM.
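For offline inference, that would look roughly like the sketch below. It is a minimal example of the suggestion above, not the full example script; prompt and image handling are omitted, and 2048 is just the value suggested here:

```python
from vllm import LLM

# Minimal sketch: cap max_model_len so the memory profiling pass and the
# KV cache fit next to the ~16 GB of weights on a 22 GiB A10.
llm = LLM(
    model="openbmb/MiniCPM-Llama3-V-2_5",
    trust_remote_code=True,   # required for MiniCPM-V's custom model code
    max_model_len=2048,       # down from the 8192 used in the failing run
)
```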