OpenBMB / MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Apache License 2.0

[BUG] The vLLM video inference example code is clearly wrong and raises an error in vllm/model_executor/models/minicpmv.py #469

Closed. younger-diao closed this 4 weeks ago.

younger-diao commented 1 month ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

1. The example code is wrong. 2. In vllm/model_executor/models/minicpmv.py, get_placeholder(images[i].size, i) at line 468 raises an error:

rank0: Traceback (most recent call last):
rank0:   File "/data/diaohf/Multi-Model/MiniCPM-V-2.6/demo_vllm_video.py", line 61, in <module>
rank0:     outputs = llm.generate({
rank0:   File "/data/diaohf/anaconda3/envs/MiniCPMV2.6/lib/python3.10/site-packages/vllm/utils.py", line 895, in inner
rank0:     return fn(*args, **kwargs)
rank0:   File "/data/diaohf/anaconda3/envs/MiniCPMV2.6/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 323, in generate
rank0:   File "/data/diaohf/anaconda3/envs/MiniCPMV2.6/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 552, in _validate_and_add_requests
rank0:   File "/data/diaohf/anaconda3/envs/MiniCPMV2.6/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 568, in _add_request
rank0:   File "/data/diaohf/anaconda3/envs/MiniCPMV2.6/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 654, in add_request
rank0:     processed_inputs = self.process_model_inputs(
rank0:   File "/data/diaohf/anaconda3/envs/MiniCPMV2.6/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 594, in process_model_inputs
rank0:     return self.input_processor(llm_inputs)
rank0:   File "/data/diaohf/anaconda3/envs/MiniCPMV2.6/lib/python3.10/site-packages/vllm/inputs/registry.py", line 202, in process_input
rank0:     return processor(InputContext(model_config), inputs)
rank0:   File "/data/diaohf/anaconda3/envs/MiniCPMV2.6/lib/python3.10/site-packages/vllm/model_executor/models/minicpmv.py", line 471, in input_processor_for_minicpmv
rank0:     get_placeholder(images[i].size, i)
rank0: KeyError: 0
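
For context, the KeyError: 0 looks consistent with the input processor indexing images[i] while the example passes a dict of options instead of a plain list of PIL images; this is only an assumption based on the traceback, illustrated by the hypothetical snippet below.

# Hypothetical illustration (not vLLM code): indexing a dict of options the way
# a list of images would be indexed reproduces the same failure mode.
images = {"images": ["frame0", "frame1"], "use_image_id": False, "max_slice_nums": 2}
images[0]  # KeyError: 0, matching the error in input_processor_for_minicpmv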

Expected Behavior

Fix the bug.

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

HwwwwwwwH commented 4 weeks ago

Video inference needs extra parameters that official vLLM does not support yet; we recommend using our forked code:

git clone https://github.com/OpenBMB/vllm
cd vllm
git checkout minicpmv
pip install -e .
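
As a quick sanity check after the editable install (a sketch, assuming the fork keeps the vllm package name):

import vllm
print(vllm.__file__)  # should point into the cloned OpenBMB/vllm checkout, not site-packages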

Also, could you share the top-level code you use to call vLLM for inference?

younger-diao commented 4 weeks ago

from transformers import AutoTokenizer
from decord import VideoReader, cpu
from PIL import Image
from vllm import LLM, SamplingParams
import time

MAX_NUM_FRAMES = 32

def encode_video(filepath):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(filepath, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    video = vr.get_batch(frame_idx).asnumpy()
    video = [Image.fromarray(v.astype('uint8')) for v in video]
    return video

MODEL_NAME = "/data/models/MiniCPM-V-2_6"  # or local model path
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
llm = LLM(
    model=MODEL_NAME,
    gpu_memory_utilization=1,
    trust_remote_code=True,
    max_model_len=4096
)

start = time.time()
video = encode_video("/data/tokyo_people.mp4")
messages = [{
    "role": "user",
    # One image placeholder per frame; the Chinese prompt means
    # "Describe the content of this video in detail".
    "content": "".join(["(<image>./</image>)"] * len(video)) + "\n详细描述一下这个视频的内容"
}]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

stop_tokens = ['<|im_end|>', '<|endoftext|>']
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]

sampling_params = SamplingParams(
    stop_token_ids=stop_token_ids,
    use_beam_search=True,
    temperature=0,
    top_p=0.8,
    # top_k=100,
    # repetition_penalty=1.05,
    max_tokens=64,
    best_of=3
)

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {
        "image": {
            "images": video,
            "use_image_id": False,
            "max_slice_nums": 1 if len(video) > 16 else 2
        }
    }
}, sampling_params=sampling_params)

finish = time.time()
print('Predicted in %f seconds.' % (finish - start))
print(outputs[0].outputs[0].text)

HwwwwwwwH commented 4 weeks ago

Using video requires some extra parameters. The calling convention that official vLLM supports is:

outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {
"image": image # or [image] * len
}
}, sampling_params=sampling_params)

Our video support introduces new parameters (as written in your code) that vLLM does not support yet, so we forked a repository to support them. Please try our repo and see whether it runs for you. If you don't want to clone and install the fork, you can also try setting the corresponding parameters ("use_image_id", "max_slice_nums") directly in config.json and preprocessor_config.json; then the call that does not pass those parameters can stay unchanged.
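
A minimal sketch of that config-file approach, assuming both keys are read from the top level of the two JSON files (hypothetical helper; adjust the path and values to your setup):

import json

MODEL_DIR = "/data/models/MiniCPM-V-2_6"  # local model path from the snippet above

# Write the video-related options into the model's config files instead of
# passing them per request; the keys and values here are assumptions for illustration.
for name in ("config.json", "preprocessor_config.json"):
    path = f"{MODEL_DIR}/{name}"
    with open(path) as f:
        cfg = json.load(f)
    cfg["use_image_id"] = False
    cfg["max_slice_nums"] = 1
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2, ensure_ascii=False)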

younger-diao commented 4 weeks ago

Thanks for the pointers.

seanzhang-zhichen commented 2 days ago

Could you provide working vLLM calling code? The code in the official Feishu doc doesn't run.