QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Apache License 2.0

How to run inference on Qwen2-VL-72B with vLLM? #518

Closed waltonfuture closed 2 weeks ago

waltonfuture commented 2 weeks ago

I'm using four 80 GB GPUs and the vLLM script provided in this repository, but it keeps running out of GPU memory and never starts. What changes do I need to make to the vLLM script?

```python
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "/path/Qwen2-VL-72B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)

sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    repetition_penalty=1.05,
    max_tokens=256,
    stop_token_ids=[],
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "porsche.jpg",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "What is the text in the illustration?"},
        ],
    },
]
# For video input, you can pass the following values instead:
# "type": "video",
# "video": "<video URL>",

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)
```
waltonfuture commented 2 weeks ago

Solved.

Mr-Loevan commented 3 days ago

```python
image_inputs, video_inputs = process_vision_info(messages)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs
```

How did you solve it?

lifrank1765 commented 3 days ago

Did you solve it by changing limit_mm_per_prompt={"image": 10, "video": 10} to limit_mm_per_prompt={"image": 1, "video": 1}?

waltonfuture commented 3 days ago

You need to add a tensor parallel parameter to the LLM initialization, set to the number of GPUs.
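Concretely, a minimal sketch of that fix (assuming the four-GPU setup from the question; `tensor_parallel_size` is vLLM's parameter for sharding the model's weights across GPUs, which is what keeps a 72B model within 80 GB per card):

```python
from vllm import LLM

MODEL_PATH = "/path/Qwen2-VL-72B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    # Shard the 72B model across all four GPUs instead of loading it on one.
    tensor_parallel_size=4,  # set to your number of GPUs
    limit_mm_per_prompt={"image": 10, "video": 10},
)
```

If memory is still tight, vLLM's `gpu_memory_utilization` and `max_model_len` arguments can also be lowered to reduce the KV-cache footprint.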