QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Apache License 2.0

What is the Qwen2-VL Max HF Demo config? #285

Open octavflorescu opened 1 month ago

octavflorescu commented 1 month ago

What is the Qwen2-VL Max HF Demo config? https://huggingface.co/spaces/Qwen/Qwen2-VL

In the demo from this repo, I found the setup for 7B, but is Qwen2-VL-Max the same? Could someone please provide the same setup as the demo, but in 'tutorial' mode (not HF demo/worker mode)?

Thank you!

For example, something like the script below, but configured so that all results match what the HF demo produces (the ??? marks are the values I'm asking about):

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen2-VL-72B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": ???, "video": ???},
)

sampling_params = SamplingParams(
    temperature=???,
    top_p=???,
    repetition_penalty=1.05,
    max_tokens=???,
    stop_token_ids=[],
)

# For video input, you can pass the following values instead:
# "type": "video",
# "video": "<video URL>",

processor = AutoProcessor.from_pretrained(MODEL_PATH)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
                "min_pixels": ???,
                "max_pixels": ???,
            },
            {"type": "text",
             "text": "Read OCR"},
        ],
    },
]

prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)

ShuaiBai623 commented 1 month ago

They are different. The demo at https://huggingface.co/spaces/Qwen/Qwen2-VL uses the Qwen2-VL-Max API, but it is the same model behind the scenes.

octavflorescu commented 1 month ago

Thank you for your answer! So is it the same model and same configuration as the demo in this repo, i.e. Max == 7B (not 72B)?

ShuaiBai623 commented 1 month ago

Max==72b

ShuaiBai623 commented 1 month ago

The HF demo code is here: https://huggingface.co/spaces/Qwen/Qwen2-VL/blob/main/app.py. The same setup as the demo is the default setup for the API: top_p=0.1, repetition_penalty=1.1.
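
Putting the answers together, the asker's script can be completed as follows. This is only a sketch: top_p=0.1 and repetition_penalty=1.1 come from the comment above, while temperature, max_tokens, limit_mm_per_prompt, and the pixel bounds are assumptions (the pixel values mirror the examples in this repo's README), so results may still differ slightly from the HF demo.

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen2-VL-72B-Instruct"  # Max == 72B, per the maintainer's answer

llm = LLM(
    model=MODEL_PATH,
    # Assumption: allow a handful of images/videos per prompt; the demo's
    # actual limits are not stated in this thread.
    limit_mm_per_prompt={"image": 5, "video": 2},
)

sampling_params = SamplingParams(
    temperature=0.01,        # assumption: near-greedy, consistent with a low top_p
    top_p=0.1,               # confirmed above: default API setup
    repetition_penalty=1.1,  # confirmed above: default API setup ("rp=1.1")
    max_tokens=512,          # assumption
    stop_token_ids=[],
)

processor = AutoProcessor.from_pretrained(MODEL_PATH)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
                "min_pixels": 256 * 28 * 28,   # assumption: README example value
                "max_pixels": 1280 * 28 * 28,  # assumption: README example value
            },
            {"type": "text", "text": "Read OCR"},
        ],
    },
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": mm_data}],
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)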