QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.

Strange behavior in multi turn video chatting #181

Open psych0v0yager opened 2 months ago

psych0v0yager commented 2 months ago
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from transformers import Qwen2VLProcessor
from awq.models.qwen2vl import Qwen2VLAWQForConditionalGeneration

# Paths and hyperparameters for quantization
# (quant_path and quant_config are defined but never used in this inference-only snippet;
# the checkpoint at model_path is already AWQ-quantized)
model_path = "/Qwen/Qwen2-VL-7B-Instruct-AWQ"
quant_path = "Qwen2-VL-7B-Instruct-AWQ"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

processor = Qwen2VLProcessor.from_pretrained(model_path)

model = Qwen2VLAWQForConditionalGeneration.from_pretrained(
    model_path, model_type="qwen2_vl", use_cache=False, torch_dtype="auto", device_map="auto"
)

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "video",
                "video": "video.mp4",
                "max_pixels": 128 * 128,
                "fps": 4.0,
            },
            {"type": "text", "text": """You are a highly advanced AI model with a specific purpose. Your purpose is to rate the quality of videos on a scale from 1-10. Be as descriptive as possible. The following video is a 10. Try to match this quality.
             """},
        ],
    },

    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "video1.mp4",
                "max_pixels": 128 * 128,
                "fps": 4.0,
            },
            {"type": "text", "text": """Can you rate the quality of this video."""},
        ],
    },

    {
        "role": "assistant",
        "content": [{"type": "text", "text": "4. It is decent. There is some blurriness in the top left corner"}]
    },

    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "video2.mp4",
                "max_pixels": 128 * 128,
                "fps": 4.0,
            },
            {"type": "text", "text": """Can you rate the quality of this video"""},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

print("\nDebug: Applied chat template:")
print(text)

image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
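# NOTE: temperature only takes effect when sampling is enabled (do_sample=True here
# or in the model's generation_config); otherwise generation falls back to greedy decoding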
generated_ids = model.generate(**inputs, max_new_tokens=128, temperature=0.7)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
# print(output_text)

print("\nDebug: Raw output text:")
print(output_text)

# Extract only the new content (assuming it's the last item in the output)
new_response = output_text[-1].strip()

print("\nNew response (rating for the second mesh):")
print(new_response)

Output (identical every time)

['4. It is decent. There is some blurriness in the top left corner']

New response (rating for the second mesh):
4. It is decent. There is some blurriness in the top left corner

The model repeats the first assistant message verbatim instead of analyzing the new video.

The Qwen blog post stated that multi-turn video chat is supported. Would I be better off using the vLLM implementation, or is multi-turn video currently not recommended with the 7B model?

jklj077 commented 2 months ago

Hi,

Based on my understanding:

  1. Supplying videos or images in the system message is not currently supported; they will be ignored.
  2. Across multiple turns, if the user query text is identical, the model may repeat its previous response, especially the smaller models.
  3. For this use case, you could try moving the rating examples into the user message, for example as below (not guaranteed to work; it may need some adjustments). The remaining preparation and generation steps would be the same as in your original script; a sketch follows the example.
    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": """You are a highly advanced AI model with a specific purpose. Your purpose is to rate the quality of videos on a scale from 1-10. Be as descriptive as possible."""},
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": """Try to match the following quality:"""},
                {
                    "type": "video",
                    "video": "video.mp4",
                    "max_pixels": 128 * 128,
                    "fps": 4.0,
                },
                {"type": "text", "text": """You should rate the previous video as `10`."""},
                {
                    "type": "video",
                    "video": "video1.mp4",
                    "max_pixels": 128 * 128,
                    "fps": 4.0,
                },
                {"type": "text", "text": """You should rate the previous video as `4`. It is decent. There is some blurriness in the top left corner"""},
                {
                    "type": "video",
                    "video": "video2.mp4",
                    "max_pixels": 128 * 128,
                    "fps": 4.0,
                },
                {"type": "text", "text": """Now, rate the quality of this video."""},
            ],
        }
    ]
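A minimal sketch of the remaining steps for this restructured message list, reusing the processor and model calls from the original script; the explicit do_sample and repetition_penalty settings are assumptions added here so that the sampling parameters actually apply, not something confirmed in this thread:

    # Build the prompt and collect the vision inputs for the few-shot messages above
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to("cuda")

    # Enable sampling explicitly so temperature actually applies; a mild
    # repetition_penalty may also reduce verbatim repeats across similar prompts
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        repetition_penalty=1.05,
    )
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print(output_text[0])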
psych0v0yager commented 2 months ago

Thank you for the suggestion.

It is interesting that Qwen prefers a single user prompt with multiple videos to simulate a multi-turn chat, compared to conventional LLMs, which use an alternating user/assistant scheme.

I figured that, since the Qwen models were not explicitly trained to "rate quality," a few-shot scheme would be the best way to teach them to perform well.

I was also considering a quick fine-tune, but I am not sure video fine-tuning is supported yet.

Davidwhw commented 3 weeks ago

I've also run into the need to enhance Qwen2-VL with few-shot examples. Could you share your final message setup for using multiple videos to simulate a multi-turn chat? I would appreciate the help.

gredin commented 5 days ago

@psych0v0yager I have a similar use case: evaluating the quality of views from a fisheye camera (lens cleanliness, reflections, sharpness). I've tried the same approach as you, providing a few examples (good quality / poor quality), but the model always seems to answer with little variation.

Have you achieved better results with Qwen2-VL since then?