psych0v0yager opened 2 months ago
Hi,
Based on what I understand, the messages would look like this:
```python
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a highly advanced AI model with a specific purpose. Your purpose is to rate the quality of videos on a scale from 1-10. Be as descriptive as possible.",
            },
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Try to match the following quality:"},
            {
                "type": "video",
                "video": "video.mp4",
                "max_pixels": 128 * 128,
                "fps": 4.0,
            },
            {"type": "text", "text": "You should rate the previous video as `10`."},
            {
                "type": "video",
                "video": "video1.mp4",
                "max_pixels": 128 * 128,
                "fps": 4.0,
            },
            {
                "type": "text",
                "text": "You should rate the previous video as `4`. It is decent. There is some blurriness in the top left corner",
            },
            {
                "type": "video",
                "video": "video2.mp4",
                "max_pixels": 128 * 128,
                "fps": 4.0,
            },
            {"type": "text", "text": "Now, rate the quality of this video."},
        ],
    },
]
```
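For anyone landing here: a minimal sketch of running that message list end to end, following the standard transformers + qwen-vl-utils pattern. The model ID and decoding settings are my assumptions, not anything confirmed in this thread.

```python
# Sketch: run the few-shot video messages through Qwen2-VL.
# Assumes transformers (with Qwen2-VL support) and qwen-vl-utils are installed.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Render the chat template, then extract the video tensors from `messages`.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding so only the new rating remains.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```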
Thank you for the suggestion.
It is interesting that Qwen prefers a single user prompt containing multiple videos to simulate a multi-turn chat, compared to conventional LLMs, which use an alternating user/assistant scheme.
I figured that since the Qwen models were not explicitly taught to "rate quality", a few-shot scheme would be the best way to teach them to perform well.
I was also considering a quick fine-tune, but I am not sure video fine-tuning is supported yet.
I've also come across the situation of enhancing Qwen2-VL with few-shot prompting. Can you share your final message settings for using multiple videos to simulate a multi-turn chat?
I would appreciate it if you could help.
@psych0v0yager I have a similar use case: evaluating the quality of views from a fisheye camera (lens cleanliness, reflections, sharpness). I've tried the same approach as you, providing a few examples (good quality / poor quality), but the model always answers with little variation.
Have you achieved better results with Qwen2-VL since then?
Output (every time): the model repeats the first assistant message verbatim instead of analyzing the new video.
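If it helps, a hedged sketch of one mitigation, assuming the transformers generation path from the snippet above; all values are illustrative starting points, not tuned settings:

```python
# Sketch: stochastic decoding to reduce verbatim repetition of earlier turns.
generated_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.7,         # soften the token distribution
    top_p=0.9,               # nucleus sampling
    repetition_penalty=1.1,  # discourage copying the few-shot examples
)
```

Sampling will not fix a prompt the model is over-anchoring on, but it makes verbatim copying of an earlier example less likely.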
The Qwen blog post stated that multi-turn video chat is supported. Would I be better off using the vLLM implementation, or is multi-turn video currently not recommended with the 7B model?
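For reference, a minimal sketch of a single-video request through vLLM's offline API, assuming a vLLM build with Qwen2-VL video support; the frame loader and all parameter values are my own illustration, not from the Qwen docs:

```python
# Sketch: Qwen2-VL video inference via vLLM offline generation.
# Assumes vLLM with Qwen2-VL video support and opencv-python installed.
import cv2
import numpy as np
from vllm import LLM, SamplingParams


def load_video_frames(path: str, num_frames: int = 16) -> np.ndarray:
    """Illustrative helper: sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idxs = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for i in idxs:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)  # shape: (num_frames, H, W, 3), uint8


llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", limit_mm_per_prompt={"video": 1})
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# Qwen2-VL chat-template prompt with a single video placeholder.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>"
    "Rate the quality of this video.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

video_frames = load_video_frames("video2.mp4", num_frames=16)
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": video_frames}},
    sampling_params,
)
print(outputs[0].outputs[0].text)
```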