Hi authors, thanks for your great work, which contributes a lot to long video understanding!
I'm repeating your experiments on LLaVA-NeXT-Video. I have run into some problems and would like to know how you solved them.
Would you mind providing details on which LLaVA-NeXT-Video model you are testing: lmms-lab/LLaVA-NeXT-Video-34B-DPO, lmms-lab/LLaVA-NeXT-Video-7B-DPO, or the models without DPO?
I experimented with lmms-lab/LLaVA-NeXT-Video-7B-DPO first and found that the current instruction doesn't require the assistant to answer with exactly one option. The model sometimes replies with a paragraph of reasoning, which makes extracting the answer difficult. Would you mind providing your answer-extraction script, or shedding some light on how you evaluate from the raw responses? (Or are you evaluating in a perplexity-based mode, i.e., concatenating each of the four options with the instruction separately and picking the one with the lower perplexity?)
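To be concrete, by "perplexity-based" I mean something like the following rough sketch (text-only for illustration, ignoring the video input; the model name and prompt are placeholders I made up):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model just to illustrate the scoring idea.
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

def option_loss(instruction: str, option: str) -> float:
    # Score "instruction + option" by the mean LM loss (lower loss = lower perplexity).
    ids = tokenizer(instruction + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

options = ["(A) ...", "(B) ...", "(C) ...", "(D) ..."]
predicted = min(options, key=lambda o: option_loss("QUESTION AND INSTRUCTION HERE", o))
```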
When I evaluate on LVBench with the official LLaVA-NeXT-Video repo, I find that some videos cannot be read because the decord library does not currently support the AV1 codec. I edited the video2dataset package as described in this issue and re-downloaded the LVBench videos using your download.sh, but four videos still fail to be processed. I would really appreciate it if you could share experiment details, such as how you downloaded the videos and dealt with the AV1 codec.
There's a very interesting point in Section 4.4 of the paper about "using large language models (LLMs) to filter question-answer pairs", but I'm a bit confused about what LLM filtering means. Does it mean providing only the instruction and the question as input, without the video, and checking which option the LLM guesses? It's hard to understand how this method can get an even higher score than the input with the video. Would you mind briefly clarifying this interesting finding? Thanks in advance for your helpful reply.
You can use this script to get the final option; we do not use a perplexity-based mode.
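In case the link is hard to access, a minimal extraction sketch along these lines would work (the exact regex and fallback behavior in our script may differ):

```python
import re

def extract_option(response: str) -> str:
    """Pull the predicted option letter (A-D) out of a raw model response."""
    # Common patterns: "(B)", "B.", "Answer: B", "The answer is B".
    match = re.search(r"\(([A-D])\)", response)
    if match:
        return match.group(1)
    match = re.search(r"\b([A-D])\b", response)
    if match:
        return match.group(1)
    return ""  # could not parse; treat as incorrect or re-prompt

print(extract_option("The correct answer is (B) He is dead."))  # -> B
```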
We save all the videos as mp4 files. You can first convert the videos with ffmpeg, then read them with decord.
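For example, a rough sketch of the conversion step (file names are placeholders, and the exact ffmpeg flags you need may depend on the source files):

```python
import subprocess
from decord import VideoReader

# Re-encode an AV1 video to an H.264 mp4 so decord can read it (audio dropped with -an).
subprocess.run(
    ["ffmpeg", "-y", "-i", "input.webm", "-c:v", "libx264", "-an", "output.mp4"],
    check=True,
)

vr = VideoReader("output.mp4")
print(len(vr))  # number of decodable frames
```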
We found that some questions are easy for the LLM to answer correctly without seeing the video, so we removed them from the original dataset. Below is an example that was filtered out of the original dataset:
{
'question': 'Why is the whole body of a man covered with white cloth?\n(A) He is sleepy\n(B) He is dead\n(C) He is tired\n(D) He is married',
'answer': 'B'
}
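Concretely, the filtering works roughly like the sketch below: feed only the question and options to a text-only LLM and drop any question it can answer without the video. The prompt wording and model name here are illustrative, not necessarily the exact ones we used.

```python
from openai import OpenAI

client = OpenAI()

def llm_guess(question_with_options: str) -> str:
    # Ask a text-only LLM to answer without seeing the video.
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Answer the multiple-choice question with a single letter (A/B/C/D)."},
            {"role": "user", "content": question_with_options},
        ],
    )
    return resp.choices[0].message.content.strip()[:1]

def keep(qa: dict) -> bool:
    # Keep a QA pair only if the blind LLM fails to guess the ground-truth answer.
    return llm_guess(qa["question"]) != qa["answer"]
```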