EvolvingLMMs-Lab / LongVA

Long Context Transfer from Language to Vision

Thank you for your work. How can we reproduce the V-NIAH benchmark results? #7

Open fistyee opened 1 week ago

jzhang38 commented 1 week ago

We cannot provide the haystack video ourselves because we use an actual movie in our evaluation. Specifically, I use the movie "孤注一掷" (No More Bets) as the haystack :)

The rest of the instructions are in the README:

https://github.com/EvolvingLMMs-Lab/LongVA?tab=readme-ov-file#v-niah-evaluation

Let me know if you encounter any problems.

fistyee commented 1 week ago

Thanks. Could you describe the duration distribution of the clips extracted from the movie? And could you provide more query prompts for LongVA as a reference?

jzhang38 commented 1 week ago

> Could you describe the duration distribution of the clips extracted from the movie?

We do not use clips from the movie. We load the entire movie as the haystack video and sample frames at 1 fps, as stated in the paper and also reflected in our code:

https://github.com/EvolvingLMMs-Lab/LongVA/blob/efc27fdcc9cdc411dee8af296aa1a34ebb29d445/vision_niah/produce_haystack_embedding.py#L12-L21
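
For reference, here is a minimal sketch of what 1 fps haystack sampling can look like. It uses `decord` and a placeholder video path; the linked `produce_haystack_embedding.py` is the authoritative version, so treat this only as an illustration of the sampling scheme:

```python
import numpy as np
from decord import VideoReader, cpu

def sample_frames_1fps(video_path, max_frames=None):
    """Decode a long video and keep roughly one frame per second."""
    vr = VideoReader(video_path, ctx=cpu(0))
    fps = vr.get_avg_fps()                 # native frame rate of the movie
    step = max(1, round(fps))              # stride of ~1 second
    indices = np.arange(0, len(vr), step)  # one frame index per second of video
    if max_frames is not None:
        indices = indices[:max_frames]
    frames = vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3) uint8
    return frames

# e.g. frames = sample_frames_1fps("haystack_movie.mp4")  # path is a placeholder
```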

> Could you provide more query prompts for LongVA as a reference?

I am not sure what you mean by "query prompts". If you are looking for the needle images & questions, they are here: https://huggingface.co/datasets/lmms-lab/v_niah_needles. If you are looking for the prompt template, it is here: https://github.com/EvolvingLMMs-Lab/LongVA/blob/efc27fdcc9cdc411dee8af296aa1a34ebb29d445/vision_niah/eval_vision_niah.py#L48-L51
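
For anyone else reproducing this, a quick sketch of pulling the needle set with the Hugging Face `datasets` library; the split and column names are not specified in this thread, so check the dataset card for the actual schema:

```python
from datasets import load_dataset

# Needle images + questions used for V-NIAH, as linked above.
needles = load_dataset("lmms-lab/v_niah_needles")

print(needles)          # lists the available splits and column names
first_split = next(iter(needles.values()))
print(first_split[0])   # one needle record: image plus its question/answer fields
```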