LLaVA-VL / LLaVA-NeXT


Request for NExTQA Dataset Evaluation Prompt and More Results on Challenging Datasets for Fair Comparison #3

Open patrick-tssn opened 3 months ago

patrick-tssn commented 3 months ago

To my knowledge, the videos in the NExTQA dataset are relatively short, with an average length of 44 seconds, and a static bias has been noted for the ActivityNet QA dataset [1]. Could you present further results on more demanding datasets, such as EgoSchema [2], for a fairer comparison? Additionally, could I ask you to provide the evaluation prompt used for the NExTQA dataset?

[1] Lei, Jie, et al. "Revealing Single Frame Bias for Video-and-Language Learning." arXiv:2206.03428 (2022).
[2] Mangalam, Karttikeya, et al. "EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding." arXiv:2308.09126 (2023).

ZhangYuanhan-AI commented 3 months ago

Thanks for your advice. The evaluation on EgoSchema is ongoing.

The prompt for NExTQA is: "Answer the question using several words or phrase."
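
In case it helps others reproducing the evaluation, below is a minimal sketch of how this prompt might be appended to each NExTQA question before querying the model; the function name and the example question are illustrative assumptions, not the repository's actual evaluation code.

```python
# Minimal sketch (assumed, not the repo's evaluation script): append the
# stated NExTQA prompt suffix to each question before sending it to the model.

NEXTQA_PROMPT = "Answer the question using several words or phrase."

def build_query(question: str) -> str:
    """Combine a NExTQA question with the evaluation prompt suffix."""
    return f"{question.strip()}\n{NEXTQA_PROMPT}"

if __name__ == "__main__":
    # Hypothetical example question; real items come from the NExTQA annotations.
    print(build_query("What did the man do after opening the fridge?"))
```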