jpan72 opened 1 month ago
Hello authors,

What validation dataset did you use during the training epochs of stages 1, 2, and 3, respectively? I believe validation accuracy is important for monitoring model convergence and avoiding issues like overfitting.

Thank you!

Good question! It's hard to evaluate the performance directly.

For stage 1, which works like BLIP-2 stage 1, you can use retrieval tasks to verify it. Stage 2 only uses video caption or image caption data, so the model cannot yet follow instructions; we verify it on a few selected examples by checking whether the output video/image captions are reasonable. For stage 3, we use MVBench.
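For context, a BLIP-2 style stage-1 retrieval check usually reports recall@k over a similarity matrix between paired query and candidate embeddings. Below is a minimal sketch of such a metric; the `recall_at_k` helper, the toy similarity matrix, and the chosen k values are illustrative assumptions, not the repository's actual evaluation code.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5)):
    """sim[i, j] = similarity between query i and candidate j.
    The ground-truth match for query i is candidate i."""
    # Sort candidates for each query by descending similarity.
    ranks = np.argsort(-sim, axis=1)
    # Position (0-based rank) of the correct candidate per query.
    gt_pos = np.argmax(ranks == np.arange(sim.shape[0])[:, None], axis=1)
    return {k: float(np.mean(gt_pos < k)) for k in ks}

# Toy example: 3 queries; the correct match is ranked first for
# queries 0 and 1, but only second for query 2.
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.0, 0.8, 0.3],
    [0.4, 0.6, 0.5],
])
print(recall_at_k(sim))  # R@1 = 2/3, R@5 = 1.0
```

Monitoring R@1/R@5 on a held-out retrieval set across epochs gives a convergence signal for stage 1 even without instruction-following ability.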