EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

Question on evaluation across multiple tasks #86

Closed AtsuMiyai closed 1 month ago

AtsuMiyai commented 1 month ago

Hello. Thanks for your great effort on this comprehensive codebase!

I'm now trying to integrate our MM-UPD Bench (https://arxiv.org/abs/2403.20331) into this codebase.

I have a question about whether we can evaluate across multiple tasks. Specifically, calculating UPD accuracy requires evaluating across multiple tasks (e.g., the AAD version and the Standard version of a dataset). To phrase the question in terms of MMBench: is it possible to evaluate on both the MMBench_en and MMBench_cn datasets simultaneously?

For example, if a question with the same index appears in both MMBench_en and MMBench_cn, I want to count it as correct only if the predictions for that index are correct in both tasks. A rough sketch of what I mean is below.
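To make the joint scoring concrete, here is a minimal sketch. It assumes each task can dump per-sample results that are keyed by a shared question index; the `load_results` helper and the file format are hypothetical placeholders, not existing lmms-eval APIs.

```python
# Minimal sketch of cross-task ("joint") accuracy, assuming each task's
# per-sample results can be loaded as {index: is_correct}. The loader and
# file layout here are hypothetical, not part of lmms-eval.
import json


def load_results(path: str) -> dict[str, bool]:
    """Load per-sample correctness keyed by question index (hypothetical format)."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    return {str(r["index"]): bool(r["correct"]) for r in records}


def joint_accuracy(results_a: dict[str, bool], results_b: dict[str, bool]) -> float:
    """A sample counts as correct only if BOTH tasks answer it correctly."""
    shared = results_a.keys() & results_b.keys()
    if not shared:
        return 0.0
    hits = sum(1 for idx in shared if results_a[idx] and results_b[idx])
    return hits / len(shared)


# Example usage with the two MMBench splits mentioned above (paths illustrative):
# acc = joint_accuracy(load_results("mmbench_en.json"), load_results("mmbench_cn.json"))
```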

If there is any other way to achieve the above, please let me know🙇

Thanks for your cooperation.