OpenGVLab / LAMM

[NeurIPS 2023 Datasets and Benchmarks Track] LAMM: Multi-Modal Large Language Models and Applications as AI Agents
https://openlamm.github.io/

how to ensure the fairness and stability of this test #57

Closed sunwhw closed 7 months ago

sunwhw commented 7 months ago

Thanks for your great work!

  1. Have you tested instructblip-flant5 with ChEF? For the same task, why are the flant5 results so different from vicuna's? For example, with "src/config/ChEF/scenario_recipes/ScienceQA/default.yaml", vicuna7b gets 23.03, but flant5xl only gets 7.23.
  2. Why is this reversed after changing the config yaml? For example, with "scenario_recipes/LAMM/CIFAR10.yaml", flant5 performs better than vicuna, which is quite strange.

So what do you suggest to ensure the fairness and stability of this test?
Coach257 commented 7 months ago
  1. About the results on ScienceQA: I have tested instructblip-flant5 and instructblip-vicuna with config/ChEF/scenario_recipes/ScienceQA/default.yaml, and the results are 48.68 and 53.99 respectively. It seems that you may have some misalignment in configuration or model checkpoints compared to us, resulting in the inconsistent results. Both checkpoints were downloaded from InstructBLIP under the default setting. Maybe you can provide more details so we can locate the problem.
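As a quick sanity check for configuration drift, a small script that diffs two recipe YAMLs can help localize the mismatch. This is only a generic sketch, not part of ChEF; the paths in the example call are placeholders and only PyYAML is assumed.

```python
# Generic sketch (not part of ChEF): diff two recipe YAMLs to spot
# configuration drift. The paths in the example call are placeholders.
import yaml

def flatten(cfg, prefix=""):
    """Flatten nested dicts into dotted keys for easy comparison."""
    items = {}
    for key, value in cfg.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            items.update(flatten(value, prefix=path + "."))
        else:
            items[path] = value
    return items

def diff_configs(path_a, path_b):
    with open(path_a) as fa, open(path_b) as fb:
        a, b = flatten(yaml.safe_load(fa)), flatten(yaml.safe_load(fb))
    for key in sorted(set(a) | set(b)):
        if a.get(key) != b.get(key):
            print(f"{key}: {a.get(key, '<missing>')} != {b.get(key, '<missing>')}")

# Example with placeholder paths:
# diff_configs("config/ChEF/scenario_recipes/ScienceQA/default.yaml",
#              "/path/to/your/ScienceQA/default.yaml")
```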

  2. About the results on CIFAR10: CIFAR10 and ScienceQA are two completely different tasks, and it is reasonable to expect performance differences between models across these tasks. We believe there is a noticeable tug-of-war problem across tasks for MLLMs.

  3. About fairness and stability: Since MLLMs are free-form question-answering models, variations in results under different settings are expected. For example, different prompts can lead to different results. This challenge is one of the motivations behind our work on ChEF, where we achieved relatively stable results using the PPL inferencer, as demonstrated in our paper. It is important to note that models trained on different data or prompts may accept different inputs. By adjusting the query pool and providing inputs that the model can process effectively, you can reach the intended performance. You can try different prompts, test the model's results, and select the best one as the final outcome. We believe that evaluating the model under such a setting is also fair and reasonable.
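For readers unfamiliar with the PPL inferencer mentioned above: the idea is to rank the candidate answers by the model's likelihood of their tokens given the prompt, instead of sampling free-form text. Below is only an illustrative sketch with a generic HuggingFace causal LM, not ChEF's implementation; the model name, prompt, and options are placeholders, and it assumes the prompt's tokenization is a prefix of the prompt+option tokenization.

```python
# Illustrative sketch of perplexity-style (PPL) answer ranking with a generic
# HuggingFace causal LM. This is NOT ChEF's implementation; the model name,
# prompt, and options are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; ChEF scores with the MLLM under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def option_nll(prompt: str, option: str) -> float:
    """Average negative log-likelihood of the option tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                     # [1, T, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # predictions for tokens 1..T-1
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.shape[0]), targets]
    start = prompt_ids.shape[1]                # index of the first option token
    return -token_lp[start - 1:].mean().item() # score only the option tokens

prompt = "Question: Which gas do plants absorb? Answer: "
options = ["carbon dioxide", "oxygen", "nitrogen"]
print(min(options, key=lambda o: option_nll(prompt, o)))   # lowest NLL wins
```

Because the score is a deterministic function of the logits, this avoids the run-to-run variance that sampling-based free-form answering introduces.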

sunwhw commented 7 months ago

Thanks for your detailed reply! I get the following results with config/ChEF/scenario_recipes/Dataset/default.yaml; they almost align with the leaderboard, but there is some offset compared to your repo.

  1. Is this normal?
  2. Have you ever used the same model and the same test setup but got different results?

[screenshot of the results]

P.S. For the instructblip_flant5 or vicuna .pth checkpoints, I keep the same settings as LAVIS's InstructBLIP config. For example, for instructblip_flant5xxl:

[screenshot of the instructblip_flant5xxl config]
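For anyone replicating this setup, here is a minimal sketch of loading the same checkpoint through LAVIS. It assumes LAVIS's load_model_and_preprocess loader and its registered model names (blip2_t5_instruct / flant5xxl), which may differ across LAVIS versions.

```python
# Minimal sketch, assuming LAVIS's load_model_and_preprocess loader and its
# registered names (blip2_t5_instruct / flant5xxl); these may vary across
# LAVIS versions.
import torch
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_t5_instruct",   # InstructBLIP with a Flan-T5 language model
    model_type="flant5xxl",     # corresponds to the instructblip_flant5xxl checkpoint
    is_eval=True,
    device=device,
)
```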
Coach257 commented 7 months ago

We understand that there may be gaps between different test runs, and we consider this reasonable. Even with the same model and setup, we can still obtain different results. Although the PPL inferencer does no random sampling of output tokens, there can still be some variation in the results; we believe the model still exhibits some level of randomness.

Of course, it is reasonable to expect randomness in tasks like VQA, since the CoT content is generated with do_sample=True.
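To make the source of that randomness concrete: with do_sample=True, each run can yield a different CoT unless the seed is fixed, while greedy decoding is deterministic for a given model and input. The following is a generic HuggingFace sketch; the model name and prompt are placeholders, not ChEF code.

```python
# Generic sketch of where sampling randomness comes from; not ChEF code.
# Model name and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tokenizer("Let's think step by step:", return_tensors="pt")

# Sampling: a different continuation on every run unless the seed is fixed.
torch.manual_seed(0)
sampled = model.generate(**inputs, do_sample=True, max_new_tokens=30,
                         pad_token_id=tokenizer.eos_token_id)

# Greedy decoding: deterministic for a given model and input.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=30,
                        pad_token_id=tokenizer.eos_token_id)

print(tokenizer.decode(sampled[0], skip_special_tokens=True))
print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```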

sunwhw commented 7 months ago

Oh, thanks for your help and for ChEF's clean code and framework! I've learned a lot!