About the results on ScienceQA
I have tested instructblip-flant5 and instructblip-vicuna with config/ChEF/scenario_recipes/ScienceQA/default.yaml; the results are 48.68 and 53.99, respectively.

It seems that you may have some misalignment in configuration or model checkpoints compared to us, resulting in inconsistent results. Both checkpoints were downloaded from InstructBLIP under the default setting. Perhaps you can provide more details so we can locate the problem.
About the results on CIFAR10

CIFAR10 and ScienceQA are two completely different tasks, and it is reasonable to expect performance differences between models on them. We believe there is a noticeable tug-of-war problem across different tasks for MLLMs.
About fairness and stability

As MLLMs are free-form question-answering models, variation in results under different settings is expected. For example, different prompts can lead to different results. This challenge is one of the motivations behind our work on ChEF, where we achieved relatively stable results using the PPL inferencer, as demonstrated in our paper. Note that models trained on different data or prompts may have different acceptable inputs. By adjusting the query pool and providing inputs the model can process effectively, you can achieve the desired performance. You can try different prompts to test the model's results and select the best one as the final outcome. We believe that evaluating the model under such a setting is also fair and reasonable.
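For readers unfamiliar with PPL-style inference: instead of free-form generation, the model scores each fixed answer candidate by its likelihood and picks the lowest-perplexity one, which removes sampling noise from the decoding step. Below is a minimal, self-contained sketch of this idea; the toy log-probability table and function names are illustrative assumptions, not ChEF's actual API.

```python
import math

def sequence_logprob(token_logprobs, tokens):
    """Sum per-token log-probabilities for a candidate answer.

    token_logprobs is a dict token -> log-probability, a toy stand-in
    for a real LM's conditional distribution (assumption, not ChEF code).
    Unknown tokens get a small floor probability.
    """
    return sum(token_logprobs.get(t, math.log(1e-6)) for t in tokens)

def ppl_rank(token_logprobs, candidates):
    """Return the candidate with the lowest perplexity.

    Perplexity is exp of the negative mean log-probability; normalizing
    by length avoids biasing the ranking toward shorter answers.
    """
    def perplexity(tokens):
        return math.exp(-sequence_logprob(token_logprobs, tokens) / len(tokens))
    return min(candidates, key=perplexity)

# Toy "model": it assigns higher probability to the tokens of option (B).
toy_logprobs = {"the": -0.5, "mitochondria": -0.7, "nucleus": -3.0}
options = [
    ["the", "nucleus"],       # option (A)
    ["the", "mitochondria"],  # option (B)
]
best = ppl_rank(toy_logprobs, options)
```

Because `ppl_rank` is a pure argmin over deterministic scores, repeated runs on the same inputs give the same answer, which is why a PPL inferencer is more stable than free-form decoding.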
Thanks for your detailed reply!
I get the following results with config/ChEF/scenario_recipes/Dataset/default.yaml. They almost align with the leaderboard, but there is some offset compared to your repo.

1. Is this normal?
2. Have you ever used the same model and the same test setup but obtained different results?
P.S. For the instructblip_flant5 and vicuna checkpoints (.pth), I kept the same settings as LAVIS's instruction config. For example, for instructblip_flant5xxl:
We understand that there may be gaps between different test runs, and we consider this reasonable. Even with the same model and setup, results can differ. Although the PPL inferencer does no random sampling of output tokens, there can still be some variation in the results; the model still exhibits some level of randomness.
Of course, it is reasonable to expect randomness in tasks like VQA, since the generated CoT content is sampled by setting do_sample=True.
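To make the distinction concrete, here is a toy sketch of why do_sample=True produces run-to-run variation while greedy decoding does not. The distribution and function names are hypothetical illustrations of the general mechanism, not the actual generation code used in this repo.

```python
import random

def next_token(probs, do_sample, rng):
    """Choose the next token from a probability distribution.

    do_sample=False: greedy argmax, deterministic for a fixed distribution.
    do_sample=True:  multinomial sampling, which varies run to run.
    """
    if not do_sample:
        return max(probs, key=probs.get)
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy next-token distribution (illustrative assumption).
dist = {"yes": 0.6, "no": 0.3, "maybe": 0.1}
rng = random.Random(0)

greedy = [next_token(dist, do_sample=False, rng=rng) for _ in range(5)]
sampled = [next_token(dist, do_sample=True, rng=rng) for _ in range(5)]
```

Greedy decoding returns the same token every time, while sampling can return any token with nonzero probability, which is the source of variation in CoT-style generation.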
Oh, thanks for your help and for ChEF's clean code and framework! I've learned a lot!
Thanks for your great work!