OpenGVLab / LAMM

[NeurIPS 2023 Datasets and Benchmarks Track] LAMM: Multi-Modal Large Language Models and Applications as AI Agents
https://openlamm.github.io/

performance on evaluating instructblip-vicuna7b and instructblip-flant5xxl #52

Closed sunwhw closed 7 months ago

sunwhw commented 8 months ago

Thanks for your work! I also get a result of 0.06593951412989589 when running instructblip-vicuna7b on 2d-ScienceQA-LAMM. But when I change the model architecture to instructblip-flant5xxl, I get 0.66. Why is there such a big difference? Also, InstructBLIP's ScienceQA result on the leaderboard is 0.5518, so which architecture was that result obtained with?

Coach257 commented 8 months ago

Thanks for your interest. The leaderboard result comes from instructblip-vicuna7b, and we use ChEF for evaluation. Run:

python eval.py --model_cfg=config/ChEF/models/instructblip.yaml --recipe_cfg=config/ChEF/scenario_recipes/ScienceQA/default.yaml

sunwhw commented 8 months ago

Thanks! I've reproduced some results. But I still want to confirm whether the leaderboard results are all based on the "default.yaml" of the corresponding dataset. For example, is the FSC147 result obtained from "src/config/ChEF/scenario_recipes/FSC147/default.yaml"? I had assumed it came from "src/config/ChEF/scenario_recipes/LAMM/FSC147.yaml", but the results were very different. Is there a detailed explanation of where the leaderboard results come from? If not, could you please sync all the config details to the leaderboard? Since the benchmark is named "LAMM", I initially ran the configs under the "LAMM" folder, but the results differed greatly from the leaderboard.

Coach257 commented 7 months ago

Thanks for your suggestion! The results on the leaderboard are all based on ChEF/scenario_recipes/xxx/default.yaml. For users who want to run the original LAMM benchmark evaluation, we also keep the original evaluation configs in ChEF/scenario_recipes/LAMM/xxx.yaml. Note that these evaluation settings are not recommended, as we believe ChEF provides a fairer and more reasonable evaluation pipeline.
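
For example, taking FSC147, the leaderboard setting and the original LAMM-style setting would differ only in the recipe config passed to eval.py (a sketch; reusing the instructblip model config from the command above is an assumption, substitute your own model config as needed):

# Leaderboard setting: ChEF default recipe for FSC147
python eval.py --model_cfg=config/ChEF/models/instructblip.yaml --recipe_cfg=config/ChEF/scenario_recipes/FSC147/default.yaml

# Original LAMM-style setting, kept for reference; not what the leaderboard reports
python eval.py --model_cfg=config/ChEF/models/instructblip.yaml --recipe_cfg=config/ChEF/scenario_recipes/LAMM/FSC147.yaml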

We understand there may be some confusion between the results on the leaderboard and the different configs supported in our code. We will provide a more detailed explanation and clean up the configs soon. Thanks again for your suggestion.