EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

[Feature] Break down response generation and response validation that requires external API #163

Open zwcolin opened 2 months ago

zwcolin commented 2 months ago

Thanks for the awesome codebase! I wonder if you have a sample script that would allow one to break model evaluation into two separate procedures: generating responses and validating responses. Many benchmarks (e.g., mmbench, mmvet, hallusionbench, mathvista) use GPT-assisted response validation. However, many compute clusters do not grant Internet access to compute nodes, and conversely do not grant compute resources to nodes with Internet access. It would therefore be helpful to split model response generation and response validation into two phases when that better fits users' workflows. If such functionality is already supported, could you provide an example script? Or could you implement this functionality? Thank you so much!

Luodian commented 2 months ago

Yes, I think we should do this, but we currently don't have the bandwidth for it.

We would welcome a contribution if anyone is willing to implement this. Many thanks in advance!

zwcolin commented 2 months ago

Thanks for the message, Bo! I raised this issue because I'm trying to integrate our recently released benchmark, which already includes results for 35 models (https://charxiv.github.io/), into your repo, but I could not test my integration due to not being able to separate model generation from response grading into two separate runs. I wouldn't mind helping implement this functionality, but it'd be great if someone from your team could help coordinate. I have also fixed some bugs in your process_docs in my fork. Let me know if we can make this happen, and thanks for all the effort on this awesome project!

kcz358 commented 2 months ago

Hi @zwcolin, do you think the from_log model can help you with this? You can first generate output logs using the --predict-only flag, and then use the from_log model to perform aggregation and result processing on those logs.

You can check the last section of this blog for more details.
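To make the two-phase workflow concrete, here is a minimal sketch of what it might look like on the command line. This assumes the flag spellings and arguments described above; the exact model name, `--model_args` format, and output paths are illustrative assumptions, not verified invocations (check the repo's README and the linked blog for the authoritative flags).

```shell
# Phase 1 (compute node, no Internet access): generate responses only.
# --predict-only skips metric computation, so no external grading API is called.
python -m lmms_eval \
    --model llava \
    --tasks mmvet \
    --predict-only \
    --log_samples \
    --output_path ./logs/

# Phase 2 (login node with Internet access): replay the saved logs.
# The from_log model re-runs aggregation and result processing over the
# logged responses, which is where GPT-assisted grading happens.
python -m lmms_eval \
    --model from_log \
    --model_args logs=./logs/ \
    --tasks mmvet \
    --output_path ./results/
```

The key point is that only phase 2 needs network access, so phase 1 can run on air-gapped compute nodes and the logs can be copied over afterwards.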

Luodian commented 2 months ago

@zwcolin Great! You can join our Discord for a closer discussion with our team.

https://discord.gg/zdkwKUqrPy

And feel free to ping me; I'm BobaGPT in that channel.

zwcolin commented 2 months ago

> Hi @zwcolin, do you think the from_log model can help you with this? You can first generate output logs using the --predict-only flag, and then use the from_log model to perform aggregation and result processing on those logs.
>
> You can check the last section of this blog for more details.

Hi Kaichen, thanks for the pointer! I'll try this later today and see if it works.

> @zwcolin Great! You can join our Discord for a closer discussion with our team.
>
> https://discord.gg/zdkwKUqrPy
>
> And feel free to ping me; I'm BobaGPT in that channel.

I joined your Discord a few days ago and I'll ping you later :)