EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

[Discussion/Feedback] VLM + Multimodal benchmarking #1155

Open · haileyschoelkopf opened this issue 8 months ago

haileyschoelkopf commented 8 months ago

This is an issue to discuss the feasibility / desirability of including and supporting multimodal benchmarks in lm-eval. With the rise of multimodal (V)LMs and of benchmarks like MMMU designed to test them, it's worth discussing both whether lm-eval is easily extensible to these evaluations and whether we want them in scope for the library.
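(Editorial illustration, not part of the issue: the extensibility question largely comes down to the shape of an evaluation request. The sketch below shows one hypothetical way a multimodal request and backend interface could look; the names `MultimodalRequest` and `MultimodalLM` are invented for illustration and are not part of lm-eval's API.)

```python
# Hypothetical sketch only: MultimodalRequest / MultimodalLM are illustrative
# names, NOT lm-eval API. The point is that multimodal benchmarks mainly change
# the request shape (text + images); scoring and aggregation could stay as-is.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MultimodalRequest:
    """One evaluation request carrying prompt text plus optional images."""
    context: str                                        # e.g. an MMMU question
    images: List[bytes] = field(default_factory=list)   # raw image payloads
    continuation: Optional[str] = None                  # target for loglikelihood tasks


class MultimodalLM(ABC):
    """Minimal interface a VLM backend would need to expose to the harness."""

    @abstractmethod
    def generate(self, requests: List[MultimodalRequest]) -> List[str]:
        """Free-form generation for generation-style tasks."""

    @abstractmethod
    def loglikelihood(self, requests: List[MultimodalRequest]) -> List[float]:
        """log P(continuation | context, images) for multiple-choice tasks."""
```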

ashvinnihalani commented 5 months ago

There is already a fork of lm-evaluation-harness built by the LLaVA team, called lmms-eval, that is focused on VLM evaluation and can serve as a PoC. For what it's worth, I already have a private fork of lm-eval that adds LLaVA support and works with MMMU and the LLaVA 1.5 7B model through the original codebase. I think the main considerations for an implementation are the following:

If the project owners are aligned on these questions, I can clean up and submit a PR for my LLaVA implementation.
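(Editorial illustration, not the fork's actual code: the snippet below sketches the kind of image+question generation call an MMMU-style LLaVA evaluation needs. It assumes transformers >= 4.36, the `llava-hf/llava-1.5-7b-hf` checkpoint, and a placeholder image path; the prompt template follows the LLaVA 1.5 chat format.)

```python
# Minimal sketch (assumptions labeled above): answer one MMMU-style
# image + multiple-choice question with LLaVA 1.5 7B via transformers.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example_mmmu_figure.png")  # placeholder path
question = "Which option best describes the figure? (A) ... (B) ... (C) ... (D) ..."
prompt = f"USER: <image>\n{question}\nASSISTANT:"  # LLaVA 1.5 prompt format

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

A harness integration would batch such calls and, for multiple-choice scoring, compare per-option log-likelihoods rather than parsing generated text, but the model/processor plumbing is the same.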

ashvinnihalani commented 3 months ago

Wanted to follow up on this.

ashvinnihalani commented 3 months ago

Just wanted to give a heads-up that I've started the PR.