EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

[Discussion/Feedback] VLM + Multimodal benchmarking #1155

Open · haileyschoelkopf opened this issue 8 months ago

haileyschoelkopf commented 8 months ago

This is an issue to discuss the feasibility / desirability of including and supporting multimodal benchmarks in lm-eval. With the rise of multimodal (V)LMs and of benchmarks like MMMU designed to test them, it's worth discussing both whether lm-eval is easily extensible to these evaluations and whether we want them in scope for the library.
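(Editorial illustration, not part of the issue: the extensibility question largely comes down to the shape of an evaluation request. The sketch below shows one hypothetical way a multimodal request and backend interface could look; the names `MultimodalRequest` and `MultimodalLM` are invented for illustration and are not part of lm-eval's API.)

```python
# Hypothetical sketch only: MultimodalRequest / MultimodalLM are illustrative
# names, NOT lm-eval API. The point is that multimodal benchmarks mainly change
# the request shape (text + images); scoring and aggregation could stay as-is.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MultimodalRequest:
    """One evaluation request carrying prompt text plus optional images."""
    context: str                                        # e.g. an MMMU question
    images: List[bytes] = field(default_factory=list)   # raw image payloads
    continuation: Optional[str] = None                  # target for loglikelihood tasks


class MultimodalLM(ABC):
    """Minimal interface a VLM backend would need to expose to the harness."""

    @abstractmethod
    def generate(self, requests: List[MultimodalRequest]) -> List[str]:
        """Free-form generation for generation-style tasks."""

    @abstractmethod
    def loglikelihood(self, requests: List[MultimodalRequest]) -> List[float]:
        """log P(continuation | context, images) for multiple-choice tasks."""
```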

ashvinnihalani commented 5 months ago

There is already a fork of lm-evaluation-harness built by the LLaVA team, called lmms-eval, that is focused on VLM evaluation and can serve as a PoC. For what it's worth, I already have a private fork of lm-eval that adds LLaVA support and works with MMMU and the LLaVA 1.5 7B model through the original codebase. I think the main considerations for an implementation are the following:

If the project owners are aligned on these questions, I can clean up and submit a PR for my LLaVA implementation.
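(Editorial illustration, not the fork's actual code: the snippet below sketches the kind of image+question generation call an MMMU-style LLaVA evaluation needs. It assumes transformers >= 4.36, the `llava-hf/llava-1.5-7b-hf` checkpoint, and a placeholder image path; the prompt template follows the LLaVA 1.5 chat format.)

```python
# Minimal sketch (assumptions labeled above): answer one MMMU-style
# image + multiple-choice question with LLaVA 1.5 7B via transformers.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example_mmmu_figure.png")  # placeholder path
question = "Which option best describes the figure? (A) ... (B) ... (C) ... (D) ..."
prompt = f"USER: <image>\n{question}\nASSISTANT:"  # LLaVA 1.5 prompt format

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

A harness integration would batch such calls and, for multiple-choice scoring, compare per-option log-likelihoods rather than parsing generated text, but the model/processor plumbing is the same.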

ashvinnihalani commented 3 months ago

Wanted to follow up on this.

ashvinnihalani commented 3 months ago

Just wanted to give a heads-up that I've started the PR.