Using Language Models as Evaluators

lintangsutawika commented 4 months ago

How can we integrate the use of language models to evaluate language model generations?

Currently, lm-eval evaluates language model generations with conventional metrics such as accuracy, bleu, etc. This is proven to have shortcomings such as frequently being a poor proxy of progress or performance. There have been current methods such as using GPT-4 to evaluate other language model generations, other open-source approaches such as Prometheus have also gain interest.

Basic Requirements:

Call lm-eval to directly evaluate previously-saved model generation
The language model "judge" would be loaded as a metric and output a score. Supports both API-Based and HF models. API-based models will likely need to have several different class objects for each provider (OpenAI, Cohere, Google) while the class for HF model would be a single implementation. All classes should inherit from a base class so that future contributors can easily propose their own. Main methods for base class should be around (1) scoring a single sample (2) scoring a batch of samples (3) aggregating results. This should be compatible with how metrics are currently used or implemented. If required, we can also revisit how to refactor metrics in lm-eval.
Should these judges also need metric parameters that are configurable via a yaml?

Advanced/Optional Requirements:

Since metric evaluation occurs after all generations are obtained, it should be possible to flush the GPUs and load the language model judge to GPU for faster.
Option to evaluate judges themselves through means like inter-model agreement

lintangsutawika commented 4 months ago

Multi-step call would probably be more easier to implement.

artemorloff commented 4 months ago

@lintangsutawika why not separate multi-step/multi-round from llm-as-a-judge? usually multi-step requires the same model to generate on prompt that includes the previous generation. Seems quite different and no need for such a complicated structure to implement these tasks. Evenmore, llm-as-a-judge seems to be some kind of a metric. Multi-step tasks may require that the currect promt include all previous lm answers which leads to calling the same lm to be judge multiple times. or i got something wrong?

lintangsutawika commented 4 months ago

Llm-as-a-judge can be considered a type of metric. The point being the score typically associated with a heuristic is now outputted by a language model which may or may not be the same type/version of the model being evaluated. This isn't really related to the multistep idea but is something that could be beneficial for evaluating multistep tasks.

SeungoneKim commented 3 months ago

@lintangsutawika Is there currently any initiative for this feature? I would love to help

lintangsutawika commented 3 months ago

Yes, currently @baberabb is working on a PR for this. I think it'd be great if you can provide some feedback based on your experience on Prometheus.

SeungoneKim commented 2 months ago

@baberabb Hello Baber, nice to meet you! I'd love to collaborate with you on working on this! On which platform could I best communicate with you (e.g., slack, discord)?

haileyschoelkopf commented 1 month ago

@SeungoneKim @lintangsutawika @baberabb and I met to discuss LLM-as-a-Judge.

key takeaways:

we'll be implementing a 2-step approach first: step 1) run our to-be-eval'ed model on a benchmark and log its outputs to disk ; step 2) run a judge model over the generated outputs and obtain the final scores
To allow full user control over judge model settings such as prompt, temperature, answer extraction, self-consistency, etc. LLM-as-Judge scoring pipelines will take the form of a whole other task config YML that determines these parameters (and into which any LM type can be used as the judge model of choice.). These tasks will additionally ingest model responses from step 1).
@baberabb will start by looking at #1627 so that we can rescore "vanilla" non-judge tasks and so that we can also use that for LLM-as-Judge type tasks.
@SeungoneKim is going to write out a number of desirable features for full completeness, or usefulness of this feature to researchers, and send it here!

EleutherAI / lm-evaluation-harness

Using Language Models as Evaluators #1831