The following is a list of TODOs to implement LLM-as-a-Judge in Eval-Harness:
TLDR
Splits the existing `evaluate` function into `classification_evaluate` and `generation_evaluate`.
Enables the user to decide whether to run a 2-stage pipeline (generating responses, then `generation_evaluate`) or only the latter.
Supports pre-defined high-level functions (listed in Desirable Features below).
Implementation Adjustments
[ ] In `/lm_eval/evaluator.py`, split the `evaluate` function into `classification_evaluate` and `generation_evaluate`.
[ ] The existing `evaluate` function could be renamed to `classification_evaluate`.
[ ] `generation_evaluate` would handle the newly added LLM-as-a-Judge functionality.
[ ] The `evaluate` function could determine whether to call `classification_evaluate` or `generation_evaluate` (a minimal dispatch sketch follows below).
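A minimal sketch of the proposed dispatch, assuming tasks routed to LLM-as-a-Judge declare the new `free-form` output type; `classification_evaluate` and `generation_evaluate` are the proposed (not yet existing) functions:

```python
# Sketch of the proposed dispatch in /lm_eval/evaluator.py; the names
# classification_evaluate and generation_evaluate come from this proposal
# and do not exist in the harness yet.

def evaluate(lm, task_dict, **kwargs):
    classification_tasks, generation_tasks = {}, {}
    for name, task in task_dict.items():
        # Tasks declaring the new free-form output type take the
        # LLM-as-a-Judge path; everything else keeps the current behavior.
        if getattr(task, "OUTPUT_TYPE", None) == "free-form":
            generation_tasks[name] = task
        else:
            classification_tasks[name] = task

    results = {}
    if classification_tasks:
        # The existing evaluate logic, renamed.
        results.update(classification_evaluate(lm, classification_tasks, **kwargs))
    if generation_tasks:
        # New: generate (or load) responses, then score them with the judge.
        results.update(generation_evaluate(lm, generation_tasks, **kwargs))
    return results
```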
[ ] Add arguments to the YAML file (an example task YAML is sketched after this list).
[ ] `load_model_response`: Decides whether to run the 2-stage pipeline (generating responses and then evaluating them with an evaluator) or not.
[ ] Either true or false; the default value is false.
[ ] If the user sets this to true, they must also provide `model_response_dir`.
[ ] `model_response_dir`: Points to a JSON file containing a list of dictionaries. Each dictionary has three keys: `instruction`, `reference_answer`, and `response` (a loader sketch follows the example below).
[ ] Ignored if `load_model_response` is set to false.
[ ] Must be provided if `load_model_response` is set to true.
```json
[
    {
        "instruction": "Tell me 5 ways to pass my math exam.",
        "reference_answer": "Absolutely! You can prepare your test as follows: ...",
        "response": "Sure, I can help you! First, ..."
    },
    ...
]
```
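For reference, a minimal loader sketch for this file, assuming the three keys above; the helper name and error messages are hypothetical:

```python
import json

REQUIRED_KEYS = {"instruction", "reference_answer", "response"}

def load_model_responses(path):
    """Hypothetical helper: load and validate the file behind model_response_dir."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a list of dictionaries")
    for i, record in enumerate(data):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"{path}: entry {i} is missing keys {sorted(missing)}")
    return data
```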
[ ] In `metric_list`:
[ ] Add `LLM-as-a-Judge` to `metric`.
[ ] `evaluator`: Decides which evaluator (judge) model to use.
[ ] `evaluatee`: Decides which model is evaluated. Similar to `model_args`, it could load a Hugging Face checkpoint or a local checkpoint.
[ ] Ignored if `load_model_response` is set to true.
[ ] Must be provided if `load_model_response` is set to false.
[ ] `llm-as-a-judge_meta-prompt`: Similar to `doc_to_text`, but used to prompt the evaluator model.
[ ] Provide a few examples inside a `utils/llm-as-a-judge_meta-prompt.py` file.
[ ] Keys to include from the original YAML file:
[ ] `tag`: Add `generation`, which calls `generation_evaluate` instead of `classification_evaluate`.
[ ] `task`
[ ] `dataset_path`: Must be included if `load_model_response` is set to false.
[ ] `dataset_name`: Must be included if `load_model_response` is set to false.
[ ] `output_type`: Add `free-form`.
[ ] `training_split`
[ ] `validation_split`
[ ] `test_split`
[ ] `fewshot_split`: Ignored if `load_model_response` is set to true.
[ ] `doc_to_text`: Must be included if `load_model_response` is set to false.
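To tie the arguments above together, here is a hypothetical example task YAML; every concrete value (task name, judge model, file path, prompt template) is a placeholder, and the schema follows this proposal rather than any existing Eval-Harness config:

```yaml
# Hypothetical task config using the proposed LLM-as-a-Judge arguments.
tag: generation                  # routes the task to generation_evaluate
task: example_llm_judge_task
output_type: free-form
load_model_response: true        # judge pre-computed responses instead of generating
model_response_dir: responses/example_responses.json
# dataset_path, dataset_name, and doc_to_text are omitted because
# load_model_response is true; evaluatee would likewise be ignored.
metric_list:
  - metric: LLM-as-a-Judge
    evaluator: gpt-4             # placeholder judge model
llm-as-a-judge_meta-prompt: |
  You are an impartial judge. Rate the response to the instruction below on a
  scale of 1-5 and briefly explain your reasoning.
  Instruction: {{ instruction }}
  Reference answer: {{ reference_answer }}
  Response: {{ response }}
```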
Desirable Features
Evaluation Format
[ ] Pointwise Assessment: Given a single response, the evaluator assigns a scalar score to the response.
[ ] Pairwise Assessment: Given two responses, the evaluator decides which response is better.
[ ] Listwise Assessment: Given at least three responses, the evaluator reranks the responses.
[ ] Could internally use Pointwise Assessment or Pairwise Assessment (see the sketch below).
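As an illustration of the last point, listwise assessment could be reduced to pairwise assessment roughly as follows; `pairwise_judge` is a placeholder for whatever judge call the harness ends up exposing:

```python
from functools import cmp_to_key

def listwise_rank(instruction, responses, pairwise_judge):
    """Rerank three or more responses via repeated pairwise judgments.

    pairwise_judge(instruction, a, b) is a placeholder callable returning
    "A" if response a is preferred and "B" otherwise.
    """
    def compare(a, b):
        return -1 if pairwise_judge(instruction, a, b) == "A" else 1

    # Best response first.
    return sorted(responses, key=cmp_to_key(compare))
```

Sorting assumes the judge's preferences are roughly transitive; a round-robin tournament with win counts would be a more robust (but more expensive) alternative.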
Support for Reference-free and Reference-based evaluations
Controllability over the Degree of Fine-grained Evaluation
Most naive evaluation criteria: helpfulness, harmlessness
Task-level evaluation criteria
Instance-level evaluation criteria
Inclusion of Verbal Feedback
[ ] Need a parsing function to split the verbal feedback from the scoring decision.
Parsing functions
[ ] If the judge is prompted to generate its output as a dictionary, there should be a pre-defined function that checks the validity of the output (a parser sketch follows this list).
[ ] For direct assessment, it should check whether the scoring decision is in the desired range (e.g., 1-5).
[ ] For pairwise assessment, it should check whether the scoring decision is included in the user-provided pre-defined list (e.g., A or B / response 1 or response 2).
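A minimal sketch of such a parser, assuming the judge is asked to return a JSON dictionary with `feedback` and `score` keys, with a fallback to a `[RESULT]`-delimited format; the key names and delimiter are assumptions:

```python
import json
import re

def parse_judgment(output, valid_scores):
    """Split verbal feedback from the scoring decision and validate the score.

    valid_scores is a range for direct assessment (e.g. range(1, 6)) or a
    list of labels for pairwise assessment (e.g. ["A", "B"]). Returns
    (feedback, score); raises ValueError if the output cannot be parsed.
    """
    text = output.strip()
    try:
        parsed = json.loads(text)
        feedback, score = parsed["feedback"], parsed["score"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Fallback: assume "<feedback> [RESULT] <score>" style output.
        match = re.search(r"\[RESULT\]\s*(\S+)\s*$", text)
        if match is None:
            raise ValueError(f"Could not parse judge output: {output!r}")
        feedback, score = text[: match.start()].strip(), match.group(1)

    if all(isinstance(v, int) for v in valid_scores):
        try:
            score = int(score)
        except (TypeError, ValueError):
            raise ValueError(f"Expected an integer score, got {score!r}")
    if score not in valid_scores:
        raise ValueError(f"Score {score!r} not among {list(valid_scores)}")
    return feedback, score
```

Direct assessment would call `parse_judgment(out, range(1, 6))`; pairwise assessment would call `parse_judgment(out, ["A", "B"])`.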
Specifically, it would be great to evaluate the logicality of the rationale, not just the correctness of the answer itself. This would help distinguish false positives (responses that get the answer correct but have a very bad rationale for deriving it)!
@haileyschoelkopf @lintangsutawika @baberabb