The following is a list of TODOs to implement LLM-as-a-Judge in Eval-Harness:
TLDR
Splits the existing `evaluate` function into `classification_evaluate` and `generation_evaluate`.
Enables the user to decide whether to run a 2-stage pipeline (generating responses, then `generation_evaluate`) or only the latter.
Supports pre-defined high-level functions (listed in Desirable Features below).
Implementation Adjustments
[ ] In `/lm_eval/evaluator.py`, split the `evaluate` function into `classification_evaluate` and `generation_evaluate`.
[ ] The existing `evaluate` function could be renamed to `classification_evaluate`.
[ ] `generation_evaluate` would handle the newly added LLM-as-a-Judge functionality.
[ ] The `evaluate` function could determine whether to call `classification_evaluate` or `generation_evaluate` (a minimal dispatch sketch follows below).
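A minimal sketch of the proposed dispatch, assuming tasks routed to LLM-as-a-Judge declare the new `free-form` output type; `classification_evaluate` and `generation_evaluate` are the proposed (not yet existing) functions:

```python
# Sketch of the proposed dispatch in /lm_eval/evaluator.py; the names
# classification_evaluate and generation_evaluate come from this proposal
# and do not exist in the harness yet.

def evaluate(lm, task_dict, **kwargs):
    classification_tasks, generation_tasks = {}, {}
    for name, task in task_dict.items():
        # Tasks declaring the new free-form output type take the
        # LLM-as-a-Judge path; everything else keeps the current behavior.
        if getattr(task, "OUTPUT_TYPE", None) == "free-form":
            generation_tasks[name] = task
        else:
            classification_tasks[name] = task

    results = {}
    if classification_tasks:
        # The existing evaluate logic, renamed.
        results.update(classification_evaluate(lm, classification_tasks, **kwargs))
    if generation_tasks:
        # New: generate (or load) responses, then score them with the judge.
        results.update(generation_evaluate(lm, generation_tasks, **kwargs))
    return results
```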
[ ] Add arguments to the YAML file (an example task YAML is sketched after this list).
[ ] `load_model_response`: Decides whether to run the 2-stage pipeline (generating responses and then evaluating them with an evaluator) or not.
[ ] Either true or false; the default value is false.
[ ] If the user sets this to true, they must also provide `model_response_dir`.
[ ] `model_response_dir`: Points to a JSON file containing a list of dictionaries. Each dictionary has three keys: `instruction`, `reference_answer`, and `response` (a loader sketch follows the example below).
[ ] Ignored if `load_model_response` is set to false.
[ ] Must be provided if `load_model_response` is set to true.
```json
[
    {
        "instruction": "Tell me 5 ways to pass my math exam.",
        "reference_answer": "Absolutely! You can prepare your test as follows: ...",
        "response": "Sure, I can help you! First, ..."
    },
    ...
]
```
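For reference, a minimal loader sketch for this file, assuming the three keys above; the helper name and error messages are hypothetical:

```python
import json

REQUIRED_KEYS = {"instruction", "reference_answer", "response"}

def load_model_responses(path):
    """Hypothetical helper: load and validate the file behind model_response_dir."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a list of dictionaries")
    for i, record in enumerate(data):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"{path}: entry {i} is missing keys {sorted(missing)}")
    return data
```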
[ ] In `metric_list`:
[ ] Add `LLM-as-a-Judge` to `metric`.
[ ] `evaluator`: Decides which evaluator (judge) model to use.
[ ] `evaluatee`: Decides which model is evaluated. Similar to `model_args`, it could load a Hugging Face checkpoint or a local checkpoint.
[ ] Ignored if `load_model_response` is set to true.
[ ] Must be provided if `load_model_response` is set to false.
[ ] `llm-as-a-judge_meta-prompt`: Similar to `doc_to_text`, but used to prompt the evaluator model.
[ ] Provide a few examples inside a `utils/llm-as-a-judge_meta-prompt.py` file.
[ ] Keys to include from the original YAML file:
[ ] `tag`: Add `generation`, which calls `generation_evaluate` instead of `classification_evaluate`.
[ ] `task`
[ ] `dataset_path`: Must be included if `load_model_response` is set to false.
[ ] `dataset_name`: Must be included if `load_model_response` is set to false.
[ ] `output_type`: Add `free-form`.
[ ] `training_split`
[ ] `validation_split`
[ ] `test_split`
[ ] `fewshot_split`: Ignored if `load_model_response` is set to true.
[ ] `doc_to_text`: Must be included if `load_model_response` is set to false.
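To tie the arguments above together, here is a hypothetical example task YAML; every concrete value (task name, judge model, file path, prompt template) is a placeholder, and the schema follows this proposal rather than any existing Eval-Harness config:

```yaml
# Hypothetical task config using the proposed LLM-as-a-Judge arguments.
tag: generation                  # routes the task to generation_evaluate
task: example_llm_judge_task
output_type: free-form
load_model_response: true        # judge pre-computed responses instead of generating
model_response_dir: responses/example_responses.json
# dataset_path, dataset_name, and doc_to_text are omitted because
# load_model_response is true; evaluatee would likewise be ignored.
metric_list:
  - metric: LLM-as-a-Judge
    evaluator: gpt-4             # placeholder judge model
llm-as-a-judge_meta-prompt: |
  You are an impartial judge. Rate the response to the instruction below on a
  scale of 1-5 and briefly explain your reasoning.
  Instruction: {{ instruction }}
  Reference answer: {{ reference_answer }}
  Response: {{ response }}
```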
Desirable Features
Evaluation Format
[ ] Pointwise Assessment: Given a single response, the evaluator assigns a scalar score to the response.
[ ] Pairwise Assessment: Given two responses, the evaluator decides which response is better.
[ ] Listwise Assessment: Given at least three responses, the evaluator reranks the responses.
[ ] Could internally use Pointwise Assessment or Pairwise Assessment (see the sketch below).
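As an illustration of the last point, listwise assessment could be reduced to pairwise assessment roughly as follows; `pairwise_judge` is a placeholder for whatever judge call the harness ends up exposing:

```python
from functools import cmp_to_key

def listwise_rank(instruction, responses, pairwise_judge):
    """Rerank three or more responses via repeated pairwise judgments.

    pairwise_judge(instruction, a, b) is a placeholder callable returning
    "A" if response a is preferred and "B" otherwise.
    """
    def compare(a, b):
        return -1 if pairwise_judge(instruction, a, b) == "A" else 1

    # Best response first.
    return sorted(responses, key=cmp_to_key(compare))
```

Sorting assumes the judge's preferences are roughly transitive; a round-robin tournament with win counts would be a more robust (but more expensive) alternative.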
Support for Reference-free and Reference-based evaluations
Controllability over the Degree of Fine-grained Evaluation
Most naive evaluation criteria: helpfulness, harmlessness
Task-level evaluation criteria
Instance-level evaluation criteria
Inclusion of Verbal Feedback
[ ] Need a parsing function to split the verbal feedback from the scoring decision.
Parsing functions
[ ] If the judge is prompted to generate its output as a dictionary, there should be a pre-defined function that checks the validity of the output (a parser sketch follows this list).
[ ] For direct assessment, it should check whether the scoring decision is in the desired range (e.g., 1-5).
[ ] For pairwise assessment, it should check whether the scoring decision is included in the user-provided pre-defined list (e.g., A or B / response 1 or response 2).
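A minimal sketch of such a parser, assuming the judge is asked to return a JSON dictionary with `feedback` and `score` keys, with a fallback to a `[RESULT]`-delimited format; the key names and delimiter are assumptions:

```python
import json
import re

def parse_judgment(output, valid_scores):
    """Split verbal feedback from the scoring decision and validate the score.

    valid_scores is a range for direct assessment (e.g. range(1, 6)) or a
    list of labels for pairwise assessment (e.g. ["A", "B"]). Returns
    (feedback, score); raises ValueError if the output cannot be parsed.
    """
    text = output.strip()
    try:
        parsed = json.loads(text)
        feedback, score = parsed["feedback"], parsed["score"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Fallback: assume "<feedback> [RESULT] <score>" style output.
        match = re.search(r"\[RESULT\]\s*(\S+)\s*$", text)
        if match is None:
            raise ValueError(f"Could not parse judge output: {output!r}")
        feedback, score = text[: match.start()].strip(), match.group(1)

    if all(isinstance(v, int) for v in valid_scores):
        try:
            score = int(score)
        except (TypeError, ValueError):
            raise ValueError(f"Expected an integer score, got {score!r}")
    if score not in valid_scores:
        raise ValueError(f"Score {score!r} not among {list(valid_scores)}")
    return feedback, score
```

Direct assessment would call `parse_judgment(out, range(1, 6))`; pairwise assessment would call `parse_judgment(out, ["A", "B"])`.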
Specifically, it would be great to evaluate the logicality of the rationale, not just the correctness of the answer itself. This would help distinguish false positives (responses that get the answer correct but have a very bad rationale for deriving it)!
@haileyschoelkopf @lintangsutawika @baberabb