EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Support for Using Multiple Choice Datasets with GPT-4o Model via OpenAI API #2326

Closed: Laplace888 closed this issue 1 month ago

Laplace888 commented 1 month ago

I am using the GPT-4o model with the openai-chat-completions model type. While evaluating various tasks, I get an error whenever a task's output_type is set to multiple_choice.

I tried switching to the openai-completions model type to work around this, but it appears that GPT-4o is not supported on that endpoint.

Is there any other way to solve this?

Error when using openai-chat-completions

lm_eval --model openai-chat-completions ^
    --model_args model=gpt-4o ^
    --tasks mmlu_anatomy ^
    --output_path ./result ^
    --apply_chat_template ^
    --log_samples ^
    --show_config ^
    --limit 10

Traceback (most recent call last):
  File "C:\Users\KT\miniconda3\envs\gpt_test\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\KT\miniconda3\envs\gpt_test\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\KT\miniconda3\envs\gpt_test\Scripts\lm_eval.exe\__main__.py", line 7, in <module>
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\evaluator.py", line 301, in simple_evaluate
    results = evaluate(
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\evaluator.py", line 476, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\models\openai_completions.py", line 168, in loglikelihood
    raise NotImplementedError(
NotImplementedError: Loglikelihood is not supported for chat completions. Consider using the completions API instead.

Error when using openai-completions

lm_eval --model openai-completions ^
    --model_args model=gpt-4o ^
    --tasks mmlu_anatomy ^
    --output_path ./result ^
    --apply_chat_template ^
    --show_config ^
    --limit 10

Traceback (most recent call last):
  File "C:\Users\KT\miniconda3\envs\gpt_test\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\KT\miniconda3\envs\gpt_test\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\KT\miniconda3\envs\gpt_test\Scripts\lm_eval.exe\__main__.py", line 7, in <module>
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\evaluator.py", line 301, in simple_evaluate
    results = evaluate(
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\evaluator.py", line 476, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\models\openai_completions.py", line 201, in loglikelihood
    return super().loglikelihood(requests, **kwargs)
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\api\model.py", line 374, in loglikelihood
    context_enc, continuation_enc = self._encode_pair(context, continuation)
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\api\model.py", line 343, in _encode_pair
    n_spaces = len(context) - len(context.rstrip())
AttributeError: 'JsonChatStr' object has no attribute 'rstrip'

Error when using openai-completions without --apply_chat_template

lm_eval --model openai-completions ^
    --model_args model=gpt-4o ^
    --tasks mmlu_anatomy ^
    --output_path ./result ^
    --show_config ^
    --limit 10

. Retrying...
2024-09-20:13:16:22,323 WARNING  [api_models.py:338] API request failed with error message: {
  "error": {
    "message": "This is a chat model and not supported in the v1/completions endpoint. Did you mean to use v1/chat/completions?",
    "type": "invalid_request_error",
    "param": "model",
    "code": null
  }
}
. Retrying...
Traceback (most recent call last):
  File "C:\Users\KT\miniconda3\envs\gpt_test\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\KT\miniconda3\envs\gpt_test\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\KT\miniconda3\envs\gpt_test\Scripts\lm_eval.exe\__main__.py", line 7, in <module>
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\evaluator.py", line 301, in simple_evaluate
    results = evaluate(
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\evaluator.py", line 476, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\models\openai_completions.py", line 201, in loglikelihood
    return super().loglikelihood(requests, **kwargs)
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\api\model.py", line 378, in loglikelihood
    return self._loglikelihood_tokens(new_reqs, disable_tqdm=disable_tqdm)
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\models\api_models.py", line 493, in _loglikelihood_tokens
    outputs = retry(
  File "C:\Users\KT\miniconda3\envs\gpt_test\lib\site-packages\tenacity\__init__.py", line 336, in wrapped_f
    return copy(f, *args, **kw)
  File "C:\Users\KT\miniconda3\envs\gpt_test\lib\site-packages\tenacity\__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
  File "C:\Users\KT\miniconda3\envs\gpt_test\lib\site-packages\tenacity\__init__.py", line 376, in iter
    result = action(retry_state)
  File "C:\Users\KT\miniconda3\envs\gpt_test\lib\site-packages\tenacity\__init__.py", line 418, in exc_check
    raise retry_exc.reraise()
  File "C:\Users\KT\miniconda3\envs\gpt_test\lib\site-packages\tenacity\__init__.py", line 185, in reraise
    raise self.last_attempt.result()
  File "C:\Users\KT\miniconda3\envs\gpt_test\lib\concurrent\futures\_base.py", line 451, in result
    return self.__get_result()
  File "C:\Users\KT\miniconda3\envs\gpt_test\lib\concurrent\futures\_base.py", line 403, in __get_result
    raise self._exception
  File "C:\Users\KT\miniconda3\envs\gpt_test\lib\site-packages\tenacity\__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
  File "C:\Users\KT\Desktop\conda_GPT\lm-evaluation-harness\lm_eval\models\api_models.py", line 341, in model_call
    response.raise_for_status()
  File "C:\Users\KT\miniconda3\envs\gpt_test\lib\site-packages\requests\models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://api.openai.com/v1/completions

baberabb commented 1 month ago

Hey @Laplace888. Multiple-choice tasks depend on logprobs, and OpenAI does not provide those for any of the newer models (only davinci-002, iirc). I'll add a more detailed error message.
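
For context, the harness scores a multiple-choice item by requesting the log-probability of each answer choice conditioned on the prompt and picking the highest-scoring choice. Below is a minimal sketch of that idea, assuming the legacy /v1/completions endpoint and a model such as davinci-002 that still returns prompt logprobs via echo=True; it is an illustration only, not the harness's actual implementation, and the question/choices are made up:

```python
# Sketch: loglikelihood-based multiple-choice scoring via prompt logprobs.
# Assumes the legacy /v1/completions endpoint and a model (e.g. davinci-002)
# that supports echo=True with logprobs; chat models like gpt-4o do not.
from openai import OpenAI

client = OpenAI()

def choice_logprob(context: str, choice: str) -> float:
    resp = client.completions.create(
        model="davinci-002",
        prompt=context + choice,
        max_tokens=0,   # generate nothing; we only want logprobs of the prompt
        echo=True,      # echo the prompt back with per-token logprobs
        logprobs=1,
    )
    lp = resp.choices[0].logprobs
    # Keep only the tokens belonging to the answer choice (the continuation).
    # (Approximate split by character offset; the harness tokenizes the
    # context and continuation separately to handle boundary tokens.)
    start = next(i for i, off in enumerate(lp.text_offset) if off >= len(context))
    return sum(lp.token_logprobs[start:])

question = "The scapula is part of which structure? Answer:"
choices = [" the shoulder girdle", " the pelvis", " the skull", " the foot"]
scores = {c: choice_logprob(question, c) for c in choices}
print(max(scores, key=scores.get))
```

gpt-4o is only exposed on /v1/chat/completions, which never echoes prompt logprobs, so this scoring path is unavailable for it.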

Sshubam commented 1 month ago

Hey @baberabb, I am facing the same issue. But OpenAI does provide logprobs: https://platform.openai.com/docs/api-reference/chat/create#:~:text=the%20relevant%20token.-,logprobs,-boolean%20or%20null

Can you please elaborate on why it is not possible to evaluate GPT-4o on multiple-choice questions using the eval harness?

Thanks.

baberabb commented 1 month ago

> Hey @baberabb, I am facing the same issue. But OpenAI does provide logprobs: https://platform.openai.com/docs/api-reference/chat/create#:~:text=the%20relevant%20token.-,logprobs,-boolean%20or%20null
>
> Can you please elaborate on why it is not possible to evaluate GPT-4o on multiple-choice questions using the eval harness?
>
> Thanks.

Sorry, I should have said prompt logprobs. For multiple-choice tasks we are only concerned with the logprobs of the inputs, i.e. the prompt plus each candidate answer, not of generated tokens. See https://github.com/EleutherAI/lm-evaluation-harness/issues/942#issuecomment-1777836312
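
To illustrate the difference: with the chat endpoint, logprobs=True only attaches log-probabilities to the tokens the model generates, so there is nothing to score for a fixed answer string. A rough sketch with the current openai Python SDK (the prompt and parameter values here are just for illustration):

```python
# Sketch: chat-completions logprobs cover only the generated tokens.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Is the scapula part of the shoulder girdle? Answer yes or no."}],
    logprobs=True,
    top_logprobs=5,
    max_tokens=1,
)

# Logprobs are attached to the *completion* tokens only; the prompt side,
# which a multiple-choice loglikelihood request needs, is never returned.
for tok in resp.choices[0].logprobs.content:
    print(tok.token, tok.logprob)
```

That is why loglikelihood-based tasks like mmlu_anatomy need the completions endpoint (and a model that supports prompt logprobs), while chat-only models can only be run on generation-style tasks (output_type: generate_until).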