AI21Labs / factor

Code and data for the FACTOR paper
MIT License

How to use Factor to evaluate instruction-tuned LLMs? #2

Open HillZhang1999 opened 8 months ago

HillZhang1999 commented 8 months ago

Dear authors,

Firstly, I would like to express my gratitude for your exceptional work.

Recently, I attempted to use FACTOR to evaluate instruction-tuned models, such as llama2-chat. However, I observed that FACTOR's evaluation format is designed around text completion, which makes it better suited to base models than to instruction-tuned models.

To give SFT models an explicit instruction, I experimented with prompts such as "Please complete the following text." (see the sketch below). However, their performance still falls behind that of base models. This differs from the results I obtained on other benchmarks, such as TruthfulQA.
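For reference, here is a minimal sketch of the kind of wrapping I tried; the system prompt and instruction wording are my own, not part of FACTOR, and the chat template is the standard Llama-2 one:

# Sketch: wrap a FACTOR text-completion prefix in the Llama-2 chat template.
# The system prompt and instruction wording here are illustrative guesses.
def build_chat_prompt(prefix: str) -> str:
    system = "You are a helpful assistant that continues the given text."
    instruction = f"Please complete the following text:\n\n{prefix}"
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{instruction} [/INST]"

The wrapped prompt then replaces the raw prefix, and scoring proceeds as in the completion setting.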

I would greatly appreciate any insights or suggestions you may have. Thank you!

dorm-ai21 commented 8 months ago

Hey @HillZhang1999

Thanks for your interest in our paper!

We have recently added Expert-FACTOR, based on ExpertQA (https://arxiv.org/abs/2309.07852), a long-form question answering dataset. To adapt the QA task to text completion, we first concatenated each question-answer pair into a single document, and then ran the FACTOR data pipeline on the result. Since this benchmark evaluates factuality in a more task-specific scenario, it may be better suited to evaluating instruction-tuned models.
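Roughly, the concatenation step looks like this (a minimal sketch; the field names are illustrative, not the actual ExpertQA schema):

# Sketch of the QA-to-document step described above; the "question"/"answer"
# field names are illustrative, not the actual ExpertQA schema.
def qa_to_doc(example: dict) -> str:
    # Join the question and its long-form answer into one passage, which
    # the FACTOR data pipeline can then treat as a plain source document.
    return f"{example['question']}\n{example['answer']}"

examples = [{"question": "What causes tides?",
             "answer": "Tides are driven mainly by the Moon's gravity."}]
docs = [qa_to_doc(ex) for ex in examples]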

HillZhang1999 commented 8 months ago

Thanks a lot! I will try them asap.

ishaan-jaff commented 8 months ago

Hi @HillZhang1999 @dorm-ai21, if you're still trying to benchmark / evaluate LLMs:

I'm the maintainer of LiteLLM. I believe LiteLLM makes it easier for you to run benchmarks and evaluate LLMs (I'd love your feedback if it does not)

Try it here: https://docs.litellm.ai/docs/simple_proxy (repo: https://github.com/BerriAI/litellm)

Using LiteLLM Proxy Server

Creating a proxy server

Ollama models

$ litellm --model ollama/llama2 --api_base http://localhost:11434

Hugging Face Models

$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]
$ litellm --model huggingface/bigcode/starcoder  # example repo id; any supported HF model works

Anthropic

$ export ANTHROPIC_API_KEY=my-api-key
$ litellm --model claude-instant-1

Palm

$ export PALM_API_KEY=my-palm-key
$ litellm --model palm/chat-bison

Set the API base to the proxy:

import openai
openai.api_base = "http://0.0.0.0:8000"
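A request through the proxy then looks like this (a minimal sketch using the pre-1.0 openai Python SDK; the model name and dummy key are placeholders):

import openai

openai.api_base = "http://0.0.0.0:8000"  # local LiteLLM proxy from above
openai.api_key = "anything"              # dummy; the proxy holds the real provider keys

# Standard OpenAI-style chat call, routed by the proxy to the chosen backend.
response = openai.ChatCompletion.create(
    model="ollama/llama2",  # placeholder; use whichever model the proxy serves
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response["choices"][0]["message"]["content"])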

Using it to run an eval with the LM Evaluation Harness (lm-eval):

python3 -m lm_eval \
  --model openai-completions \
  --model_args engine=davinci \
  --tasks crows_pairs_english_age