HillZhang1999 opened 8 months ago
Hey @HillZhang1999
Thanks for your interest in our paper!
We have recently added Expert-FACTOR, based on ExpertQA (https://arxiv.org/abs/2309.07852), a long-form question answering dataset. To adapt the QA task to text completion, we first concatenated each question-answer pair into a single document, and then ran it through the FACTOR data pipeline. Since this benchmark evaluates factuality in a more task-specific scenario, it may be better suited for evaluating instruction-tuned models.
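The adaptation step above (concatenating each question-answer pair into a single document before feeding it to the FACTOR pipeline) can be sketched as follows. This is a minimal illustration, not the authors' actual code; the field names `question` and `answer` and the newline separator are assumptions.

```python
# Hypothetical sketch of the QA-to-completion adaptation described above:
# each (question, answer) pair is joined into one document, which could then
# be passed to the FACTOR data pipeline. Field names are assumptions.

def qa_pairs_to_docs(qa_pairs):
    """Concatenate each question-answer pair into a single document."""
    docs = []
    for pair in qa_pairs:
        doc = f"{pair['question'].strip()}\n{pair['answer'].strip()}"
        docs.append(doc)
    return docs

pairs = [{"question": "What causes tides?",
          "answer": "Tides are caused mainly by the Moon's gravity."}]
print(qa_pairs_to_docs(pairs)[0])
```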
Thanks a lot! I will try them asap.
Hi @HillZhang1999 @dorm-ai21, if you're still trying to benchmark / eval LLMs:
I'm the maintainer of LiteLLM. I believe LiteLLM makes it easier for you to run benchmarks and evaluate LLMs (I'd love your feedback if it doesn't).
Try it here: https://docs.litellm.ai/docs/simple_proxy https://github.com/BerriAI/litellm
Ollama models
$ litellm --model ollama/llama2 --api_base http://localhost:11434
Hugging Face Models
$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]
$ litellm --model huggingface/bigcode/starcoder
Anthropic
$ export ANTHROPIC_API_KEY=my-api-key
$ litellm --model claude-instant-1
Palm
$ export PALM_API_KEY=my-palm-key
$ litellm --model palm/chat-bison
openai.api_base = "http://0.0.0.0:8000"
python3 -m lm_eval \
--model openai-completions \
--model_args engine=davinci \
--tasks crows_pairs_english_age
Dear authors,
Firstly, I would like to express my gratitude for your exceptional work.
Recently, I attempted to use FACTOR to evaluate instruction-tuned models, such as llama2-chat. However, I observed that FACTOR's evaluation format is primarily designed for text completion, making it more suitable for base models than for instruction-tuned models.
To adapt the task for SFT models, I experimented with prompts such as "Please complete the following text." However, their performance still falls behind that of base models, which differs from the results I obtained on other benchmarks, such as TruthfulQA.
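The prompting approach described above can be sketched as wrapping FACTOR's raw completion prefix in an explicit instruction for the chat model. This is a minimal illustration under stated assumptions: the instruction wording and the chat-message format are placeholders, not anything prescribed by FACTOR.

```python
# Hypothetical sketch of prompting an SFT/chat model on a completion-style
# task: wrap the raw completion prefix in an explicit instruction.
# The instruction text and message schema are assumptions for illustration.

def build_chat_prompt(prefix):
    """Turn a raw completion prefix into an instruction-style chat message."""
    return [
        {"role": "user",
         "content": f"Please complete the following text.\n\n{prefix}"},
    ]

messages = build_chat_prompt("The Eiffel Tower is located in")
print(messages[0]["content"])
```

One caveat with this style of wrapping: the chat template and added instruction change the token context the model conditions on, which may partly explain why instruction-tuned models score differently than base models on likelihood-based completion benchmarks.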
I would greatly appreciate any insights or suggestions you may have. Thank you!