declare-lab / instruct-eval

This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
https://declare-lab.github.io/instruct-eval/
Apache License 2.0

Regarding the comparison to lm-evaluation-harness #10

gakada commented 1 year ago

Regarding

> Compared to existing libraries such as evaluation-harness and HELM, this repo enables simple and convenient evaluation for multiple models. Notably, we support most models from HuggingFace Transformers

isn't

```
python main.py mmlu --model_name llama --model_path some-llama
```

roughly the same as

```
python main.py --model_args pretrained=some-llama,... --tasks hendrycksTest* --num_fewshot 5
```

in lm-evaluation-harness? There is also `python scripts/regression.py --models multiple-models --tasks multiple-tasks` for evaluating multiple models across multiple tasks in one run. lm-evaluation-harness likewise supports most HuggingFace models, as well as some OpenAI and Anthropic models.
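
Even without `scripts/regression.py`, a plain shell loop over the harness CLI already covers the multi-model case. A minimal sketch, assuming the harness's `hf-causal` backend; the model paths (`some-llama`, `some-alpaca`) and output file names are hypothetical placeholders:

```bash
#!/usr/bin/env bash
# Minimal sketch: loop the lm-evaluation-harness CLI over several models.
# Model paths below are placeholders; use hendrycksTest-* for the full MMLU suite.
for model in some-llama some-alpaca; do
  python main.py \
    --model hf-causal \
    --model_args "pretrained=${model}" \
    --tasks hendrycksTest-abstract_algebra \
    --num_fewshot 5 \
    --output_path "results_$(basename "$model").json"
done
```

Each model's scores land in their own JSON file; `scripts/regression.py` roughly automates this pattern and adds the cross-model comparison on top.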