> Compared to existing libraries such as lm-evaluation-harness and HELM, this repo enables simple and convenient evaluation for multiple models. Notably, we support most models from HuggingFace Transformers
For this claim, isn't this roughly the same as what is already available in lm-evaluation-harness? Or also `python scripts/regression.py --models multiple-models --tasks multiple-tasks`? It also supports most HF models and some OpenAI and Anthropic models.
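To make the comparison concrete, here is a minimal sketch, assuming lm-evaluation-harness's `simple_evaluate` entry point, of evaluating several HF models on several tasks. The model and task names, the `hf` backend identifier, and the batch size are placeholder assumptions and may differ across harness versions; this is an illustration, not the repo's documented workflow.

```python
# Minimal sketch (placeholder names): loop several HF models over several tasks
# using lm-evaluation-harness's simple_evaluate API.
from lm_eval.evaluator import simple_evaluate

models = ["gpt2", "EleutherAI/pythia-160m"]  # placeholder model choices
tasks = ["hellaswag", "arc_easy"]            # placeholder task choices

for name in models:
    results = simple_evaluate(
        model="hf",                       # HF backend (identifier varies by harness version)
        model_args=f"pretrained={name}",  # which checkpoint to load
        tasks=tasks,
        batch_size=8,
    )
    # results["results"] maps each task name to its metric dict
    for task, metrics in results["results"].items():
        print(name, task, metrics)
```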