-
We would like to evaluate model performance for various LLM fine-tuning approaches and compare them against standard benchmarks. An experiment we would like to try is:
- **Compare the full car…
-
Hi,
The wandb logger chokes if a group contains some tasks that output numbers and some that output strings. This is either a bug in `WandbLogger.log_eval_samples` or in the `openllm` group (maybe …
-
There seems to be a discrepancy between the [leaderboard](https://huggingface.co/spaces/uonlp/open_multilingual_llm_leaderboard) and this repository, which may mean that models were benchmar…
-
This is just a reminder: in environments where network access is restricted, **please remember to set the environment variable HF_DATASETS_OFFLINE to 1 to enable full offline mode**. This will prevent …
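A minimal sketch of what this looks like in practice, assuming the variable is set before `datasets` (or `lm_eval`) is imported so the flag is actually picked up; the `HF_HUB_OFFLINE` line is an extra assumption for fully offline Hub access, not something required by the note above:
```python
import os

# Set before importing `datasets` / `lm_eval`; the library reads its
# offline flag when it is imported.
os.environ["HF_DATASETS_OFFLINE"] = "1"
# Assumption: also block Hub lookups; drop this if model files should
# still be fetched from the Hub.
os.environ["HF_HUB_OFFLINE"] = "1"

import datasets  # imported after the env vars on purpose

# Loads only from the local cache and raises if the dataset was never downloaded.
mmlu = datasets.load_dataset("cais/mmlu", "anatomy")
```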
-
Hi, @haileyschoelkopf Thank you for your awesome open-source work. We have been evaluating with `lm-eval` and noticed that when using `accelerate` for data parallel inference, the number of GPUs utili…
-
### Describe the feature
How can I use OpenCompass to evaluate a local Alpaca model on MMLU and other datasets?
### Will you implement it?
- [ ] I would like to implement this feature and create a PR!
-
Hi,
While running `leaderboard_mmlu_pro` evals, I've noticed an unexpected space character. Here is an example request:
```
2024-09-25:06:46:53,199 INFO [evaluator_utils.py:200] Request: Insta…
```
-
When running the example run_spec `mmlu:subject=anatomy,model=openai/gpt2` with no caching, the HuggingFace client outputs the following warning on every call:
```
Setting `pad_token_id` to `…
```
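For context, this is the standard `transformers` warning emitted when `generate()` is called on a model such as GPT-2 that has no pad token configured. A standalone sketch of where it comes from and the usual way to silence it, not HELM's actual client code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The human skeleton contains", return_tensors="pt")

# GPT-2 has no pad token, so generate() warns and falls back to eos_token_id
# on every call. Passing it explicitly silences the warning without changing
# the generated output.
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Setting `model.generation_config.pad_token_id = tokenizer.eos_token_id` once should have the same effect without touching every `generate()` call.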
-
For multilingual ARC, the [original implementation](https://github.com/nlp-uoregon/mlmm-evaluation/blob/main/lm_eval/tasks/multilingual_arc.py) uses 25 shots, but in lm_eval it doesn't:
https://github.…
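For what it's worth, the shot count can be overridden at evaluation time while this is sorted out. A rough sketch, assuming the `lm_eval.simple_evaluate` Python API from recent releases and an illustrative task name (check the task list for the actual multilingual ARC task names):
```python
import lm_eval

# Assumptions: the task name below is only illustrative, and `simple_evaluate`
# accepts a `num_fewshot` override as in recent lm-evaluation-harness releases.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["arc_challenge"],  # replace with the relevant multilingual ARC task
    num_fewshot=25,           # match the 25-shot setup of the original implementation
)
print(results["results"])
```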
-
# Motivation
I want to use MMLU results by task to better understand the characteristics of LLMs. I am curious to see the differences between architectures and how performance on the tasks changes as…