-
The clickext class should be extended to detect whether a flag value is an HF path or a local path. Based on which type of argument is passed, there should be a log message and specific behavior for the d…
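A minimal sketch of the idea (not the project's actual clickext code): a custom `click.ParamType` that checks whether the value exists on disk, otherwise matches a rough `org/name` Hub pattern, and logs which branch was taken. The class name, return value, and regex are assumptions for illustration.

```python
import logging
import os
import re

import click

logger = logging.getLogger(__name__)

# Rough "org/name" pattern for a Hugging Face Hub repo id (assumption).
HF_REPO_RE = re.compile(r"^[\w.\-]+/[\w.\-]+$")


class ModelPathOrRepo(click.ParamType):
    name = "model_path_or_repo"

    def convert(self, value, param, ctx):
        if os.path.exists(value):
            logger.info("Treating %s as a local path", value)
            return ("local", value)
        if HF_REPO_RE.match(value):
            logger.info("Treating %s as a Hugging Face Hub repo id", value)
            return ("hf", value)
        self.fail(
            f"{value!r} is neither an existing local path nor an HF repo id",
            param,
            ctx,
        )
```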
-
Hello,
I want to express my gratitude for your outstanding work. The powerful lm-evaluation-harness and your continuous maintenance have made LLM evaluation much more convenient.
However, I hav…
-
**❗BEFORE YOU BEGIN❗**
Are you on discord? 🤗 We'd love to have you asking questions on discord instead: https://discord.com/invite/a3K9c8GRGt
**Describe the bug**
A clear and concise description …
-
Even when specifying a different `--output-path` via the CLI flag, the benchmark runs still produce some data under the folder `benchmark_output`. The folders produced are `benchmark_output/dialect` a…
-
Are there any plans to release detailed performance metrics for individual tasks from BBH and MMLU? I think it could be very valuable for research to be able to look at those individual task performan…
-
### Your current environment
```text
The output of `python collect_env.py`
```
### 🐛 Describe the bug
vLLM has an issue where we can go OOM if too many `logprobs` are requested.
The reason t…
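A minimal sketch of the trigger as described, not a fix: in versions affected by this issue, asking for logprobs over a very large number of candidate tokens makes vLLM materialize a large logprob structure for every generated token, which can exhaust GPU memory. The model name and the `logprobs` value below are arbitrary assumptions for illustration.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

# Deliberately extreme logprobs request to illustrate the memory blow-up.
params = SamplingParams(max_tokens=32, logprobs=50_000)

outputs = llm.generate(["The capital of France is"], params)
```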
-
Hi!
There is a bug in the if/else statement that causes it to fail when the example is too long: `current_k_shot` may become `-1` in the `if` branch, and on the next iteration `get_prompt_from_dataframe…
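A hypothetical reconstruction of the pattern being described, with assumed names and signatures: the shot count is decremented whenever the assembled prompt is too long, and without a clamp it can reach `-1` before the next `get_prompt_from_dataframe` call.

```python
def build_prompt_with_k_shot(df, k_shot, max_input_len, get_prompt_from_dataframe, token_len):
    """Drop shots until the prompt fits (sketch under assumed helpers)."""
    current_k_shot = k_shot
    prompt = get_prompt_from_dataframe(df, current_k_shot)
    while token_len(prompt) > max_input_len:
        # Without the max(..., 0) clamp, a very long example drives
        # current_k_shot to -1 and the next call gets a negative shot count.
        current_k_shot = max(current_k_shot - 1, 0)
        prompt = get_prompt_from_dataframe(df, current_k_shot)
        if current_k_shot == 0:
            break
    return prompt
```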
-
Currently, users of `deepeval` can only create their own evaluation datasets/test cases. To better support users fine-tuning their models, `deepeval` should be able to import standard benchmarks such as M…
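A hypothetical sketch of what importing a standard benchmark could look like, assuming `deepeval`'s `LLMTestCase` / `EvaluationDataset` API; the benchmark, split, column names, and the `generate_answer` callable are illustrative assumptions, not a proposed interface.

```python
from datasets import load_dataset
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase


def benchmark_to_dataset(generate_answer) -> EvaluationDataset:
    # Pull a public benchmark from the HF Hub and map its rows to test cases.
    rows = load_dataset("truthful_qa", "generation", split="validation")
    test_cases = [
        LLMTestCase(
            input=row["question"],
            actual_output=generate_answer(row["question"]),
            expected_output=row["best_answer"],
        )
        for row in rows
    ]
    return EvaluationDataset(test_cases=test_cases)
```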
-
Thanks for your great work! I cloned the `math_eval` directory and ran `run_7B_plus.sh` directly, and found performance gaps on some datasets.
| Model | TheoremQA | GPQA |…
-
Loading the med_qa_open and MMLU datasets does not work in source view.
(It works in thoughtsource view, so generating CoTs etc. is working.)