Open steven-safeai opened 6 months ago
Hi!
Could you elaborate more about what the particular use case of defining both multiple tasks and multiple groups in a given YAML file would be?
Currently, we operate under the assumption that if a list of tasks is specified, the YAML is being used to define a single new group/benchmark which will aggregate over that list of tasks. If a list of tasks is not specified, then the file is a task YAML and is only meant to define a single task for now.
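To make that assumption concrete, here is a minimal sketch of the rule as described above. It is not the harness's actual loading code: `classify_config` is a hypothetical helper, and the config dicts (including the task names `task_a` and `task_b`) are made-up stand-ins for parsed YAML files.

```python
from typing import Any, Mapping


def classify_config(cfg: Mapping[str, Any]) -> str:
    """Hypothetical helper: apply the rule described in the comment above.

    A config whose `task` field is a list is treated as a group/benchmark
    that aggregates over those tasks; anything else is a single-task YAML.
    """
    if isinstance(cfg.get("task"), list):
        return "group"
    return "task"


# Made-up configs standing in for parsed YAML files.
group_cfg = {"group": "dt_subset", "task": ["task_a", "task_b"]}
task_cfg = {"task": "task_a", "dataset_path": "some/dataset"}

print(classify_config(group_cfg))
print(classify_config(task_cfg))
```

Under this reading, a file that lists several tasks is always interpreted as one group definition, which is why a second group name in the same file has no obvious meaning.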
Would setting things up similar to MMLU solve your use case, with one group containing a few subgroups and their subtasks? https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu/default/_mmlu.yaml
I ended up doing that (multiple files like mmlu) to fix the problem.
I viewed a group as a way to invoke a set of tasks for evaluation, and a list of groups as a way to call that task or set of tasks by multiple different names. I didn't see anywhere in the docs that you are only supposed to have either a group list or a task list but not both, hence this issue. If this is the intended behavior, I'm happy to make a PR to update the docs so the implicit assumptions are made explicit.
I wanted it mostly for convenience: to define this task list as a group and be able to call it from multiple different group names, à la `decoding_trust` or `dt_subset`.
Makes sense!
A docs update would be greatly appreciated! We are planning on moving toward differentiating "groups"/tags (a convenient name/shortcut to call a group of tasks) from "groups"/benchmarks, which require a separate YAML file and should explicitly define how to aggregate a single score over a task grouping. I hope this will add clarity here too.
@haileyschoelkopf
If I have firewall issues and want to load the dataset from disk instead of downloading it, where in the `lm_eval` code should I use the `load_from_disk` function from the `datasets` library?
Hi @vovoluck, this seems like a separate issue from the one here. However, `Task` (and particularly `ConfigurableTask`) objects have a `download()` method, which you could edit to use `load_from_disk` to your liking.
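A rough sketch of what such an override could look like, under several assumptions: `make_offline_download` and `FakeTask` are hypothetical names used for illustration, the task is assumed to keep its dataset location in a `DATASET_PATH` attribute and its loaded data in `self.dataset`, and the loader is passed in so the example runs without `datasets` or `lm_eval` installed. In real use the loader would be `datasets.load_from_disk` and the method would be attached to your own task class.

```python
from typing import Any, Callable, Optional


def make_offline_download(load_fn: Callable[[str], Any]) -> Callable[..., None]:
    """Build a replacement `download` method that reads from local disk.

    `load_fn` is expected to behave like `datasets.load_from_disk`:
    it takes a directory path and returns the loaded dataset.
    """

    def download(self, dataset_kwargs: Optional[dict] = None) -> None:
        # dataset_kwargs is accepted for signature compatibility but ignored
        # here, since remote-loading options may not apply to a local load.
        self.dataset = load_fn(self.DATASET_PATH)

    return download


class FakeTask:
    """Stand-in for a task class; the attribute names are assumptions."""

    DATASET_PATH = "/data/my_dataset"  # made-up local directory
    dataset = None


# Stub loader so the sketch runs anywhere; swap in datasets.load_from_disk.
FakeTask.download = make_offline_download(lambda path: {"loaded_from": path})

task = FakeTask()
task.download()
print(task.dataset)
```

The dataset directory would need to have been written beforehand with `save_to_disk` on a machine that has network access.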
I think this is a bug. I'm trying to add decoding trust like so
However I encounter an error here: