dmitrii-palisaderesearch opened this issue 3 months ago (status: Open)

Hi,

The wandb logger chokes if a group contains some tasks that output numbers and some that output strings. This is either a bug in `WandbLogger.log_eval_samples` or in the `openllm` group (maybe group tasks ought to be homogeneous by design).

Traceback: `WandbLogger.log_eval_samples` concatenates task outputs into one big dataframe without converting types, and wandb balks at this.
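For illustration, the failure mode is roughly the following (a hypothetical minimal sketch with placeholder task names and data, not the harness's actual logging code):

```python
import pandas as pd
import wandb

# Two tasks whose per-sample outputs have different types:
# one answers with numbers, the other with strings.
numeric_task = pd.DataFrame({"task": ["taskA"] * 2, "answer": [42, 7]})
string_task = pd.DataFrame({"task": ["taskB"] * 2, "answer": ["yes", "no"]})

# Concatenation silently produces an object-dtype "answer" column
# mixing int and str...
combined = pd.concat([numeric_task, string_task], ignore_index=True)

# ...which wandb's per-column type validation then rejects.
table = wandb.Table(dataframe=combined)
```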
@lintangsutawika will #1741 fix this, do you think?
We're working on making groups clearer: namely, making a distinction between homogeneous groups, which will report aggregated scores on a given metric, and heterogeneous "groups" (to become tags), which are just convenience names used when invoking a number of related tasks at once.
I'm not yet able to reproduce this. It seems to work fine with the latest version in `main`:
```
lm-eval \
  --model_args "pretrained=gpt2" \
  --task openllm \
  --limit 4 \
  --device cpu \
  --wandb_args project=xxx
```
I think `--log_samples` may be required to reproduce?
Yes, it is. Sorry, missed it in the repro.
Still seems to work from `main`. @dmitrii-palisaderesearch, are you using the latest `main`?
```
lm-eval \
  --model_args "pretrained=gpt2" \
  --task openllm \
  --limit 4 \
  --device cpu \
  --wandb_args project=xxx \
  --log_samples --output test_wandb/
```
Hey, here are the repro Colab notebooks:

The `lm-eval` CLI eats the exception and hides it under an INFO log entry, so it's a little hard to see; I added an `lm_eval` library call as well so you can see the exact line that throws, with a traceback.
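For reference, the library-side call looks roughly like this (a sketch following the harness README's W&B logging section; `simple_evaluate` arguments and `WandbLogger` usage may differ between versions):

```python
import lm_eval
from lm_eval.loggers import WandbLogger

# Run the same evaluation as the CLI command above, but in-process,
# so the logging exception propagates instead of being swallowed.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["openllm"],
    limit=4,
    device="cpu",
    log_samples=True,
)

wandb_logger = WandbLogger(project="xxx", job_type="eval")
wandb_logger.post_init(results)
wandb_logger.log_eval_result()
# This is the call that throws on mixed-type sample columns.
wandb_logger.log_eval_samples(results["samples"])
```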
BTW thanks for your repro cmd, it's extremely helpful :)
Thanks, I see what the issue is now. It's a matter of not being able to reconcile different data types that belong in the same column, which can happen when calling a set of tasks whose samples have different data types. I think the solution here is to not concatenate different tasks together. Unless that's actually desirable?
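One shape that fix could take (a hedged sketch with placeholder data, not necessarily how the harness would implement it) is logging one table per task, so no column ever mixes types:

```python
import pandas as pd
import wandb

# Placeholder per-task sample frames.
per_task_frames = {
    "taskA": pd.DataFrame({"answer": [42, 7]}),
    "taskB": pd.DataFrame({"answer": ["yes", "no"]}),
}

run = wandb.init(project="xxx")
# One wandb.Table per task: each column keeps a single dtype,
# so there is no cross-task type reconciliation to do.
for task_name, frame in per_task_frames.items():
    run.log({f"samples/{task_name}": wandb.Table(dataframe=frame)})
run.finish()
```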
So it would be convenient to get samples from `mmlu_abstract_algebra`. OTOH, I don't feel like getting samples from "open llm leaderboard" will be useful: aggregate metrics suffice there.
This is probably what #1741 will do.
For the specific MMLU case, to still support having the full sample list logged, perhaps we could add a flag in group configs that retains all subtasks' samples together for logging, and otherwise not log groups' samples? @lintangsutawika, do you think this seems too contrived?
For groups like MMLU, the wandb issue shouldn't occur since all subtasks share the same format.
On grouping samples, the bigger issue may be how we log the `results.json` and `samples.json` files. I guess this means it's not a quick fix, but one that should suit long-term usability.
Btw @dmitrii-palisaderesearch, if you want the samples from the MMLU tasks, would running just `mmlu` suffice?
Sure, this works great. I just wanted to assemble my benchmark into one big yaml config and hit this.