allenai / open-instruct

Apache License 2.0

Script to collect metrics from Beaker eval job. #138

Closed dwadden closed 5 months ago

dwadden commented 6 months ago

Give it the ID of a beaker job launched with submit_eval_jobs.py and it will grab the results and dump in a json.
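A minimal sketch of the collection step, assuming a results layout with one directory per eval task, each containing a `metrics.json` (the directory names and layout here are assumptions, not the script's actual structure):

```python
import json
from pathlib import Path

def collect_metrics(results_dir: str) -> dict:
    """Merge per-task metrics.json files (one per eval task directory)
    into a single {task_name: metrics} mapping."""
    merged = {}
    for metrics_file in sorted(Path(results_dir).glob("*/metrics.json")):
        task_name = metrics_file.parent.name
        with open(metrics_file) as f:
            merged[task_name] = json.load(f)
    return merged

# Usage: json.dump(collect_metrics("results"), open("metrics.json", "w"), indent=2)
```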

hamishivi commented 6 months ago

Could you add a copy-paste-friendly print in a configurable task order? Would be very useful to me!!

dwadden commented 6 months ago

Sure, what were you thinking exactly? Just print out the summary metrics I wrote in line 41, or all metrics? I can save as a pandas series, that's pretty easy to copy/paste.

hamishivi commented 5 months ago

Just printing out the summary metrics, or some easy-to-configure set of metrics. Basically, something that lets you easily copy a row into something like the tulu results spreadsheet.
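The copy-paste-friendly print could look something like this sketch: emit one tab-separated row in a fixed task order so it pastes cleanly into a spreadsheet. The task names and helper name are hypothetical:

```python
# Assumed task names for illustration; the real script defines its own set.
DEFAULT_TASK_ORDER = ["mmlu", "gsm8k", "bbh", "tydiqa", "alpaca_eval"]

def format_row(summary: dict, task_order=None) -> str:
    """Render summary metrics as one tab-separated spreadsheet row."""
    order = task_order or DEFAULT_TASK_ORDER
    # Missing tasks get an empty cell so spreadsheet columns stay aligned.
    return "\t".join(
        f"{summary[task]:.1f}" if task in summary else "" for task in order
    )

print(format_row({"gsm8k": 57.5, "mmlu": 61.2}, ["mmlu", "gsm8k", "bbh"]))
```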

dwadden commented 5 months ago

Do you have an eval job where all tasks completed successfully that I can use to debug? None of my recent alpaca-evals worked, so I can't check this. Note that this script is only designed to work for jobs kicked off with an up-to-date submit_eval_jobs.py, which runs all evals as separate jobs in a single experiment, like this.

hamishivi commented 5 months ago

https://beaker.org/ex/01HV0P4E3MW9211HX0JEKM0PXM/tasks/01HV0P4E3T3BPZRPK7DYFJ80V3/job/01HV0P4FF8CY8TX053ZRF97G45 here ya go!

dwadden commented 5 months ago

Thanks! Oh also -- what summary metric should we use for ifeval?

hamishivi commented 5 months ago

Either loose or strict accuracy works for ifeval.

dwadden commented 5 months ago

OK, all set. I added the new evals (ifeval, alpaca, etc.); they all seem covered now. You can now specify --table_file to dump summary metrics to a TSV, and/or --print_table to print to the console. You can control which tasks are printed, and in what order, with --task_order.

There's a usage example in the script; give it a try and let me know if this does what you were thinking.
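For reference, the CLI surface described above could be wired up roughly like this with argparse. The flag names come from the comment; the positional argument, defaults, and help text are assumptions:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Collect metrics from a Beaker eval job."
    )
    # Hypothetical positional arg: the experiment launched by submit_eval_jobs.py.
    parser.add_argument("experiment_id", help="Beaker experiment ID.")
    parser.add_argument("--table_file", help="If given, dump summary metrics to this TSV.")
    parser.add_argument("--print_table", action="store_true",
                        help="Print summary metrics to the console.")
    parser.add_argument("--task_order", nargs="+",
                        help="Tasks to include in the table, in order.")
    return parser

args = build_parser().parse_args(
    ["01HV0P4E3MW9211HX0JEKM0PXM", "--print_table", "--task_order", "mmlu", "gsm8k"]
)
```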