EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Update Zeno Integration #1175

Open · haileyschoelkopf opened 8 months ago

haileyschoelkopf commented 8 months ago

There are a few things we'll need to do to fix edge cases in our Zeno integration.

cc @lintangsutawika for your awareness: unifying how metrics/aggregations are computed could break Zeno's reliance on per-example metric values.

cc @Sparkier!

Sparkier commented 8 months ago

Regarding tests, we could run a test on our end; e.g., we have integration tests set up that create a project on every push. That wouldn't alert you if anything breaks, though. I could also help set up a test in your codebase that uploads a project to Zeno, if you'd like.

Sparkier commented 8 months ago

Are there any other changes we could make on our end to improve the Zeno projects that get created?

We've seen some patterns in useful metadata recently. For example, the model output length in freeform answers is often interesting. We could think about additional metadata that would make sense per task to further enhance the created Zeno project.
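As an illustrative sketch only (the column names and dataframe shape here are assumptions, not the harness's actual per-example schema), metadata like output length can be derived as extra dataframe columns before upload, so Zeno can slice and filter on them:

```python
import pandas as pd

# Hypothetical per-example results, roughly as they might be collected
# before being sent to Zeno (column names are illustrative).
results = pd.DataFrame({
    "id": [0, 1, 2],
    "output": ["Paris", "The answer is 42.", ""],
})

# Derived metadata columns: character and word counts of each freeform answer.
results["output_length"] = results["output"].str.len()
results["output_num_words"] = results["output"].str.split().str.len()

print(results[["id", "output_length", "output_num_words"]])
```

Any such column uploaded alongside the outputs becomes a filterable/sliceable field in the created Zeno project.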

Sparkier commented 7 months ago

For tests, see https://github.com/EleutherAI/lm-evaluation-harness/pull/1221

Sparkier commented 7 months ago

For additional metadata, see https://github.com/EleutherAI/lm-evaluation-harness/pull/1222