EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Update Zeno Integration #1175

Open · haileyschoelkopf opened 8 months ago

haileyschoelkopf commented 8 months ago

There are a few things we'll need to do to fix edge cases in our Zeno integration.

cc @lintangsutawika for your awareness: unifying how metrics/aggregations are computed could break Zeno's reliance on per-example metric values.

cc @Sparkier!

Sparkier commented 8 months ago

Regarding tests, we could run a test on our end; e.g., we have integration tests set up that create a project on every push. That wouldn't alert you if anything breaks, though. I could also help set up a test in your codebase that uploads a project to Zeno, if you'd like.

Sparkier commented 8 months ago

Are there any other changes we could make on our end to improve the Zeno projects that get created?

We've seen some patterns in useful metadata recently. For example, the model output length in freeform answers is often interesting. We could think about additional metadata that would make sense per task to further enhance the created Zeno project.
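As an illustrative sketch only (the column names and dataframe shape here are assumptions, not the harness's actual per-example schema), metadata like output length can be derived as extra dataframe columns before upload, so Zeno can slice and filter on them:

```python
import pandas as pd

# Hypothetical per-example results, roughly as they might be collected
# before being sent to Zeno (column names are illustrative).
results = pd.DataFrame({
    "id": [0, 1, 2],
    "output": ["Paris", "The answer is 42.", ""],
})

# Derived metadata columns: character and word counts of each freeform answer.
results["output_length"] = results["output"].str.len()
results["output_num_words"] = results["output"].str.split().str.len()

print(results[["id", "output_length", "output_num_words"]])
```

Any such column uploaded alongside the outputs becomes a filterable/sliceable field in the created Zeno project.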

Sparkier commented 7 months ago

For tests, see https://github.com/EleutherAI/lm-evaluation-harness/pull/1221

Sparkier commented 7 months ago

For additional metadata, see https://github.com/EleutherAI/lm-evaluation-harness/pull/1222