EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Add tasks for performance on long context lengths #1748

Open nairbv opened 2 months ago

nairbv commented 2 months ago

There are a couple of papers with benchmarks for very long context lengths that don't seem to be available in lm-evaluation-harness. It would be great to have one of these, or something similar, for measuring a model's ability to extract information from long context windows, which is important for RAG.

haileyschoelkopf commented 2 months ago

A needle-in-a-haystack task might also be a nice-to-have, though I think more difficult / "natural" long-context evals should be prioritized.
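
For reference, a minimal sketch of how a needle-in-a-haystack example could be generated; the filler text, needle fact, and question are illustrative placeholders, not an existing task or API in the harness:

```python
# Hypothetical sketch: bury a single "needle" fact at a chosen depth inside
# a long filler context, then ask the model to retrieve it. All strings
# below are placeholders for illustration only.

FILLER_SENTENCE = "The grass is green and the sky is blue. "
NEEDLE = "The secret passphrase is 'aurora borealis'. "
QUESTION = "What is the secret passphrase mentioned in the text above?"


def build_example(context_chars: int, needle_depth: float) -> str:
    """Return a prompt with the needle inserted at `needle_depth`
    (0.0 = start of the context, 1.0 = end)."""
    # Repeat filler text until the context reaches the desired length.
    reps = context_chars // len(FILLER_SENTENCE) + 1
    haystack = (FILLER_SENTENCE * reps)[:context_chars]

    # Splice the needle in at the requested relative position.
    insert_at = int(len(haystack) * needle_depth)
    context = haystack[:insert_at] + NEEDLE + haystack[insert_at:]

    return f"{context}\n\n{QUESTION}\nAnswer:"


if __name__ == "__main__":
    # Sweep a few context sizes and needle positions, as a typical
    # needle-in-a-haystack grid would.
    for n_chars in (2_000, 20_000):
        for depth in (0.1, 0.5, 0.9):
            prompt = build_example(n_chars, depth)
            print(f"context~{n_chars} chars, depth={depth}: "
                  f"prompt length {len(prompt)}")
```

A real task would score whether the model's completion contains the needle, and would sweep context length and needle position to produce the usual retrieval-accuracy grid.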