This repo is the engine for the evaluations displayed in our Agents v2.0 announcement post.
You can use it to test agents:
- on different frameworks,
- on different benchmarks,
- and with different models (see the benchmark below); a sweep over all three dimensions is sketched right after this list.
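The grid of combinations can be driven from a single loop. The sketch below is purely illustrative: `run_eval` and all the framework, benchmark, and model names are placeholder assumptions, not this repo's actual API.

```python
from itertools import product

# All names below are placeholders, not this repo's actual identifiers.
FRAMEWORKS = ["framework_a", "framework_b"]
BENCHMARKS = ["benchmark_a", "benchmark_b"]
MODELS = ["model_a", "model_b"]

def run_eval(framework: str, benchmark: str, model: str) -> float:
    # Hypothetical entry point: run one agent configuration, return accuracy.
    return 0.0

if __name__ == "__main__":
    for framework, benchmark, model in product(FRAMEWORKS, BENCHMARKS, MODELS):
        score = run_eval(framework, benchmark, model)
        print(f"{framework} / {benchmark} / {model}: {score:.2%}")
```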
We also implement LLM-as-judge evaluation, with judge calls run in parallel for faster results.
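As a rough illustration, here is a minimal sketch of parallelized judge scoring, assuming a hypothetical `call_judge` wrapper. The exact-match fallback inside it is only a stand-in so the sketch runs end to end; the real judge would send `JUDGE_PROMPT` to an LLM.

```python
from concurrent.futures import ThreadPoolExecutor

JUDGE_PROMPT = (
    "Compare the predicted answer to the reference answer.\n"
    "Reply with 1 if they match, 0 otherwise.\n"
    "Question: {question}\nPrediction: {prediction}\nReference: {reference}"
)

def call_judge(question: str, prediction: str, reference: str) -> int:
    # Hypothetical stand-in: the real judge would format JUDGE_PROMPT and
    # send it to an LLM. Exact match keeps this sketch self-contained.
    return int(prediction.strip().lower() == reference.strip().lower())

def judge_all(records: list[dict], max_workers: int = 8) -> list[int]:
    """Score each (question, prediction, reference) record in parallel threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(call_judge, r["question"], r["prediction"], r["reference"])
            for r in records
        ]
        return [f.result() for f in futures]

if __name__ == "__main__":
    records = [
        {"question": "2+2?", "prediction": "4", "reference": "4"},
        {"question": "Capital of France?", "prediction": "Lyon", "reference": "Paris"},
    ]
    print(judge_all(records))  # -> [1, 0]
```

Threads suit this workload because judge calls are I/O-bound; to adapt the sketch, swap the body of `call_judge` for a real LLM client call.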