aymeric-roucher / agent_reasoning_benchmark

🔧 Compare how Agent systems perform on several benchmarks. 📊🚀
Apache License 2.0
47 stars, 5 forks

Benchmark agent workflows: try the models of your choice on the framework that you want

This repo is the engine for the evaluations displayed in our Agents v2.0 announcement post.

You can use it to test agents on different frameworks, on different benchmarks, and with different models (see the benchmark figure below).

We also implement LLM-judge evaluation, with parallel processing for faster results.
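
As a rough illustration of parallel LLM-judge evaluation, here is a minimal sketch, not this repo's actual implementation. The names `stub_judge` and `judge_in_parallel`, and the `question` / `answer` / `reference` fields, are hypothetical; a real judge would call an LLM instead of comparing strings. Threads suit this workload because each judge call mostly waits on a network response.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def stub_judge(question: str, answer: str, reference: str) -> int:
    """Hypothetical stand-in for an LLM judge call.

    Returns 1 when the answer matches the reference (case-insensitive), else 0.
    A real judge would send a grading prompt to a model and parse its score.
    """
    return int(answer.strip().lower() == reference.strip().lower())


def judge_in_parallel(
    examples: list[dict],
    judge: Callable[[str, str, str], int] = stub_judge,
    max_workers: int = 8,
) -> list[int]:
    """Score every example concurrently, keeping scores in input order."""

    def score(example: dict) -> int:
        return judge(example["question"], example["answer"], example["reference"])

    # Executor.map yields results in the same order as the inputs,
    # even when individual calls finish out of order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score, examples))


if __name__ == "__main__":
    examples = [
        {"question": "Capital of France?", "answer": "paris", "reference": "Paris"},
        {"question": "2 + 2?", "answer": "5", "reference": "4"},
    ]
    print(judge_in_parallel(examples))  # prints [1, 0]
```

To use a real model, pass any callable with the same three-argument signature as `judge`.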

[Figure: benchmark]