Find the best research benchmarks

agentcoinorg / evo.ninja

A versatile generalist agent.

MIT License

1.05k stars, 173 forks

More benchmarks to consider:

JungleGym, a set of open-source datasets and tools for testing and building autonomous web agents. It covers categories such as Travel, Shopping, and Entertainment. Built by an a16z developer.

https://twitter.com/Mascobot/status/1729561077724111275

This one is run here, with a leaderboard of the best-performing agents: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

🥅 📊 https://github.com/EleutherAI/lm-evaluation-harness/tree/master
Full list here https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md

Find the best research benchmarks #501