agentcoinorg / evo.ninja

A versatile generalist agent.
MIT License
1.05k stars 173 forks

Find the best research benchmarks #501

Open dOrgJelli opened 7 months ago

dOrgJelli commented 7 months ago

Add the GAIA benchmarks to the repo. That lets us test each one gradually and mark it as "functioning", so we can run regression tests against the functioning ones in CI.
Dataset: https://huggingface.co/datasets/gaia-benchmark/GAIA
Paper: https://arxiv.org/abs/2311.12983

NOTE: before doing this, we should make sure the GAIA benchmarks are the ones best aligned with what we're trying to achieve, and check whether better benchmarks exist.
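
A minimal sketch of the "mark as functioning, then run in CI" idea described above. Everything here is an assumption made for illustration: the `BenchmarkCase` fields, the `functioning` flag, the exact-match scoring, and the stub agent are all hypothetical, and the sample cases are made-up placeholders, not real GAIA tasks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkCase:
    """One benchmark task (hypothetical shape, not the GAIA schema)."""
    task_id: str
    question: str
    expected: str
    functioning: bool = False  # flipped to True once the agent handles the task


def select_regression_cases(cases: List[BenchmarkCase]) -> List[BenchmarkCase]:
    """Keep only the cases that have been marked as functioning."""
    return [case for case in cases if case.functioning]


def run_regression(cases: List[BenchmarkCase],
                   agent: Callable[[str], str]) -> List[str]:
    """Return the IDs of functioning cases that the agent now fails."""
    failures = []
    for case in select_regression_cases(cases):
        answer = agent(case.question)
        # Assumed scoring rule: case-insensitive exact match on the answer.
        if answer.strip().lower() != case.expected.strip().lower():
            failures.append(case.task_id)
    return failures


# Placeholder cases for illustration only.
demo_cases = [
    BenchmarkCase("demo-1", "What is 2 + 2?", "4", functioning=True),
    BenchmarkCase("demo-2", "What is the capital of France?", "Paris"),
]


def stub_agent(question: str) -> str:
    """Stand-in for the real agent; always answers "4"."""
    return "4"


failed_ids = run_regression(demo_cases, stub_agent)
```

In CI, a non-empty `failed_ids` would fail the job. Cases that are not yet marked as functioning are skipped, so they can be enabled one at a time without breaking the build.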

rihp commented 7 months ago

More benchmarks to consider:

JungleGym, a set of open-source datasets and tools for testing and building autonomous web agents. It covers test categories like Travel, Shopping, and Entertainment. Built by an a16z dev.


This one is run here: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard and has a leaderboard of the best-performing agents.