Open dOrgJelli opened 7 months ago
More benchmarks to consider:
JungleGym, a set of open-source datasets and tools for testing and building autonomous web agents. It tests categories such as Travel, Shopping, and Entertainment. Built by an a16z dev.
This one is run here: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard and has a leaderboard of the best-performing agents.
Add the GAIA benchmarks to the repo so that we can gradually test each one and mark it as "functioning", then run regression tests against the functioning ones in CI. https://huggingface.co/datasets/gaia-benchmark/GAIA https://arxiv.org/abs/2311.12983
NOTE: before doing this, we should make sure the GAIA benchmarks are the most aligned benchmarks with what we're trying to achieve, and see if better benchmarks may exist.
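To make the "mark as functioning, then run regressions in CI" idea concrete, here is a rough sketch of how it could look. Everything in it is an assumption for illustration: `BenchmarkCase`, `functioning_cases`, `run_regression`, and the example task IDs and questions are hypothetical and are not taken from the GAIA dataset or from this repo. The agent is modelled as a plain function from question to answer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class BenchmarkCase:
    """One benchmark task (hypothetical schema, not the GAIA format)."""
    task_id: str
    question: str
    expected: str
    functioning: bool = False  # flip to True once the agent passes reliably


def functioning_cases(cases: List[BenchmarkCase]) -> List[BenchmarkCase]:
    """Return only the cases that have been marked as functioning."""
    return [case for case in cases if case.functioning]


def run_regression(
    cases: List[BenchmarkCase], agent: Callable[[str], str]
) -> Dict[str, bool]:
    """Run the agent on every functioning case and report pass/fail per task."""
    results: Dict[str, bool] = {}
    for case in functioning_cases(cases):
        answer = agent(case.question)
        # Case-insensitive, whitespace-insensitive exact match (an assumption;
        # a real benchmark may define its own scoring rule).
        results[case.task_id] = (
            answer.strip().lower() == case.expected.strip().lower()
        )
    return results


# Placeholder cases for illustration only.
CASES = [
    BenchmarkCase("example-001", "What is 2 + 2?", "4", functioning=True),
    BenchmarkCase("example-002", "Name the capital of France.", "Paris"),
]
```

In CI, a job could call `run_regression(CASES, agent)` and fail the build if any value in the result is `False`. Cases not yet marked as functioning are skipped, so they can be enabled one at a time.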