camel-ai / camel

🐫 CAMEL: Finding the Scaling Law of Agents. A multi-agent framework. https://www.camel-ai.org

[Feature Request] Integrate evaluation solution GAIA #640

Open Wendong-Fan opened 2 months ago

Wendong-Fan commented 2 months ago

Required prerequisites

Motivation

Add a benchmark for evaluating agents.

Solution

  1. AgentEval: Automated testing and benchmarking for code generation agents.
  2. GAIA, a benchmark for General AI Assistants (https://arxiv.org/abs/2311.12983); a rough integration sketch follows after this list.
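
As a starting point, here is a minimal sketch of how option 2 could plug into CAMEL. The `GAIABenchmark` wrapper below is hypothetical (not an existing CAMEL or GAIA API), and the column names "Question" and "Final answer" are assumed to mirror the GAIA dataset; in a real run the lambda would be replaced by a callback wrapping a CAMEL `ChatAgent`:

```python
# Hypothetical wrapper -- illustrative only, not an existing CAMEL API.
from typing import Callable, Dict, List


class GAIABenchmark:
    """Minimal interface a GAIA integration could expose inside CAMEL."""

    def __init__(self, questions: List[Dict[str, str]]):
        # Each record is assumed to carry "Question" and "Final answer" keys,
        # mirroring the column names in the GAIA dataset.
        self.questions = questions

    def run(self, answer_fn: Callable[[str], str]) -> float:
        """Call an agent on every question and return exact-match accuracy."""
        correct = 0
        for record in self.questions:
            prediction = answer_fn(record["Question"])
            if prediction.strip().lower() == record["Final answer"].strip().lower():
                correct += 1
        return correct / len(self.questions) if self.questions else 0.0


# Usage sketch with a placeholder question; a real run would pass a callback
# that wraps a CAMEL ChatAgent (or any other agent) instead of this lambda.
bench = GAIABenchmark([{"Question": "What is 2 + 2?", "Final answer": "4"}])
print(bench.run(lambda question: "4"))  # -> 1.0
```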

Alternatives

No response

Additional context

No response

Asher-hss commented 1 month ago

GAIA is a new benchmark designed to evaluate the capabilities of general AI assistants. It presents real-world questions that require fundamental abilities such as reasoning, multi-modal handling, web browsing, and tool use.

Specifically, GAIA requires AI assistants to solve tasks that are conceptually simple for ordinary people but involve carrying out complex sequences of actions.

GAIA's design principles include simplicity of questions, interpretability, non-gameability, and ease of use.

I have included the link to the GAIA dataset and test questions: https://huggingface.co/gaia-benchmark
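
For reference, a minimal sketch of pulling a few validation questions with the `datasets` library. The dataset is gated, so this assumes access has already been granted on the Hugging Face Hub and a token is configured; the config name `2023_level1` and the column names are assumed from the dataset card:

```python
# Minimal sketch: load GAIA Level 1 validation questions from the Hugging Face Hub.
# Assumes `huggingface-cli login` has been run with an account that accepted the
# dataset's terms; depending on the `datasets` version, trust_remote_code=True
# may also be needed since GAIA ships a loading script.
from datasets import load_dataset

gaia = load_dataset("gaia-benchmark/GAIA", "2023_level1", split="validation")

for record in gaia.select(range(3)):
    # Column names ("Question", "Final answer", "file_name") assumed from the dataset card.
    print(record["Question"])
    print("ground truth:", record["Final answer"])
    print("attached file:", record["file_name"] or "none")
    print("---")
```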

Wendong-Fan commented 1 month ago

Thanks for the study, @Asher-hss!

Asher-hss commented 1 month ago

Hi guys, Microsoft's AgentEval framework is also a good option; Microsoft is currently integrating it into AutoGen. Last week I found several different projects named AgentEval, so I prioritized researching GAIA first. Over the past few days I have been studying the latest version of Microsoft's AgentEval. The framework evaluates LLM applications comprehensively through three agents: CriticAgent, QuantifierAgent, and VerifierAgent.

https://microsoft.github.io/autogen/blog/2024/06/21/AgentEval
https://microsoft.github.io/autogen/blog/2024/01/25/AutoGenBench/
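
The critic/quantifier split maps fairly naturally onto CAMEL's own `ChatAgent`. Below is a heavily simplified sketch of that pattern; it is not AutoGen's AgentEval module, it omits the verifier step, the prompts and the example task/solution strings are made up, and it assumes a recent CAMEL version where `ChatAgent` accepts a plain string system message and `step()` accepts a string and returns a response whose first message holds the text:

```python
# Sketch of an AgentEval-style critic/quantifier pair built from CAMEL agents.
# Illustrative only: not AutoGen's AgentEval API; ChatAgent constructor and
# step() signatures are assumed from recent CAMEL versions.
from camel.agents import ChatAgent

critic = ChatAgent(
    system_message=(
        "You are a critic. Given a task description, list the criteria "
        "(one per line) that a good solution should satisfy."
    )
)
quantifier = ChatAgent(
    system_message=(
        "You are a quantifier. Given criteria and a candidate solution, "
        "rate each criterion from 1 to 5 and justify each score."
    )
)

# Hypothetical task and candidate solution, for illustration only.
task = "Book the cheapest direct flight from Paris to Berlin next Monday."
solution = "The agent searched one airline's site and booked the first result."

criteria = critic.step(task).msgs[0].content
report = quantifier.step(
    f"Criteria:\n{criteria}\n\nSolution:\n{solution}"
).msgs[0].content
print(report)
```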