THUDM / AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
Apache License 2.0
2k stars 135 forks

Excellent Job! Well, no offense, it seems LLM-Bench rather than AgentBench in essence. #130

Open Konisberg opened 3 months ago

Konisberg commented 3 months ago

Sorry to raise the problem without offering a systematic analysis. It may take me more time to investigate the "compression" ability of LLMs more completely, since many support the view that "compression is intelligence". In my view, the abilities of agents today can hardly be called "autonomous"; prompting just guides the LLM to tell humans what it has compressed, which is more properly termed an ability of the LLM. Intelligence, in my opinion, is strongly connected to the idea revealed by physics, evolution, and maybe complex networks that "more is different". In brief, "intelligence" == "more is different": from a massive amount of data and other ingredients, structure and even "free will" emerge, which we, the individuals of an isomorphic network, may call intelligence. Agents should be like us, and, as the Turing test suggests, the eight tasks may NOT represent the core abilities of agents. Of course, the discussion above is NOT solid at all. If you take agents as tools, it's quite a different thing lol.

Back to the main topic: agents today mostly show the abilities of the underlying LLM, and it's hard to distinguish agent benchmarks from LLM benchmarks. You're welcome to discuss this, and I hope you can open the Discussions section of the repo. Good luck. And Paper++

zhc7 commented 3 months ago

Hi, @Konisberg Thank you for your comment! It's an interesting idea. I think one of the purposes of this benchmark is to offer some truly challenging, real-world problems. As you may know, traditional QA or multiple-choice benchmarks sometimes fail to reflect a model's true performance.

As for the topics of autonomy, intelligence, and even free will, I believe we are still quite far from there right now. No one can define exactly what true intelligence is. AgentBench can be a milestone but not a destination; there's still a long way to go.

We've opened a discussion section as suggested. Feel free to share more thoughts!