THUDM / AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
https://llmbench.ai
Apache License 2.0
2.01k stars 136 forks

OS std test set results #128

Open webdxq opened 3 months ago

webdxq commented 3 months ago

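In the OS test result files, almost none of the results fall into the "commit" category. If the criterion for a correct answer is simply that a bash command runs and exits normally, it is hard to guarantee that the original question was actually answered correctly, as in the following case.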


Original question

As a student, you are given a directory named log_files containing log files from multiple servers. The log files are named as "server1.log", "server2.log", etc. Each log file contains a list of errors observed on that server.

The error messages have a specific format: a timestamp followed by an error code and error message, separated by colons. For example:

2022-02-28T10:30:23Z:ERR0001:Permission denied.
2022-02-28T10:31:42Z:ERR0003:Failed to connect to the database.

Your task is to calculate the total number of errors with the error code 'ERR0003' found in all log files present in the log_files directory.

The answer must be an integer representing the total count of the 'ERR0003' error code in all log files.

init bash

echo "2022-02-28T10:30:23Z:ERR0001:Permission denied." > log_files/server1.log
echo "2022-02-28T10:31:42Z:ERR0003:Failed to connect to the database." >> log_files/server1.log

echo "2022-02-28T10:40:12Z:ERR0002:Invalid input." > log_files/server2.log
echo "2022-02-28T10:45:19Z:ERR0003:Failed to connect to the database." >> log_files/server2.log
echo "2022-02-28T10:50:28Z:ERR0003:Failed to connect to the database." >> log_files/server2.log

Longin-Yu commented 3 months ago

This task is designed to be multi-turn. If the action is bash, the agent receives the OS output; the ideal solution is to echo in the first round and commit in the second.

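Imagine a scenario: if you asked a person to do this task, your ultimate goal would be for them to tell you the answer, not merely to print a number in the terminal.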
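
To make the first round concrete, here is a minimal sketch of the kind of bash command an agent might run for the question above. It recreates the sample data from the init script in a temporary directory; the `mkdir -p log_files` step and the variable names `workdir` and `count` are assumptions for illustration, since the quoted init script does not show how the directory is created. The second-round commit is framework-specific and is not shown here.

```shell
#!/usr/bin/env bash
set -eu

# Assumption: work in a throwaway directory so the sketch is self-contained.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p log_files   # assumption: not shown in the quoted init script

# Sample data, copied from the init script in the issue.
echo "2022-02-28T10:30:23Z:ERR0001:Permission denied." > log_files/server1.log
echo "2022-02-28T10:31:42Z:ERR0003:Failed to connect to the database." >> log_files/server1.log
echo "2022-02-28T10:40:12Z:ERR0002:Invalid input." > log_files/server2.log
echo "2022-02-28T10:45:19Z:ERR0003:Failed to connect to the database." >> log_files/server2.log
echo "2022-02-28T10:50:28Z:ERR0003:Failed to connect to the database." >> log_files/server2.log

# Round 1 (bash action): count lines whose error-code field is ERR0003
# across all log files. Concatenating first yields one total instead of
# a per-file count.
count=$(cat log_files/*.log | grep -c ':ERR0003:')
echo "$count"   # prints 3 for the sample data above
```

With this output in hand, the agent would then submit `3` as its answer in the next round rather than stopping once the number has been printed.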