THUDM / AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
https://llmbench.ai
Apache License 2.0

How to test on custom data? #83

Closed Reason-Wang closed 7 months ago

Reason-Wang commented 7 months ago

With the following code snippet, I got output like this:

import requests
from src.typings import StartSampleRequest  # assumed import path within the AgentBench repo

result = requests.post(
    "http://xxx.xx.xx.xx:5000/api/start_sample",
    json=StartSampleRequest(name="os-std", index="std-007-bootstrap-00001").dict(),
)
print(result.json())
>>> {'session_id': 30, 'output': {'index': None, 'status': 'running', 'result': None, 'history': [{'role': 'user', 'content': 'You are an assistant that will act like a person, I\'will play the role of linux(ubuntu) operating system. Your goal is to implement the operations required by me or answer to the question proposed by me. For each of your turn, you should first think what you should do, and then take exact one of the three actions: "bash", "finish" or "answer". \n\n1. If you think you should execute some bash code, take bash action, and you should print like this:\n\nThink: put your thought here.\n\nAct: bash\n\n```bash\n# put your bash code here\n```\n\n2. If you think you have finished the task, take finish action, and you should print like this:\n\nThink: put your thought here.\n\nAct: finish\n\n3. If you think you have got the answer to the question, take answer action, and you should print like this:\n\nThink: put your thought here.\n\nAct: answer(Your answer to the question should be put in this pair of parentheses)\n\nIf the output is too long, I will truncate it. The truncated output is not complete. You have to deal with the truncating problem by yourself. Attention, your bash code should not contain any input operation. Once again, you should take only exact one of the three actions in each turn.\n\nNow, my problem is:\n\ntell me how many files are in the directory "/etc"?'}, {'role': 'agent', 'content': 'Think: To count the files in /etc, I need to print all the files in it.\n\nAct: bash\n\n```bash\nls /etc\n```'}, {'role': 'user', 'content': 'The output of the OS:\ncpi cron.hourly fuse.conf iproute2 lvm networkd-dispatcher protocols selinux tmpfiles.d [truncated because the output is too long]'}, {'role': 'agent', 'content': 'Think: The output has been truncated because it is too long, so I need to count files by script directly.\n\nAct: bash\n\n```bash\nls -1 /etc | wc -l\n```'}, {'role': 'user', 'content': 'The output of the OS:\n220'}, {'role': 'agent', 'content': 'Think: Now I get the answer, it is 220.\n\nAct: answer(220)'}, {'role': 'user', 'content': "Now, I will start a new problem in a new OS. My problem is:\n\nIn this problem, your task is to find the total count of processes that are currently running on the student's Linux (Ubuntu) operating system having 'bash' in their command name or path. The answer must be an integer."}]}}
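
The 'status': 'running' field and the trailing user turn suggest the worker is now waiting for the agent's next reply on the same session. A minimal sketch of what that follow-up call might look like, assuming an /api/interact endpoint that accepts the session_id and the agent's response; the endpoint name and payload fields are assumptions, not a confirmed part of the API:

import requests

# Sketch only: the /api/interact endpoint name and its payload fields are
# assumptions inferred from the session_id returned by start_sample.
agent_reply = (
    "Think: I can count matching processes by filtering ps output.\n\n"
    "Act: bash\n\n"
    "```bash\nps -eo args | grep bash | grep -v grep | wc -l\n```"
)
result = requests.post(
    "http://xxx.xx.xx.xx:5000/api/interact",
    json={"session_id": 30, "agent_response": agent_reply},
)
print(result.json())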

It seems that I need to input an index, and the system will then test the pre-defined example corresponding to that index. Is there any way to test new examples? For example, would simply replacing the data in the data folder work? If so, how should I set the index (i.e., what is the naming rule for the index corresponding to an example)?

Also, it would be even better if I could directly set the question in the start_sample request, like this:

result = requests.post(
    "http://xxx.xx.xx.xx:5000/api/start_sample",
    json={"name": "os-std", "question": "In this problem, your task is to ..."},
)

Is there any way to support this?

zhc7 commented 7 months ago

Hi @Reason-Wang, first of all, I would like to clarify that this API is not designed to be requested manually.

> It seems that I need to input an index, and the system will then test the pre-defined example corresponding to that index. Is there any way to test new examples? For example, would simply replacing the data in the data folder work? If so, how should I set the index (i.e., what is the naming rule for the index corresponding to an example)?

There are several ways to test custom data. If you wish to set up a new task setting, you can implement the Task class in src/server/task.py and start a new task worker; see the sketch below. Or, if you just want to add more data to an existing task, simply replacing the data in the data folder will work; the indexing rule can be found in the corresponding task definition code, for example, here.
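
For the first route, a custom task might look roughly like this. This is a minimal sketch; the exact Task and Session interfaces in src/server/task.py may differ from what is assumed here, so verify the method names and signatures against the actual code:

import json
from src.server.task import Task, Session

class MyCustomTask(Task):
    # Sketch only: method names and signatures are assumptions modeled on
    # src/server/task.py; adapt them to the real Task interface.
    def __init__(self, data_file: str, **configs):
        super().__init__(**configs)
        with open(data_file) as f:
            self.data = json.load(f)  # e.g. a list of {"question": ..., "answer": ...}

    def get_indices(self):
        # Whatever is returned here becomes the `index` accepted by start_sample.
        return list(range(len(self.data)))

    async def start_sample(self, index, session: Session):
        item = self.data[index]
        session.inject({"role": "user", "content": item["question"]})
        response = await session.action()
        # ...compare `response` with item["answer"] and return the result...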

> Also, it would be even better if I could directly set the question in the start_sample request, like this:

result = requests.post(
    "http://xxx.xx.xx.xx:5000/api/start_sample",
    json={"name": "os-std", "question": "In this problem, your task is to ..."},
)

> Is there any way to support this?

As mentioned before, this API is designed to be called by the assigner. That way it is easier for the program to keep track of which questions have been tested and which haven't.
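
Conceptually, the assigner enumerates the indices a task exposes and starts one sample per index, which is how it tracks what has been tested. If you want to approximate that loop by hand for quick experiments, a rough sketch follows, assuming a /api/get_indices endpoint exists alongside /api/start_sample; the endpoint name and request shape are assumptions to check against the task worker code:

import requests

# Sketch of an assigner-style loop; /api/get_indices and its request
# shape are assumptions, not a confirmed part of the API.
base = "http://xxx.xx.xx.xx:5000/api"
indices = requests.post(f"{base}/get_indices", json={"name": "os-std"}).json()
for index in indices:
    result = requests.post(
        f"{base}/start_sample",
        json={"name": "os-std", "index": index},
    )
    print(index, result.json()["output"]["status"])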