THUDM / AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
https://llmbench.ai
Apache License 2.0

Custom task or test set #44

Closed mahmoudialireza closed 1 year ago

mahmoudialireza commented 1 year ago

Hello Team, Is it possible to create a customized test set for a specific task (for example, medical or financial) and use this tool to evaluate fine-tuned models? Thanks in advance.

zhc7 commented 1 year ago

Yes. We are working on a newer version of the framework for easier deployment and extension.

mahmoudialireza commented 1 year ago

Thanks. For now, how can I do that? Is there any documentation or instructions for that?

Xiao9905 commented 1 year ago

@mahmoudialireza Hi, thanks for your interest in AgentBench. Our new version v0.2 has been pushed to the repo; please take a look at our README to find the new documentation. Feel free to reopen this issue if you need further help.

chiyuzhang94 commented 5 months ago

Hi @Xiao9905 @zhc7 ,

I wonder how to add a new task to AgentBench. Could you point me to the guide for adding a new dataset/task?

Thanks.

zhc7 commented 5 months ago

Hi @chiyuzhang94 I just added the guide: https://github.com/THUDM/AgentBench/blob/main/docs/Extension_en.md

chiyuzhang94 commented 5 months ago

Hi @zhc7 ,

I have prepared the task.py script for my new task, created a new YAML file for it, added that YAML to "task_assembly.yaml", and tried to run the task with python -m src.assigner --config configs/assignments/new_task, but I got this error:

  File "/home/AgentBench/src/typings/config.py", line 83, in post_validate
    assert (
AssertionError: Task new_task is not defined.

I wonder where I should define the task.

zhc7 commented 5 months ago

Hi @chiyuzhang94 , your steps are correct. Can you show me your config YAMLs, including task_assembly.yaml and all the YAML files you added or modified?

chiyuzhang94 commented 5 months ago

Thanks. Here they are

AgentBench/configs/assignments/spam_email.yaml

import: definition.yaml

concurrency:
  task:
    spam_email: 5
  agent:
    gpt-3.5-turbo: 5

assignments: # List[Assignment] | Assignment
  - agent: # "task": List[str] | str ,  "agent": List[str] | str
      - gpt-3.5-turbo
    task:
      - spam_email

output: "outputs/{TIMESTAMP}"

AgentBench/configs/tasks/spam_email.yaml

default:
  module: src.server.tasks.spam_email.SpamEmail
  parameters:
    data_path: "data/spam_email/"
    max_step: 5

task_assembly.yaml

default:
  docker:
    command: umask 0; [ -f /root/.setup.sh ] && bash /root/.setup.sh;

import:
  - webshop.yaml
  - dbbench.yaml
  - mind2web.yaml
  - card_game.yaml
  - kg.yaml
  - os.yaml
  - ltp.yaml
  - alfworld.yaml
  - avalon.yaml
  - spam_email.yaml

zhc7 commented 5 months ago

So the problem here is that you didn't actually define the task. What you have to do is change default to spam_email in configs/tasks/spam_email.yaml. The logic here is that the assignment config needs to get task definitions from the task assembly, which imports all the tasks from the different configs. You can view import as something like include in C: it essentially copies in the imported file, regardless of the file name. The reason why all the configs we provide have a default field is that there are actually several tasks within one file that share some fields. More information can be found in docs/Config_en.md. I hope this solves your problem! @chiyuzhang94
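
For reference, here is a minimal sketch of the corrected configs/tasks/spam_email.yaml following the fix described above: the top-level key becomes the task name (spam_email) instead of default, so that task_assembly.yaml, which imports this file, actually defines a task called spam_email. The module path and parameters are the ones from the config posted earlier in this thread.

AgentBench/configs/tasks/spam_email.yaml (corrected sketch)

spam_email:  # the key is the task name that the assignment config references
  module: src.server.tasks.spam_email.SpamEmail
  parameters:
    data_path: "data/spam_email/"
    max_step: 5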

chiyuzhang94 commented 5 months ago

Thanks for the prompt reply. This solved the issue.

chiyuzhang94 commented 5 months ago

Hi @zhc7 ,

I have a question about how to debug. I found that it is hard to debug interactively because of the use of multiple processes and the server. I wonder if you have any experience or suggestions on how to debug and inspect the outputs in task scripts (e.g., task.py in AgentBench/src/server/tasks/xxxx/).

Thanks.

zhc7 commented 5 months ago

I assume you mean something like attaching a debugger to the process, right? I suggest you first set the number of processes to 1. Then you can start a task worker manually and attach a debugger to it. Also, you can add some print statements or assertions in your task file to check whether everything is working as expected.
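
As a concrete illustration (a sketch, assuming the concurrency section of the assignment config is what controls how many parallel workers are spawned), the spam_email assignment posted earlier can be dropped to a single worker per task and per agent, which makes attaching a debugger or reading print output much easier to follow:

AgentBench/configs/assignments/spam_email.yaml (debugging sketch)

import: definition.yaml

concurrency:
  task:
    spam_email: 1  # one task worker, so breakpoints and prints are easy to trace
  agent:
    gpt-3.5-turbo: 1  # one agent worker as well

assignments:
  - agent:
      - gpt-3.5-turbo
    task:
      - spam_email

output: "outputs/{TIMESTAMP}"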