THUDM / AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
https://llmbench.ai
Apache License 2.0

init commit avalon #60

Closed HenryCai11 closed 8 months ago

HenryCai11 commented 9 months ago

AvalonBench

Quick Start

Start the task server and the assigner

Start the game (`3` is the number of workers):

```bash
python -m src.start_task -a --start avalon-dev-single 3
```

Start the assigner:

```bash
python -m src.assigner --config ./configs/assignments/test_avalon.yaml
```

Customize configurations and data

1. You can modify the file `configs/tasks/avalon.yaml` to configure the agent list. A config file looks like this:

```yaml
default:
  module: "src.server.tasks.avalon.AvalonBench"
  parameters:
    num_players: 5
    discussion: False

avalon-dev-naive:
  parameters:
    name: "AvalonBench-dev-naive"
    data_file: "data/avalon/dev.json"
    agent_list: ["naive", "naive", "naive", "naive", "naive"]

avalon-dev-single:
  parameters:
    name: "AvalonBench-dev-single"
    data_file: "data/avalon/dev.json"
    agent_list: ["llm", "naive", "naive", "naive", "naive"]
```

where `naive` stands for the naive bots. Agents play the roles at the same index in the data file (see below).

   Note: There should only be one `"llm"` in the `agent_list`.
2. You can also add data in `data/avalon/dev.json` (Note: currently we only support the 5-player game setting, which includes 1 Merlin, 2 Servants, 1 Minion, and 1 Assassin). A data item looks like this:
```json
{
    "num_players": 5,
    "quest_leader": 0,
    "role_names": ["Assassin", "Servant", "Servant", "Merlin", "Minion"]
}
```

where `quest_leader` is the id of the initial quest leader in this game. You can change the game setup by setting `quest_leader` to any number from 0 to 4 and by permuting `role_names` (see the sketch after this list).
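If you edit these files programmatically, a minimal helper along the following lines may be useful. It is not part of the repo; it assumes `data/avalon/dev.json` holds a JSON array of items in the format above and that the YAML layout matches the example in step 1.

```python
# add_game.py -- hypothetical helper, not part of AgentBench.
# Checks configs/tasks/avalon.yaml against the "one llm" rule and
# appends a randomized 5-player game to data/avalon/dev.json.
import json
import random

import yaml  # PyYAML

CONFIG_PATH = "configs/tasks/avalon.yaml"
DATA_PATH = "data/avalon/dev.json"

with open(CONFIG_PATH) as f:
    config = yaml.safe_load(f)

# Every task entry (other than "default") should list at most one "llm".
for name, task in config.items():
    if name == "default":
        continue
    agents = task["parameters"]["agent_list"]
    assert agents.count("llm") <= 1, f"{name}: more than one 'llm' in agent_list"

# Build a new game: a random initial leader (0-4) and a permutation of the
# fixed 5-player role multiset (1 Merlin, 2 Servants, 1 Minion, 1 Assassin).
roles = ["Merlin", "Servant", "Servant", "Minion", "Assassin"]
random.shuffle(roles)
item = {
    "num_players": 5,
    "quest_leader": random.randint(0, 4),
    "role_names": roles,
}

with open(DATA_PATH) as f:
    games = json.load(f)
games.append(item)
with open(DATA_PATH, "w") as f:
    json.dump(games, f, indent=4)
```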

Naive experiment

You can also start a naive experiment using:

```bash
python -m src.start_task -a --start avalon-dev-naive 3
```

where all the agents are naive bots. For details of the naive strategies, please refer to the paper.

Prompts

All the prompts are maintained in src/server/tasks/avalon/prompt.py; you can see how they are used in src/server/tasks/avalon/agents/llm_with_discussion.py and src/server/tasks/avalon/wrapper.py.

Results

Results of single-setting games

```json
{
    "total": 20,
    "validation": {
        "running": 0.0,
        "completed": 0.95,
        "agent context limit": 0.0,
        "agent validation failed": 0.05,
        "agent invalid action": 0.0,
        "task limit reached": 0.0,
        "unknown": 0.0,
        "task error": 0.0,
        "average_history_length": 11.0,
        "max_history_length": 14,
        "min_history_length": 2
    },
    "custom": {
        "Win rate of Player 0": 0.15,
        "Avg deduction acc of Player 0": 0.5399999999999998,
        "Valid number of games": 19
    }
}
```
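For orientation, the fields under `"custom"` are per-player aggregates over the valid games. Below is a minimal sketch of how such numbers could be computed; the per-game record fields (`player0_won`, `player0_deduction_acc`) are illustrative placeholders, not names from the repo's actual output.

```python
# Illustrative aggregation only -- the record fields are made up here,
# not taken from AgentBench's actual output format.
games = [
    {"valid": True, "player0_won": False, "player0_deduction_acc": 0.50},
    {"valid": True, "player0_won": True, "player0_deduction_acc": 0.75},
    # ... one record per game
]

valid = [g for g in games if g["valid"]]
win_rate = sum(g["player0_won"] for g in valid) / len(valid)
avg_deduction_acc = sum(g["player0_deduction_acc"] for g in valid) / len(valid)

print({
    "Win rate of Player 0": win_rate,
    "Avg deduction acc of Player 0": avg_deduction_acc,
    "Valid number of games": len(valid),
})
```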