
AgentTuning: Enabling Generalized Agent Abilities For LLMs

🤗 Model (AgentLM-70B) • 🤗 Dataset (AgentInstruct) • 📃 Paper • 🌐 Project Page (https://thudm.github.io/AgentTuning/)


中文版 (Chinese)

AgentTuning is the first attempt to instruction-tune LLMs with interaction trajectories across multiple agent tasks. Evaluation results indicate that AgentTuning endows LLMs with agent capabilities that generalize robustly to unseen agent tasks while preserving strong general language abilities. We have open-sourced the AgentInstruct dataset and the AgentLM models.

Main Result

Figure 1: Overall score on our held-in and held-out tasks

AgentInstruct

AgentInstruct is a meticulously curated dataset of 1,866 high-quality interaction trajectories designed to enhance AI agents across 6 diverse real-world tasks.

The AgentInstruct dataset is available on the 🤗 Hugging Face Hub.
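
For convenience, here is a minimal sketch of loading it with the 🤗 datasets library; the Hub id THUDM/AgentInstruct and the per-task split layout are assumptions based on the links above, so check the dataset card for the authoritative layout.

from datasets import load_dataset

# Assumption: the dataset lives at "THUDM/AgentInstruct" on the Hugging Face
# Hub and exposes one split per agent task; verify against the dataset card.
dataset = load_dataset("THUDM/AgentInstruct")
print(dataset)  # lists the available task splits and their sizes

# Each record is a multi-turn interaction trajectory.
first_split = next(iter(dataset.values()))
print(first_split[0])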

AgentLM

The AgentLM models are fine-tuned from the Llama-2-chat series by mixed training on the AgentInstruct dataset and the ShareGPT dataset.

The models follow the conversation format of Llama-2-chat, with the system prompt fixed as "You are a helpful, respectful and honest assistant."
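
For reference, here is a minimal sketch of assembling that prompt format in Python; the [INST]/<<SYS>> layout mirrors the curl example in the Run AgentLM section below.

SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

def build_prompt(user_message: str) -> str:
    # Llama-2-chat wraps the system prompt in <<SYS>> tags inside the
    # first [INST] block.
    return f"[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n{user_message} [/INST]"

print(build_prompt("Hello!"))
# [INST] <<SYS>>
# You are a helpful, respectful and honest assistant.
# <</SYS>>
#
# Hello! [/INST]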

The 7B, 13B, and 70B models are available on the Hugging Face model hub.

| Model | Hugging Face Repo |
| --- | --- |
| AgentLM-7B | 🤗 Hugging Face Repo |
| AgentLM-13B | 🤗 Hugging Face Repo |
| AgentLM-70B | 🤗 Hugging Face Repo |
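
If you prefer running the models directly with transformers instead of TGI, the following is a minimal sketch; the Hub id THUDM/agentlm-7b is an assumption based on the repo links above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the 7B checkpoint lives at "THUDM/agentlm-7b" on the Hub.
tokenizer = AutoTokenizer.from_pretrained("THUDM/agentlm-7b")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/agentlm-7b",
    torch_dtype=torch.float16,
    device_map="auto",  # requires the `accelerate` package
)

prompt = (
    "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant."
    "\n<</SYS>>\n\nHello! [/INST]"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))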

Run AgentLM

We use Text-Generation-Inference to accelerate the evaluation process.

You can start an AgentLM-70B instance with:

cd docker
docker compose -f agentlm-70b.yml up

Upon successful startup, the service will be listening on port 30070. Here is an example request:

curl 127.0.0.1:30070/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant.\n<</SYS>>\n\nHello! [/INST]", "parameters":{"temperature": 1.0}}'

# {"generated_text":"Hello! How can I help you today? "}

You can replicate the services in the Docker Compose file to create multiple inference instances if more GPUs are available.
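
As a sketch of that pattern, a small Python client can distribute requests across the replicas; port 30070 comes from the compose file above, while any additional ports are hypothetical and depend on how you replicate the service.

import itertools
import requests

# Port 30070 is from the compose file above; extra replicas (e.g. 30071) are
# hypothetical and depend on how you duplicate the service definition.
ENDPOINTS = itertools.cycle(["http://127.0.0.1:30070"])

def generate(prompt: str, temperature: float = 1.0) -> str:
    # TGI exposes a /generate route, as shown in the curl example above.
    payload = {"inputs": prompt, "parameters": {"temperature": temperature}}
    resp = requests.post(f"{next(ENDPOINTS)}/generate", json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["generated_text"]

print(generate("[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant."
               "\n<</SYS>>\n\nHello! [/INST]"))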

Evaluation

Here are the details of our evaluation tasks: 6 held-in tasks and 6 held-out tasks.

Held-in Tasks

The 6 held-in tasks are selected from AgentBench. However, since AgentBench is still under active development, results from its latest branch may not fully reproduce the numbers reported in the paper. The evaluation code for this project is located in ./AgentBench.old.

Held-out Tasks

Held-out tasks are recompiled from the following frameworks:

| Task | AgentTuning Setup | Original Repo |
| --- | --- | --- |
| SciWorld | 📂 eval_heldout/science-world | 💻 allenai/ScienceWorld |
| MiniWoB++ | 📂 eval_heldout/miniwob++ | 💻 Farama-Foundation/miniwob-plusplus |
| HotpotQA | 📂 eval_heldout/hotpotQA | 💻 salesforce/BOLAA |
| ReWOO | 📂 eval_heldout/rewoo | 💻 billxbf/ReWOO |
| WebArena | 📂 eval_heldout/webarena | 💻 web-arena-x/webarena |
| Digital Card Game | 💻 AgentBench.old (Extend split) | 💻 THUDM/AgentBench |

General Tasks

General language ability is evaluated on MMLU, GSM8K, and MT-Bench.

MMLU Setup:

GSM8K Setup:

MT-Bench Setup:

Citation

If you find our work useful, please consider citing AgentTuning:

@misc{zeng2023agenttuning,
      title={AgentTuning: Enabling Generalized Agent Abilities for LLMs},
      author={Aohan Zeng and Mingdao Liu and Rui Lu and Bowen Wang and Xiao Liu and Yuxiao Dong and Jie Tang},
      year={2023},
      eprint={2310.12823},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}