Homepage | Dataset | arXiv
Official repo for Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents (ACL 2024 Main Conference)
Authors: Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, Bill Yuchen Lin.
We introduce ETO (Exploration-based Trajectory Optimization), an agent learning framework inspired by the "trial and error" process of human learning. ETO allows an LLM agent to iteratively collect failure trajectories and update its policy by learning from contrastive failure-success trajectory pairs.
ETO has the following features:
There are three main folders in this project: `envs`, `eval_agent`, and `fastchat`.

- `envs`: the interaction environments of WebShop and ScienceWorld. We transform the original WebShop repo into a package.
- `eval_agent`: the evaluation framework of agent tasks, which is inspired by MINT.
- `fastchat`: training scripts for SFT and DPO, which is a modified version of FastChat.
bash setup.sh
The setup script, among other things, downloads the expert trajectory data.
You can also manually download the expert trajectories from Google Drive or Huggingface Datasets.
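If you take the Huggingface Datasets route, the trajectories can be loaded with the `datasets` library. A minimal sketch, where the dataset ID is a placeholder for the one linked from the Dataset badge above:

```python
from datasets import load_dataset

# Placeholder ID; substitute the dataset linked from this repo.
trajectories = load_dataset("<HF_DATASET_ID>")

print(trajectories)              # shows the available splits and their sizes
print(trajectories["train"][0])  # inspect one expert trajectory (assumes a "train" split)
```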
The bash script `run_eto.sh` implements the ETO pipeline. For example, you can run:
# Optional tasks: webshop, sciworld, alfworld
bash run_eto.sh webshop <EXP_NAME> <YOUR_MODEL_PATH> <YOUR_SAVE_PATH>
The script performs the ETO pipeline: SFT on the expert trajectories, exploration on the training tasks to collect failure trajectories, construction of contrastive failure-success trajectory pairs, and DPO training on those pairs, repeated over several iterations.
To evaluate the agent, first launch the FastChat controller:
python -m fastchat.serve.controller
Then, launch the FastChat model worker:
python -m fastchat.serve.model_worker --model-path <YOUR_MODEL_PATH> --port 21002 --worker-address http://localhost:21002
Finally, evaluate the agent:
python -m eval_agent.main --agent_config fastchat --model_name <YOUR_MODEL_NAME> --exp_config <TASK_NAME> --split test --verbose
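If the evaluation cannot find your model, you can ask the controller which workers have registered. This sketch assumes FastChat's default controller address (`http://localhost:21001`) and its `/list_models` endpoint; adjust if your setup differs:

```python
import requests

# FastChat's controller listens on port 21001 by default.
controller_address = "http://localhost:21001"

# Ask the controller which model workers have registered.
response = requests.post(f"{controller_address}/list_models")
print(response.json())  # expected to include <YOUR_MODEL_NAME> once the worker is up
```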
To add a new agent task:

- Task: add the task class in `eval_agent/tasks`. You should implement the `load_tasks` method, which returns a task generator (a hypothetical skeleton is sketched after this list).
- Environment: add the environment class in `eval_agent/envs`. The environment should parse the action generated by the LLM agent, execute the action, and return the observation. The tool/API calling should also be implemented in the environment.
- Prompt: put the task instruction and in-context example in `eval_agent/prompt`. The default setting is 1-shot evaluation.
- Config: add a task config in `eval_agent/configs/task`. The config defines which task class and environment class to load, and the settings of the environment (e.g., max action steps).
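A hypothetical skeleton of such a task class is sketched below. The class and file names, the constructor fields, and the `load_tasks` signature are illustrative assumptions, not `eval_agent`'s actual interfaces; follow the existing classes in `eval_agent/tasks` for the real ones.

```python
# eval_agent/tasks/my_task.py (hypothetical example)
import json
from typing import Iterator


class MyTask:
    """One instance of a custom agent task (illustrative skeleton only)."""

    def __init__(self, task_id: str, instruction: str):
        self.task_id = task_id
        self.instruction = instruction

    @classmethod
    def load_tasks(cls, split: str) -> Iterator["MyTask"]:
        """Return a task generator for the given split."""
        with open(f"data/my_task_{split}.json") as f:  # placeholder data path
            for item in json.load(f):
                yield cls(item["id"], item["instruction"])
```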
The SFT data follows the FastChat conversation format, for example:

[
{
"id": "example_0",
"conversations": [
{
"from": "human",
"value": "Who are you?"
},
{
"from": "gpt",
"value": "I am Vicuna, a language model trained by researchers from Large Model Systems Organization (LMSYS)."
},
{
"from": "human",
"value": "Have a nice day!"
},
{
"from": "gpt",
"value": "You too!"
}
]
}
]
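A small sanity check for data in this format; the file path is a placeholder for your own SFT data file:

```python
import json

# Placeholder path; point this at your own SFT data file.
with open("data/sft_data.json") as f:
    data = json.load(f)

for sample in data:
    roles = [turn["from"] for turn in sample["conversations"]]
    # Conversations alternate human/gpt turns, starting with "human".
    expected = ["human" if i % 2 == 0 else "gpt" for i in range(len(roles))]
    assert roles == expected, f"unexpected role order in {sample['id']}: {roles}"

print(f"Checked {len(data)} samples.")
```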
The DPO data consists of contrastive trajectory pairs in the following format:

[
{
"id": "identity_0",
"prompt": [
{
"from": "human",
"value": "Hello"
},
{
"from": "gpt",
"value": "Hi"
},
{
"from": "human",
"value": "Have a nice day!"
}
],
"chosen": [
{
"from": "gpt",
"value": "OK!"
},
{
"from": "human",
"value": "How are you?"
},
{
"from": "gpt",
"value": "I'm fine"
}
],
"rejected": [
{
"from": "gpt",
"value": "No, I'm bad"
}
]
}
]
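As a rough illustration of how a contrastive failure-success pair can be assembled into this format, here is a sketch; the helper name, the toy trajectory contents, and the shared-prefix convention are assumptions for illustration, not the repo's actual conversion code:

```python
import json


def build_dpo_sample(sample_id, shared_prefix, success_turns, failure_turns):
    """Pack one contrastive pair into the prompt/chosen/rejected format above.

    shared_prefix ends with a "human" turn; success_turns and failure_turns
    each start with a "gpt" turn, mirroring the example format.
    """
    return {
        "id": sample_id,
        "prompt": shared_prefix,
        "chosen": success_turns,
        "rejected": failure_turns,
    }


# Toy example; the contents are illustrative, not real trajectories.
pair = build_dpo_sample(
    "webshop_0",
    shared_prefix=[{"from": "human", "value": "Instruction: buy a red mug under $10."}],
    success_turns=[{"from": "gpt", "value": "Thought: I should search first.\nAction: search[red mug]"}],
    failure_turns=[{"from": "gpt", "value": "Action: click[buy now]"}],
)

with open("dpo_pairs.json", "w") as f:
    json.dump([pair], f, indent=2)
```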
If you find this repo helpful, please cite our paper:
@article{song2024trial,
author={Yifan Song and Da Yin and Xiang Yue and Jie Huang and Sujian Li and Bill Yuchen Lin},
title={Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents},
year={2024},
eprint={2403.02502},
archivePrefix={arXiv},
primaryClass={cs.CL}
}