
[ICLR 2024] MetaTool Benchmark: Deciding Whether to Use Tools and Which to Use

🌐 Dataset Website | πŸ“ƒ Paper | πŸ™‹ Welcome Contribution | πŸ“œ License

Introduction

We introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool-usage awareness and can correctly choose tools. It includes the ToolE dataset along with two evaluation tasks: tool usage awareness and tool selection.

ToolE Dataset

Dataset generation

We introduce the ToolE dataset with 21.1k diverse user queries related to tool usage. Each entry within the dataset comprises a user request (i.e., query) along with its corresponding tool name and tool description. These queries serve as triggers that prompt LLMs to utilize specific tools.
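For illustration, a single entry has roughly the following shape (the field names here are assumptions, not the released schema; check the dataset files for the exact keys):

```python
# Hypothetical ToolE entry; the actual key names may differ in the released files.
entry = {
    "query": "Can you translate this paragraph into French for me?",
    "tool": "translator",
    "description": "Translates text between different languages.",
}
```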

Dataset statistics

| Generation method | Model | Sample number |
|-----------------------|-----------------|----------------------------------------------|
| Direct generation | ChatGPT, GPT-4 | 11,700 |
| Emotional generation | ChatGPT | 7,800 |
| Keyword generation | ChatGPT | 1,950 |
| Details generation | ChatGPT | 7,800 |
| Multi-tool generation | ChatGPT, GPT-4 | 1,624 |
| After checking | \ | 21,127 (20,630 single-tool + 497 multi-tool) |

Dataset files

Evaluation Results

Tool usage awareness

Tool selection

Quick Start

First, create a .env file next to src/generation/.example.env, containing the same fields as the example file.
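To sanity-check that the file is picked up, something like the following works, assuming the project reads it with python-dotenv (the OPENAI_API_KEY field name is an assumption; copy the real field names from .example.env):

```python
# Hypothetical check that the .env file loads; the field name below is an
# assumption -- use the names listed in src/generation/.example.env.
import os

from dotenv import load_dotenv

load_dotenv("src/generation/.env")
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY not set"
```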

Now you can run the following command for a quickstart; it downloads the model and prepares the data for you:

```bash
bash quickstart.sh -m <model_name> -t <task>
```

Alternatively, you can perform the manual steps below, then follow the results-generation section.

Install the packages:

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

Download the models:

Tool embedding

We use Milvus to store tool embeddings and to conduct similarity search.

To install and run Milvus locally, see https://milvus.io/docs/install_standalone-docker.md.

Then run the following command to build the Milvus database:

```bash
python src/embedding/milvus_database.py
```
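Once the database is built, similarity search over tool embeddings looks roughly like the sketch below (the collection name, field names, and search parameters here are assumptions; the actual schema is defined in src/embedding/milvus_database.py):

```python
# Minimal sketch of a Milvus similarity search (pymilvus 2.x).
# Collection and field names are assumptions; see
# src/embedding/milvus_database.py for the actual schema.
from pymilvus import Collection, connections

connections.connect("default", host="localhost", port="19530")
collection = Collection("tool_embeddings")  # hypothetical collection name
collection.load()

query_embedding = [[0.1] * 768]  # embedding of the user query (dim is an assumption)
results = collection.search(
    data=query_embedding,
    anns_field="embedding",  # hypothetical vector field name
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,  # top-5 most similar tools
    output_fields=["tool_name"],  # hypothetical scalar field
)
for hits in results:
    for hit in hits:
        print(hit.entity.get("tool_name"), hit.distance)
```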

Construct prompt data:

The pre-defined prompt templates are in src/prompt/prompt_template; a toy illustration of how such a template gets filled appears below.
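As a rough illustration only (this template text and its placeholders are invented for the example, not taken from the repository's templates):

```python
# Toy example of filling a prompt template; the real templates live in
# src/prompt/prompt_template and differ from this invented one.
template = (
    "You have access to the following tools:\n"
    "{tool_list}\n\n"
    "User query: {query}\n"
    "Decide whether a tool is needed and, if so, which one."
)

prompt = template.format(
    tool_list="- translator: Translates text between different languages.",
    query="Can you translate this paragraph into French for me?",
)
print(prompt)
```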

If you want to generate the prompts for all tasks, run the following command:

```bash
python src/prompt/prompt_construction.py
```

For single-task prompts, run the following command:

```bash
python src/prompt/prompt_construction.py [task]
```

Replace [task] with one of the available task options.

Generate the results:

Parameters

You can generate results by running the run.sh script; you may need to adjust the parameters inside run.sh to suit your needs.

Troubleshooting

If you hit a Python import error, you may need to add the src directory to your Python path:

```bash
# Add src to the Python path
src_path="$(pwd)/src"
export PYTHONPATH="$PYTHONPATH:$src_path"
```
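Equivalently, a script can prepend src to the module search path itself; a minimal sketch, assuming it is run from the repository root:

```python
# Prepend <repo>/src to the module search path before any project imports.
import os
import sys

sys.path.insert(0, os.path.join(os.getcwd(), "src"))
```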

Citation

```bibtex
@article{huang2023metatool,
  title   = {MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use},
  author  = {Yue Huang and Jiawen Shi and Yuan Li and Chenrui Fan and Siyuan Wu and Qihui Zhang and Yixin Liu and Pan Zhou and Yao Wan and Neil Zhenqiang Gong and Lichao Sun},
  year    = {2023},
  journal = {arXiv preprint arXiv:2310.03128}
}
```