OpenBMB / ToolBench

[ICLR'24 spotlight] An open platform for training, serving, and evaluating large language model for tool learning.
https://openbmb.github.io/ToolBench/
Apache License 2.0
4.77k stars 402 forks source link

论文实验中I1_Inst、I1_tool I1_cat,请问什么意思 #212

Open shyoulala opened 10 months ago

shyoulala commented 10 months ago

你好,请问实验过程中的横坐标的数据区别是什么比如I1_Inst、I1_tool I1_cat image

MikuAndRabbit commented 9 months ago

The meaning of I1, I2 and I3 is mentioned in 2.2 Instruction Generation in the paper.

Sampling Strategies for Different Scenarios As shown in Figure 3, for the single-tool instructions (I1), we iterate over each tool and generate instructions for its APIs. However, for the multi-tool setting, since the interconnections among different tools in RapidAPI are sparse, random sampling tool combinations from the whole tool set often leads to a series of irrelevant tools that cannot be covered by a single instruction in a natural way. To address the sparsity issue, we leverage the RapidAPI hierarchy information. Since tools belonging to the same RapidAPI category or collection are generally related to each other in the functionality and goals, we randomly select 2-5 tools from the same category / collection and sample at most 3 APIs from each tool to generate the instructions. We denote the generated instructions as intra-category multi-tool instructions (I2) and intra-collection multi-tool instructions (I3), respectively. Through rigorous human evaluation, we find that instructions generated in this way already have a high diversity that covers various practical scenarios. We also provide visualization for instructions using Atlas (link) to support our claim.

The meaning of Inst., Tool and Cat. is mentioned in 3.2 Main Experiments in the paper.

Settings Ideally, by scaling the number and diversity of instructions and unique tools in the training data, ToolLLaMA is expected to generalize to new instructions and APIs unseen during training. This is meaningful since users can define customized APIs and expect ToolLLaMA to adapt according to the documentation. To this end, we strive to evaluate the generalization ability of ToolLLaMA at three levels: (1) Inst.: unseen instructions for the same set of tools in the training data, (2) Tool: unseen tools that belong to the same (seen) category of the tools in the training data, and (3) Cat.: unseen tools that belong to a different (unseen) category of tools in the training data.