How is the quality of a solution path evaluated during data construction?

OpenBMB / ToolBench

[ICLR'24 spotlight] An open platform for training, serving, and evaluating large language model for tool learning.

Apache License 2.0

4.62k stars, 397 forks

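When I use gpt-4-turbo to generate solution paths, the tool calls themselves work correctly, but the reasoning goes wrong, so the final answer is wrong. For example:

query: I'm going from Shanghai to Beijing next week, please book a flight ticket for me.
gpt4: Thought: I need to get the current date in order to work out next week's date. Action: getCurrentDate ActionInput: ""
function: 2023-12-13 (Wednesday)
gpt4: Today is 2023-12-13, so next week is 2023-12-20. The date, origin, and destination are known, so I will start booking the ticket.
.......
gpt4: Found flight number xxx, departing 2023-12-20 12:45 from Shanghai Hongqiao Airport ....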

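In the example above, the date reasoning is wrong: the model should have asked the user for the departure date instead of simply adding 7 days to today. In cases like this, every step of the process runs without error (all tool calls succeed), yet the final result is incorrect.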
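To make the ambiguity concrete, here is a minimal Python sketch of the date arithmetic. It is only an illustration and not part of ToolBench: the helper name `next_week_range` is hypothetical, and it assumes weeks start on Monday. It shows that "next week" relative to 2023-12-13 covers seven possible dates, and the date the model picked is just one of them.

```python
from datetime import date, timedelta


def next_week_range(today: date) -> tuple[date, date]:
    """Return (Monday, Sunday) of the calendar week after `today`.

    Hypothetical helper for illustration; assumes weeks start on Monday.
    """
    # date.weekday() returns 0 for Monday ... 6 for Sunday.
    days_until_next_monday = 7 - today.weekday()
    monday = today + timedelta(days=days_until_next_monday)
    return monday, monday + timedelta(days=6)


# Value returned by getCurrentDate in the example above.
today = date(2023, 12, 13)

start, end = next_week_range(today)
candidates = [start + timedelta(days=i) for i in range(7)]

# What the model did: today + 7 days.
guessed = today + timedelta(days=7)

print(f"next week spans {start} to {end}")
print(f"model guessed {guessed}; one of {len(candidates)} possible dates")
```

Because the query matches several departure dates, a correct solution path would ask the user which day they mean before calling the booking tool. A check that only verifies that each tool call succeeded would not catch this kind of error.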

How is the quality of a solution path evaluated during data construction? #216