Thank you for your attention. In fact, considering the diversity and complexity of tool learning, ToolEyes builds a capability evaluation system without standardized paths; details can be found in the paper.
Thank you for your quick response, Yunjie! I read the paper again, and I'm still not sure how to evaluate whether the models are using the correct tools.
Regarding Tool Selection, it seems like we only check whether the tools are called as they are documented and whether the reasoning leads to the chosen tools. But it doesn't check whether the tool parameters (if any) are actually correct, only whether they are valid? I don't think this is covered in Answer Organization either, unless we have a very accurate way to evaluate "the accuracy of the information conveyed" and to understand why inaccurate information would show up...
If you don't mind, could you please share more insights? I find the setting of this dataset quite nice and would like to use it for our work, if we can figure out whether the evaluation is reliable. Thank you!
We understand your confusion. Indeed, since there is no single prescribed path for problem-solving in tool learning (for example, an LLM may invoke a tool, encounter a failure, and then correctly re-invoke it), an individual step may not significantly impact the overall problem-solving process. Therefore, we advocate for assessing capabilities based on the entire process.
To determine the reasonableness of their overall tool invocations, we focus on four key aspects:
Tool and Parameter Validity: We confirm the appropriateness of the selected tools by scrutinizing the compatibility between the tools and their respective parameters; a minimal illustration of this kind of check is sketched after this list. This validation is essential to ensure correct tool selection.
Alignment with Logical Thinking: We evaluate whether the chosen tool aligns with the logical approach to solving the intended problem. This assessment helps us gauge if the tool choice is consistent with the problem-solving logic.
Coherence of Thought Process: We assess the quality and consistency of the overall thought process by examining whether it centers on the user's needs and whether it demonstrates reasonableness and effectiveness.
Correspondence between the Answer and the Query: To indirectly gauge the correctness of the entire process, we assess the quality of the final response, which can reflect the overall correctness of the process.
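To make the first aspect concrete, here is a minimal sketch of what a tool-and-parameter validity check could look like. This is not the ToolEyes evaluation code: the tool-documentation format, the call shape, and the function name below are assumptions made purely for illustration.

```python
# Sketch of a "Tool and Parameter Validity" check.
# The documentation format (tool name -> parameter specs with a "required"
# flag) and the call shape are assumptions, not the actual ToolEyes schema.

def is_valid_call(call: dict, tool_docs: dict) -> bool:
    """Return True if `call` names a documented tool and its arguments
    conform to that tool's documented parameters."""
    name = call.get("name")
    if name not in tool_docs:
        return False  # the model invented an undocumented tool

    spec = tool_docs[name]["parameters"]
    args = call.get("arguments", {})

    # Reject any argument the documentation does not define.
    if any(arg not in spec for arg in args):
        return False

    # Require every parameter the documentation marks as required.
    missing = [p for p, meta in spec.items()
               if meta.get("required", False) and p not in args]
    return not missing


# Hypothetical usage with a toy tool document and a model-generated call.
tool_docs = {
    "weather.lookup": {
        "parameters": {
            "city": {"type": "string", "required": True},
            "unit": {"type": "string", "required": False},
        }
    }
}
call = {"name": "weather.lookup", "arguments": {"city": "Berlin"}}
print(is_valid_call(call, tool_docs))  # True: documented tool, valid arguments
```

Note that this only establishes validity (the call conforms to the documentation), not the correctness of the argument values, which is exactly the distinction raised above.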
In conclusion, we encourage you to look beyond the correctness or incorrectness of individual tool invocations and instead focus on a comprehensive evaluation of the entire process. This approach allows for a more realistic assessment of LLMs' ability to effectively utilize tools in real-life scenarios. We hope this perspective proves helpful to you.
Ah got it! Yes, these points are mentioned in the paper's evaluation system. I am asking these questions because I am working on a different topic (more production related) where LLMs need to collect and filter information and make the right call in just one inference call. That's why I need a final "correct" tool call and can ignore the intermediate exploration steps. It seems I will need to wait for the evaluation code to see whether I can make use of this dataset. Thank you very much!
I only find the query and the path to the tools, but no labels (i.e., the final correct calls). Could you please also share these? Thanks!
With these, we could start using the dataset even without the inference or evaluation code.
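For reference, once labels of this kind become available, the final-call check described above could start as simply as the sketch below. The {"name": ..., "arguments": {...}} label format is an assumption for illustration; as noted, the released files currently contain queries and tool paths but no such gold calls.

```python
# Sketch of an exact-match check between a model's final tool call and a
# hypothetical gold label; the label format is assumed, not provided by the dataset.

def final_call_matches(pred: dict, gold: dict) -> bool:
    """True if the predicted call names the gold tool and supplies exactly
    the gold argument values (order-insensitive dict comparison)."""
    if pred.get("name") != gold.get("name"):
        return False
    return pred.get("arguments", {}) == gold.get("arguments", {})


pred = {"name": "search.web", "arguments": {"query": "tool learning survey"}}
gold = {"name": "search.web", "arguments": {"query": "tool learning survey"}}
print(final_call_matches(pred, gold))  # True
```

In practice one would likely relax exact matching (e.g., normalizing strings or allowing equivalent argument values), but that depends on how the labels end up being defined.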