ShishirPatil / gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
https://gorilla.cs.berkeley.edu/
Apache License 2.0

[BFCL] Refine Evaluation Metric for Multi Turn Categories #733

Closed HuanzhiMao closed 3 weeks ago

HuanzhiMao commented 4 weeks ago

This PR contains three changes.

Adjustments for Irrelevance Detection in Multi Turn Categories

In #725, we updated our evaluation metric for the multi-turn categories to require the model to call flag_task_unachievable when it determines a task cannot be completed.

This is an important change that can heavily affect model scores. However, we've identified that models might be overly penalized under this setup. Many models tend to call flag_task_unachievable frequently, even in base entries where it is not appropriate. For instance, when a model encounters an execution error after an incorrect action, it might call flag_task_unachievable, assuming the task is unachievable. Without this flag, the model would sometimes continue to explore and arrive at the correct action sequence. After careful consideration, flag_task_unachievable has been removed.
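Mechanically, removing the flag amounts to filtering it out of the tool list offered to the model. A minimal sketch of that idea (the function and variable names here are illustrative, not the actual BFCL code):

```python
# Names of functions removed from the multi-turn tool set.
# Illustrative only; mirrors the removal of flag_task_unachievable.
REMOVED_FUNCS = {"flag_task_unachievable"}


def prune_tools(tool_list):
    """Return the tool list with removed functions filtered out.

    Each tool is assumed to be a dict with at least a 'name' key,
    as is common for function-calling tool schemas.
    """
    return [tool for tool in tool_list if tool["name"] not in REMOVED_FUNCS]
```

With this pruning applied, the model can no longer short-circuit an entry by declaring it unachievable and must instead keep exploring the available actions.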

Due to the removal of flag_task_unachievable, we’ve adjusted the evaluation criteria accordingly. Instead of checking if the model produces no output in a turn with missing function/parameter information, we now assess whether the model can perform correctly once the missing information is provided.
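The adjusted criterion described above can be sketched as follows. This is a simplified illustration under assumed data structures (turns as dicts with missing_info, model_calls, and expected_calls keys), not the actual BFCL evaluation code:

```python
def evaluate_entry(turns):
    """Evaluate a multi-turn entry under the adjusted criterion.

    Turns where required function/parameter information is still missing
    are no longer required to produce empty output; they are skipped.
    Only turns where the missing information has been provided are
    checked against the ground-truth calls.
    """
    for turn in turns:
        if turn["missing_info"]:
            # Do not penalize whatever the model did before the
            # missing information was supplied.
            continue
        if turn["model_calls"] != turn["expected_calls"]:
            return False
    return True
```

The design choice here is that a model is judged on whether it recovers and acts correctly once it has everything it needs, rather than on staying silent while information is missing.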

Response Checker Addition

To improve evaluation accuracy, we've added a response checker for all multi-turn categories; it works alongside the existing state checker.

With this addition, an entry will now only be marked correct if it passes both the state and response checkers.
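The combination of the two checkers can be sketched as below. The checker bodies are placeholders (the real BFCL checkers compare backend instance state and execution results in more detail); what the sketch shows is the conjunction: both must pass for the entry to count as correct:

```python
def state_checker(final_state, expected_state):
    # Placeholder: compare the end-of-conversation backend state
    # against the ground-truth state.
    return final_state == expected_state


def response_checker(model_responses, ground_truth_responses):
    # Placeholder: require every ground-truth execution result to
    # appear among the model's execution results.
    return all(resp in model_responses for resp in ground_truth_responses)


def entry_is_correct(final_state, expected_state,
                     model_responses, ground_truth_responses):
    # An entry is marked correct only if it passes BOTH checkers.
    return (state_checker(final_state, expected_state)
            and response_checker(model_responses, ground_truth_responses))
```

Requiring both checks closes a gap where a model could reach the right final state through wrong intermediate calls, or produce the right call outputs without leaving the backend in the expected state.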

Dataset Adjustments

A few dataset entries have been modified to align with these changes.