[BFCL] Refine Evaluation Metric for Multi Turn Categories

This PR contains three changes.

Adjustments for Irrelevance Detection in Multi Turn Categories

In #725, we changed the multi-turn evaluation metric as follows:

During evaluation, if flag_task_unachievable is called in a turn, that turn will be marked correct for irrelevance detection, even if other functions were also called in that turn.
If the model calls flag_task_unachievable in a normal (non-irrelevant) turn, that turn will be marked incorrect.
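
For reference, the two rules from #725 can be sketched as below. This is only an illustration, not the actual evaluation code: the function name `flag_rule_verdict` and its arguments are hypothetical, and it returns `None` in the cases the two rules above do not cover.

```python
from typing import List, Optional

FLAG = "flag_task_unachievable"


def flag_rule_verdict(called_functions: List[str], is_irrelevant_turn: bool) -> Optional[bool]:
    """Apply only the two #725 rules about flag_task_unachievable to one turn.

    Returns True or False when a rule decides the turn, and None when
    neither rule applies (hypothetical helper, for illustration only).
    """
    flagged = FLAG in called_functions
    if flagged and is_irrelevant_turn:
        # Marked correct for irrelevance detection, even if other
        # functions were also called in the same turn.
        return True
    if flagged and not is_irrelevant_turn:
        # Calling the flag in a normal turn marks the turn incorrect.
        return False
    # The flag was not called; the two rules above say nothing here.
    return None
```

Under the second rule, a model that calls the flag after a recoverable execution error in a normal turn is marked incorrect.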

This change can heavily affect a model's score, and we have found that it penalizes models too harshly. Many models call flag_task_unachievable frequently, even in base entries where it is not appropriate. For instance, when a model hits an execution error after an incorrect action, it might call flag_task_unachievable on the assumption that the task cannot be completed. Without this function available, the model would sometimes keep exploring and arrive at the correct action sequence. After careful consideration, we have removed flag_task_unachievable.

Due to the removal of flag_task_unachievable, we’ve adjusted the evaluation criteria accordingly. Instead of checking if the model produces no output in a turn with missing function/parameter information, we now assess whether the model can perform correctly once the missing information is provided.

Response Checker Addition

To improve evaluation accuracy, we’ve added a response checker for all multi-turn categories, which works alongside the state checker:

State Checker: This checker evaluates the state of the instance after each turn. Previously it was the only check we relied on. However, some functions (e.g., get_zipcode_by_city or estimate_distance) don’t directly alter the state, so the state alone cannot show whether the model actually invoked them.
Response Checker: The new checker compares the model’s execution result against the ground truth execution result, ensuring that the model’s result encompasses the ground truth (i.e., the ground truth must be a subset of the model result).

With this addition, an entry is now marked correct only if it passes both the state and response checkers.
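
As a rough sketch of how the response check and the combined pass condition could look: the names `response_check` and `entry_is_correct` are hypothetical, and the unordered, count-aware matching is an assumption made for this example, not necessarily what the checker in this PR does.

```python
from typing import Any, List, Tuple


def response_check(model_results: List[Any], ground_truth_results: List[Any]) -> Tuple[bool, List[Any]]:
    """Check that every ground truth execution result appears in the model's results.

    Assumption: order is ignored, and each model result can match at most
    one ground truth result. Extra model results are allowed.
    """
    remaining = list(model_results)  # copy so the caller's list is untouched
    missing = []
    for expected in ground_truth_results:
        if expected in remaining:
            remaining.remove(expected)  # consume the matched result
        else:
            missing.append(expected)
    return (not missing, missing)


def entry_is_correct(state_ok: bool, model_results: List[Any], ground_truth_results: List[Any]) -> bool:
    """An entry passes only if both the state check and the response check pass."""
    response_ok, _ = response_check(model_results, ground_truth_results)
    return state_ok and response_ok
```

For example, if the ground truth calls get_zipcode_by_city and the model never does, the final state can still match, but the zipcode result is missing from the model's results, so the entry fails.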

Dataset Adjustments

A few dataset entries have been modified to align with these changes.

ShishirPatil / gorilla