Open HumanMadeAgents opened 3 weeks ago
Hi @HumanMadeAgents,
Thanks for the issue.
The current evaluation method was introduced in #725, where we stated that:
- If `flag_task_unachievable` is called in a turn, that turn will be marked correct for irrelevance detection, even if other functions were also called in that turn.
- If `flag_task_unachievable` is called in a normal (non-irrelevant) turn, that turn will be marked incorrect.

In the example you provided, in the turn that is supposed to have missing information, the model makes some exploratory function calls but doesn't call `flag_task_unachievable` at the end. This means the model didn't realize that the task cannot be completed in full, so the turn is marked wrong for irrelevance detection.
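
To make the rule from #725 concrete, here is a minimal sketch of it as described above. It is not the actual BFCL checker: the function name `passes_flag_rule`, its parameters, and the string format of the calls are assumptions for illustration only.

```python
FLAG = "flag_task_unachievable"

def passes_flag_rule(calls, turn_is_irrelevant):
    """Hypothetical sketch of the #725 rule, not the real BFCL code.

    calls: function-call strings the model produced in one turn.
    turn_is_irrelevant: True if the turn is expected to be unachievable
    (e.g. a required parameter is missing).
    """
    flagged = any(call.startswith(FLAG + "(") for call in calls)
    if turn_is_irrelevant:
        # Correct only if the flag was raised; other calls are tolerated.
        return flagged
    # In a normal turn, raising the flag marks the turn incorrect.
    return not flagged
```

Under this sketch, a turn with exploratory calls but no `flag_task_unachievable` call fails the check when the turn is expected to be unachievable, which matches the behavior reported in this issue.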
This is an important change and can heavily affect the model's scores. As you pointed out, we've also identified that the model might be overly penalized under this setup. Many models tend to call this function frequently, even in base entries where it may not be appropriate. For instance, when a model encounters an execution error after an incorrect action, it might call `flag_task_unachievable`, assuming the task is unachievable. Without this flag, the model would sometimes continue to explore and arrive at the correct action sequence. So after careful consideration, the `flag_task_unachievable` restriction has been removed in #733.
Would love to hear your thoughts.
Hi, I used the latest BFCL code to evaluate ToolACE-8B and other models and found that the Miss Param score was very low; for example, ToolACE-8B scored only 2.5%.
Then I found that the most common cause of errors in Miss Param is `Model outputs valid function calls when it should not for turn *`. Is there a problem with this evaluation method? This is especially questionable when the turn involves parallel function calls, or when the missing params can actually be found in the context, such as:
multi_turn_miss_param_52
The `possible_answer` for turn 1 and turn 2 is `[[], ["check_tire_pressure()", "find_nearest_tire_shop()"]]`. When the model predicts `[["check_tire_pressure()"], ["find_nearest_tire_shop()"]]`, is it correct? However, BFCL directly treats the model prediction as wrong.
multi_turn_miss_param_25
The file name in turn 2 can be found in turn 0.
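
For reference, here is a small sketch of how I understand the error arises for the multi_turn_miss_param_52 example above. This is only my reading of the error message, not the actual BFCL checker, and the helper name `turns_with_unexpected_calls` is made up for illustration.

```python
def turns_with_unexpected_calls(possible_answer, prediction):
    """Return positions of turns where no call was expected but the model made one.

    Hypothetical helper: it only checks the "empty expected turn" case and
    does not compare the calls in the other turns.
    """
    return [
        i
        for i, (expected, predicted) in enumerate(zip(possible_answer, prediction))
        if not expected and predicted
    ]

# Values taken from the multi_turn_miss_param_52 example.
possible_answer = [[], ["check_tire_pressure()", "find_nearest_tire_shop()"]]
prediction = [["check_tire_pressure()"], ["find_nearest_tire_shop()"]]

# Position 0 is the first of the two turns listed above.
print(turns_with_unexpected_calls(possible_answer, prediction))  # [0]
```

Under this reading, the first of the two turns is rejected as soon as it contains any call, even though the same two calls are made across both turns in total.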