ShishirPatil / gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
https://gorilla.cs.berkeley.edu/
Apache License 2.0

[BFCL] Miss Param score is very low #738

Open HumanMadeAgents opened 3 weeks ago

HumanMadeAgents commented 3 weeks ago

Hi, I used the latest BFCL code to evaluate ToolACE-8B and other models, and found that the Miss Param score was very low; for example, ToolACE-8B scored only 2.5%.

I then found that the common cause of Miss Param errors is Model outputs valid function calls when it should not for turn *. Is there a problem with this evaluation method, especially when the turn involves parallel function calls, or when the missing params can actually be found in the context? For example:

multi_turn_miss_param_52

[
   [
      {
         "role":"user",
         "content":"I have secured my car by locking all doors and applying the parking brake. Would it be possible to start the engine so I can monitor the fuel level and battery status, ensuring smooth operation?"
      }
   ],
   [
      {
         "role":"user",
         "content":"I've initiated my vehicle, and I'd greatly appreciate it if you could take a look at the tire pressures. Should any tire display a pressure below this PSI value, kindly guide me to the nearest tire service facility for a fix."
      }
   ],
   [
      {
         "role":"user",
         "content":"I think below 40 PSI would need a fix."
      }
   ],
   [
      {
         "role":"user",
         "content":"Let's share the updates on tire status. Could you tweet 'Tires checked and engine purring smoothly!' using the hashtag #RoadTrip while tagging @AutoUpdates?"
      }
   ],
   [
      {
         "role":"user",
         "content":"Following the tweet, could you drop a comment saying, 'Safety first! Remember tire checks are crucial.' to emphasize tire care?"
      }
   ]
]

The possible_answer for turn 1 and turn 2 is [[], ["check_tire_pressure()", "find_nearest_tire_shop()"]]. When the model predicts [["check_tire_pressure()"], ["find_nearest_tire_shop()"]], isn't that effectively correct? However, BFCL directly marks the prediction as wrong.
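To make the concern concrete, here is a minimal, hypothetical sketch of a strict turn-aligned comparison (not the actual BFCL checker code), assuming an empty expected turn means "no valid calls allowed". Under such a rule, splitting the two calls across turn 1 and turn 2 fails both turns:

import copy  # stdlib only; no BFCL dependency

def strict_per_turn_match(possible_answer, model_prediction):
    """Per-turn correctness under a strict turn-aligned comparison (illustrative only)."""
    results = []
    for expected, predicted in zip(possible_answer, model_prediction):
        if not expected:
            # Miss-param turn: the model should not issue any valid function call here.
            results.append(len(predicted) == 0)
        else:
            # Normal turn: every expected call must appear in this same turn.
            results.append(all(call in predicted for call in expected))
    return results

expected = [[], ["check_tire_pressure()", "find_nearest_tire_shop()"]]
predicted = [["check_tire_pressure()"], ["find_nearest_tire_shop()"]]

print(strict_per_turn_match(expected, predicted))
# [False, False]: turn 1 fails because a valid call was made in the miss-param turn,
# and turn 2 fails because check_tire_pressure() arrived one turn early.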

multi_turn_miss_param_25

[
   [
      {
         "role":"user",
         "content":"Examine current directory for any presence of the file named 'summary.txt'. If there is one, What's inside?"
      }
   ],
   [
      {
         "role":"user",
         "content":"Copy it into 'Research2023'."
      }
   ],
   [
      {
         " role":"user",
         "content":"Post review, organize the lines in a file alphabetically."
      }
   ],
   [
      {
         "role":"user",
         "content":"The file is 'summary.txt'."
      }
   ],
   [
      {
         "role":"user",
         "content":"Conclude by calculating the total lines in the sorted 'summary.txt' to verify that everything is meticulously arranged."
      }
   ]
]

The file name needed in turn 2 can already be found in turn 0.
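As a hypothetical illustration of that point (not how BFCL or any particular model actually resolves parameters), even a trivial scan over the earlier turns recovers the file name:

import re

# Hypothetical illustration: with the full conversation history in context,
# the "missing" file name for turn 2 is recoverable from turn 0.
history = [
    "Examine current directory for any presence of the file named 'summary.txt'. "
    "If there is one, What's inside?",
    "Copy it into 'Research2023'.",
    "Post review, organize the lines in a file alphabetically.",
]

def recover_txt_file_name(turns):
    """Scan earlier turns (latest first) for a quoted *.txt name; illustrative heuristic only."""
    for turn in reversed(turns):
        match = re.search(r"'([\w.]+\.txt)'", turn)
        if match:
            return match.group(1)
    return None

print(recover_txt_file_name(history))  # summary.txt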

HuanzhiMao commented 3 weeks ago

Hi @HumanMadeAgents,

Thanks for the issue.

The current evaluation method was introduced in #725, where we stated that

In the example you provided, in the turn that is supposed to have missing information, the model makes some exploratory function calls but doesn't call flag_task_unachievable at the end. This means the model didn't realize that the task cannot be completed in full, and it is thus marked wrong for irrelevance detection.

This is an important change and can heavily affect the model's scores. As you pointed out, we've also identified that the model might be overly penalized under this setup. Many models tend to call this function frequently, even in base entries where it may not be appropriate. For instance, when a model encounters an execution error after an incorrect action, it might call flag_task_unachievable, assuming the task is unachievable. Without this flag, the model would sometimes continue to explore and arrive at the correct action sequence. So after careful consideration, the flag_task_unachievable restriction has been removed in #733.
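To make the old restriction concrete, here is a simplified sketch (not the actual BFCL evaluator code): before #733, a miss-param turn only passed the irrelevance check if the model's final call in that turn was flag_task_unachievable.

def ends_with_unachievable_flag(turn_calls):
    """Pre-#733 rule (simplified): the miss-param turn must end with flag_task_unachievable."""
    return bool(turn_calls) and turn_calls[-1].startswith("flag_task_unachievable")

exploratory_turn = ["check_tire_pressure()", "find_nearest_tire_shop()"]
print(ends_with_unachievable_flag(exploratory_turn))  # False -> marked wrong under the pre-#733 rule

With #733, this flag requirement is dropped, so exploratory calls in such a turn no longer fail on this check alone.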

Would love to hear your thoughts.