ShishirPatil / gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
https://gorilla.cs.berkeley.edu/
Apache License 2.0

Questions about the evaluation criteria. #513

Closed GeniusYx closed 1 month ago

GeniusYx commented 2 months ago

Hi, I have a few cases that are confusing:

All cases are from the simple Python category.

Case 1: I think the description of location only gives an example; the parameter is not an enum type.

{ "id": 159, "model_name": "NA", "test_category": "simple", "valid": false, "error": [ "Invalid value for parameter 'location': 'New York'. Expected one of ['New York, NY']. Case insensitive." ], "error_type": "value_error:string", "prompt": { "question": "Find the information of criminal cases of Mr. X in New York between 2012 and 2015.", "function": { "name": "get_criminal_records", "description": "Retrieve the criminal records of a specific person in a specific area during a certain time period.", "parameters": { "type": "dict", "properties": { "name": { "type": "string", "description": "The name of the person." }, "location": { "type": "string", "description": "The city and state, e.g. New York, NY" }, "from_year": { "type": "integer", "description": "The start year of the time frame." }, "to_year": { "type": "integer", "description": "The end year of the time frame." } }, "required": [ "name", "location", "from_year", "to_year" ] } } }, "model_result_raw": [ { "get_criminal_records": { "name": "Mr. X", "location": "New York", "from_year": 2012, "to_year": 2015 } } ], "model_result_decoded": [ { "get_criminal_records": { "name": "Mr. X", "location": "New York", "from_year": 2012, "to_year": 2015 } } ], "possible_answer": { "get_criminal_records": { "name": [ "Mr. X" ], "location": [ "New York, NY" ], "from_year": [ 2012 ], "to_year": [ 2015 ] } } }

Case 2: Same as Case 1.

{ "id": 167, "model_name": "NA", "test_category": "simple", "valid": false, "error": [ "Invalid value for parameter 'city': 'Chicago'. Expected one of ['Chicago, IL.', 'Chicago, IL']. Case insensitive." ], "error_type": "value_error:string", "prompt": { "question": "Find a lawyer specializing in divorce cases and charge fee less than 400 dollars per hour in Chicago.", "function": { "name": "lawyer.find_nearby", "description": "Locate nearby lawyers based on specific criteria like specialty, fee per hour and city.", "parameters": { "type": "dict", "properties": { "city": { "type": "string", "description": "The city and state, e.g. Chicago, IL." }, "specialty": { "type": "array", "items": { "type": "string", "enum": [ "Civil", "Divorce", "Immigration", "Business", "Criminal" ] }, "description": "Specialization of the lawyer." }, "fee": { "type": "integer", "description": "Hourly fee charged by lawyer" } }, "required": [ "city", "specialty", "fee" ] } } }, "model_result_raw": [ { "lawyer_find_nearby": { "city": "Chicago", "specialty": [ "Divorce" ], "fee": 400 } } ], "model_result_decoded": [ { "lawyer_find_nearby": { "city": "Chicago", "specialty": [ "Divorce" ], "fee": 400 } } ], "possible_answer": { "lawyer_find_nearby": { "city": [ "Chicago, IL.", "Chicago, IL" ], "specialty": [ [ "Divorce" ] ], "fee": [ 400 ] } } }

Case 3: Why does the date have to be rewritten into a different format?

{ "id": 383, "model_name": "NA", "test_category": "simple", "valid": false, "error": [ "Invalid value for parameter 'start_date': '10th December 2022'. Expected one of ['2022-12-10', '10/12/2022', 'Dec 10, 2022', 'December 10, 2022']. Case insensitive." ], "error_type": "value_error:string", "prompt": { "question": "Book a single room for two nights at the Hilton Hotel in Chicago, starting from 10th December 2022.", "function": { "name": "book_hotel", "description": "Book a room of specified type for a particular number of nights at a specific hotel, starting from a specified date.", "parameters": { "type": "dict", "properties": { "hotel_name": { "type": "string", "description": "The name of the hotel." }, "location": { "type": "string", "description": "The city in which the hotel is located." }, "room_type": { "type": "string", "description": "The type of room to be booked." }, "start_date": { "type": "string", "description": "The start date for the booking." }, "nights": { "type": "integer", "description": "The number of nights for which the booking is to be made." } }, "required": [ "hotel_name", "location", "room_type", "start_date", "nights" ] } } }, "model_result_raw": [ { "book_hotel": { "hotel_name": "Hilton Hotel", "location": "Chicago", "room_type": "single", "start_date": "10th December 2022", "nights": 2 } } ], "model_result_decoded": [ { "book_hotel": { "hotel_name": "Hilton Hotel", "location": "Chicago", "room_type": "single", "start_date": "10th December 2022", "nights": 2 } } ], "possible_answer": { "book_hotel": { "hotel_name": [ "Hilton Hotel", "Hilton" ], "location": [ "Chicago" ], "room_type": [ "single" ], "start_date": [ "2022-12-10", "10/12/2022", "Dec 10, 2022", "December 10, 2022" ], "nights": [ 2 ] } } }

Case 4: For optional parameters, are there criteria for when the model must infer a value?

{ "id": 152, "model_name": "NA", "test_category": "simple", "valid": false, "error": [ "Optional parameter 'top_n' not provided and not marked as optional." ], "error_type": "simple_function_checker:missing_optional", "prompt": { "question": "Find the highest grossing bank in the U.S for year 2020.", "function": { "name": "highest_grossing_banks", "description": "Retrieve the highest grossing banks in a specified country and year.", "parameters": { "type": "dict", "properties": { "country": { "type": "string", "description": "The country to get the data from." }, "year": { "type": "integer", "description": "The year to get the data from." }, "top_n": { "type": "integer", "description": "Top n banks in terms of grossing. Default is 5" } }, "required": [ "country", "year" ] } } }, "model_result_raw": [ { "highest_grossing_banks": { "country": "U.S.", "year": 2020 } } ], "model_result_decoded": [ { "highest_grossing_banks": { "country": "U.S.", "year": 2020 } } ], "possible_answer": { "highest_grossing_banks": { "country": [ "U.S", "United States", "USA", "U.S." ], "year": [ 2020 ], "top_n": [ 1 ] } } }

HuanzhiMao commented 1 month ago

Regarding your first and second cases:

Enum types are suitable when the number of available options is small and limited, such as a list of 10 distinct choices. While it's technically possible to create an enum listing all possible city, state pairings, this would make the function documentation excessively long and difficult to read.

In this case, the phrase "New York, NY" in the description ("The city and state, e.g. New York, NY") is intended to illustrate the expected format of the parameter. It indicates that the parameter should follow the city, state format.

For example, if the ground truth is "LA, CA" and the model outputs "Los Angeles, CA", this could still be considered correct because the description does not specify whether the city name should be abbreviated or not. We have honored similar situations in BFCL evaluations. However, outputs like "LA" or "Los Angeles" would be incorrect, as they miss the state part, which is necessary according to the function description.
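To make the failure in the first two cases concrete, here is a minimal sketch of a case-insensitive string check against a list of accepted answers. The function name `matches_possible_answer` is hypothetical and this is not the BFCL checker itself; it only mirrors the "Expected one of [...]. Case insensitive." behavior visible in the error messages, and it does not attempt the abbreviation handling ("LA, CA" versus "Los Angeles, CA") mentioned above.

```python
def matches_possible_answer(value: str, possible_answers: list[str]) -> bool:
    """Return True if value equals any accepted answer, ignoring case.

    Hypothetical helper for illustration only, not the BFCL implementation.
    """
    normalized = value.strip().casefold()
    return any(normalized == answer.strip().casefold() for answer in possible_answers)


# Case 1: the state part is missing, so nothing in the accepted list matches.
print(matches_possible_answer("New York", ["New York, NY"]))
# Same value in the city, state format, with different casing.
print(matches_possible_answer("new york, ny", ["New York, NY"]))
# Case 2: either accepted spelling is enough.
print(matches_possible_answer("Chicago, IL", ["Chicago, IL.", "Chicago, IL"]))
```

Under this kind of check, a value is only accepted when the ground-truth list anticipates it, which is why the format hinted at in the parameter description matters.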

Regarding your third case: For the string-type parameter start_date, we will consider it correct if it is one of the several widely-used date formats ('2022-12-10', '10/12/2022', 'Dec 10, 2022', 'December 10, 2022'). However, the format you provided '10th December 2022' is not among the commonly accepted formats.
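As an illustration of the gap between the two formats, the sketch below converts a date written with an ordinal day ("10th December 2022") into ISO form, which is one of the accepted answers for this test case. The helper `to_iso_date` is hypothetical and not part of BFCL, and it assumes English month names and a day-month-year word order.

```python
import re
from datetime import datetime


def to_iso_date(text: str) -> str:
    """Convert e.g. '10th December 2022' to '2022-12-10'.

    Hypothetical helper; assumes English month names and day-month-year order.
    """
    # Drop ordinal suffixes such as "10th" -> "10" before parsing.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    return datetime.strptime(cleaned, "%d %B %Y").date().isoformat()


# Accepted answers copied from the test case above.
accepted = ["2022-12-10", "10/12/2022", "Dec 10, 2022", "December 10, 2022"]
print(to_iso_date("10th December 2022") in accepted)
```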

Regarding your fourth case: There are no special criteria for optional parameters beyond standard coding practice. If the prompt implies that an optional parameter should take a non-default value, the model is expected to provide that value explicitly. Conversely, if the prompt does not mention the optional parameter, or the implied value matches its default, the model may either pass the default value or omit the parameter entirely.
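The rule for optional parameters can be sketched as follows. This is an illustrative approximation, not the actual checker: the function name `check_optional_param` is hypothetical, and the use of an empty string in the accepted list to mean "may be omitted" is an assumption made for this example.

```python
OMITTED = ""  # assumed sentinel: its presence means the parameter may be left out


def check_optional_param(param: str, model_args: dict, possible_answer: dict) -> bool:
    """Check one optional parameter against its list of accepted values.

    Hypothetical helper for illustration only.
    """
    allowed = possible_answer[param]
    if param not in model_args:
        # Omission is fine only when the ground truth explicitly allows it.
        return OMITTED in allowed
    return model_args[param] in allowed


model_args = {"country": "U.S.", "year": 2020}

# Case 4: "the highest grossing bank" implies top_n = 1, not the default of 5,
# so the ground truth does not allow the parameter to be omitted.
print(check_optional_param("top_n", model_args, {"top_n": [1]}))

# If the prompt had implied the default, omission would be acceptable.
print(check_optional_param("top_n", model_args, {"top_n": [5, OMITTED]}))
```

In Case 4 the question asks for a single bank, so the implied value differs from the documented default and the call without `top_n` is marked invalid.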

Let me know if you need further clarification on this.

GeniusYx commented 1 month ago

Thank you for your reply~

For the fourth case, here is another example. Can you explain why the detail parameter is different from the historical parameter? In my opinion, they are both optional parameters, because neither is in the required list.

{ "question": "Can you find me the latest information about air quality index and pollution data for Chicago?", "function": { "name": "get_air_quality", "description": "Retrieve real-time air quality and pollution data for a specific location.", "parameters": { "type": "dict", "properties": { "location": { "type": "string", "description": "The city that you want to get the air quality data for." }, "detail": { "type": "boolean", "description": "If true, additional data like PM2.5, PM10, ozone levels, and pollution sources will be retrieved. Default is false." }, "historical": { "type": "string", "description": "Optional date (in 'YYYY-MM-DD' format) to retrieve historical data.", "default": "today" } }, "required": [ "location" ] } } }

HuanzhiMao commented 1 month ago

Sorry, could you elaborate on what you mean when you ask why the detail parameter is different from the historical parameter? What difference are you referring to? They are indeed both optional parameters.

ShishirPatil commented 1 month ago

Closing this. @GeniusYx please feel free to re-open if you have follow-ups.