ShishirPatil / gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
https://gorilla.cs.berkeley.edu/
Apache License 2.0

OpenFunctions evaluation dataset inconsistencies #143

Open bri25yu opened 11 months ago

bri25yu commented 11 months ago

Hi Gorilla authors!

Congrats on the new Gorilla OpenFunctions model. Super cool stuff :)

Just a quick question about the OpenFunctions test set found at https://raw.githubusercontent.com/ShishirPatil/gorilla/a0b0d3174c581d18d615e3dc63935483e9d45da7/openfunctions/gorilla_openfunctions_test.json. Some of the ground truths don't seem to be correct, at least if the `model_answer` property in the JSON itself is the reference. When I run the following script on the test inputs, the Gorilla model actually produces better generations than the included `model_answer` values. See the script and output below!

Can you please help resolve this test set inconsistency? Which field in the JSON is the ground truth answer? Is it `model_answer`?

Setup

export OPENAI_API_KEY=
from os import getenv

import requests

import openai

"""
Mandate correct `openai` library version
See https://github.com/ShishirPatil/gorilla/blob/a0b0d3174c581d18d615e3dc63935483e9d45da7/openfunctions/README.md?plain=1#L24
"""
assert openai.__version__ == "0.28.1"

gorilla_openfunctions_test_json_url = "https://raw.githubusercontent.com/ShishirPatil/gorilla/a0b0d3174c581d18d615e3dc63935483e9d45da7/openfunctions/gorilla_openfunctions_test.json"
test_functions = requests.get(gorilla_openfunctions_test_json_url).json()

human_responses = [
    'coffee_shop.find_nearby(location="San Francisco", amenities=["Wi-Fi"])',
    'flight.book(origin="Los Angeles", destination="New York", passengers=2, date="2023-06-15")',
]

"""
Copied directly from https://github.com/ShishirPatil/gorilla/blob/a0b0d3174c581d18d615e3dc63935483e9d45da7/openfunctions/README.md?plain=1#L30-L44
with the exception of setting a non-`EMPTY` api key.
"""

def get_gorilla_response(
    prompt='Call me an Uber ride type "Plus" in Berkeley at zipcode 94704 in 10 minutes',
    model="gorilla-openfunctions-v0",
    functions=[],
):
    openai.api_key = getenv("OPENAI_API_KEY")
    openai.api_base = "http://luigi.millennium.berkeley.edu:8000/v1"
    try:
        completion = openai.ChatCompletion.create(
            model=model,
            temperature=0.0,
            messages=[{"role": "user", "content": prompt}],
            functions=functions,
        )
        return completion.choices[0].message.content
    except Exception as e:
        print(e, model, prompt)

model = "gorilla-openfunctions-v0"
print(f"Using model `{model}`")

for function_dict, human_response in zip(test_functions, human_responses):
    question = function_dict["question"]
    functions = [function_dict["function"]]
    model_answer = function_dict["model_answer"]

    response = get_gorilla_response(prompt=question, model=model, functions=functions)

    print(f"\n{'-' * 80}\n")
    print(f"Model answer in `gorilla_openfunctions_test.json`:\n\t`{model_answer}`")
    print(f"Actual model answer:\n\t`{response}`")
    print(f"Correct answer by a human:\n\t`{human_response}`")

    is_model_answer_correct = model_answer == human_response
    print(f"Is `model_answer` correct? {is_model_answer_correct}")

    is_model_correct = response == human_response
    print(f"Is `{model}` correct? {is_model_correct}")

Output

Using model `gorilla-openfunctions-v0`

--------------------------------------------------------------------------------

Model answer in `gorilla_openfunctions_test.json`:
        `coffee_shop.find_nearby(location="San Francisco", amenities="Wi-Fi")`
Actual model answer:
        `coffee_shop.find_nearby(location="San Francisco", amenities=["Wi-Fi"])`
Correct answer by a human:
        `coffee_shop.find_nearby(location="San Francisco", amenities=["Wi-Fi"])`
Is `model_answer` correct? False
Is `gorilla-openfunctions-v0` correct? True

--------------------------------------------------------------------------------

Model answer in `gorilla_openfunctions_test.json`:
        `flight.book(origin="Los Angeles", destination="New York", passengers=2, date="June 15th")`
Actual model answer:
        `flight.book(origin="Los Angeles", destination="New York", passengers=2, date="2022-06-15")`
Correct answer by a human:
        `flight.book(origin="Los Angeles", destination="New York", passengers=2, date="2023-06-15")`
Is `model_answer` correct? False
Is `gorilla-openfunctions-v0` correct? False
bri25yu commented 11 months ago

I've examined a few more of these examples, some from the beginning of the test set and some from the end. It seems that the `model_answer` property is not very accurate. Please correct my ground truths if I'm wrong.

There were a few questions at the end that had no possible correct answer. For example, the last question is "Identify the taxonomic classification of a given organism." and the given function is `taxonomy.identify_classification`, which accepts an `organism_name` str argument. The question names no organism, so there is no correct `organism_name` to pass as the argument, and certainly not the `model_answer` value of `taxonomy.identify_classification(organism_name="Homo sapiens")`.

Overall, the `model_answer` property gets about 46% exact match accuracy (just doing `str1 == str2`) and the `gorilla-openfunctions-v0` model gets around 29%. Some correct `gorilla-openfunctions-v0` calls are marked as incorrect only because the model omits an argument name, e.g. `model_answer` is `currency.find(country="Brazil")` while the `gorilla-openfunctions-v0` answer is `currency.find("Brazil")`. I'm not sure whether these should count as incorrect. There are two such examples (counting them as correct bumps the model's accuracy up to 36%).
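
One way to make the exact-match comparison less brittle is sketched below. This is only a suggestion on my end, not anything from the repo: it parses both call strings with Python's `ast` module (`ast.unparse` needs Python 3.9+) and compares them structurally, so that a lone positional argument like `currency.find("Brazil")` can be counted as equivalent to `currency.find(country="Brazil")`. The `parse_call` and `calls_match` helpers are hypothetical names I made up, and the sketch assumes every answer is a syntactically valid Python call.

import ast

def parse_call(call_str):
    """Parse a call string like currency.find(country="Brazil") into
    (dotted function name, positional args, keyword args)."""
    node = ast.parse(call_str, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError(f"Not a function call: {call_str!r}")
    name = ast.unparse(node.func)  # e.g. "currency.find"
    args = [ast.literal_eval(a) for a in node.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, args, kwargs

def calls_match(prediction, reference):
    """Compare two call strings structurally instead of as raw text."""
    try:
        pred_name, pred_args, pred_kwargs = parse_call(prediction)
        ref_name, ref_args, ref_kwargs = parse_call(reference)
    except (SyntaxError, ValueError, TypeError):
        # Fall back to plain string comparison if either side doesn't parse.
        return prediction == reference
    if pred_name != ref_name:
        return False
    # Treat a lone positional argument as equivalent to the other side's
    # lone keyword argument (the currency.find("Brazil") case above).
    if len(pred_args) == 1 and not pred_kwargs and len(ref_kwargs) == 1 and not ref_args:
        return pred_args[0] == next(iter(ref_kwargs.values()))
    if len(ref_args) == 1 and not ref_kwargs and len(pred_kwargs) == 1 and not pred_args:
        return ref_args[0] == next(iter(pred_kwargs.values()))
    return pred_args == ref_args and pred_kwargs == ref_kwargs

# The positional/keyword mismatch is now counted as a match...
assert calls_match('currency.find("Brazil")', 'currency.find(country="Brazil")')
# ...while genuinely different arguments still fail.
assert not calls_match(
    'coffee_shop.find_nearby(location="San Francisco", amenities="Wi-Fi")',
    'coffee_shop.find_nearby(location="San Francisco", amenities=["Wi-Fi"])',
)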

Please see the terminal output, python script, and output file below!

Terminal output

Using model `gorilla-openfunctions-v0`
Inspecting 12 examples from the start and 16 from the end of the test set
In total, the `model_answer` property has an accuracy of 46.429%
In total, the `gorilla-openfunctions-v0` model has an accuracy of 28.571%

Python script

from os import getenv

import requests

import openai

"""
Mandate correct `openai` library version
See https://github.com/ShishirPatil/gorilla/blob/a0b0d3174c581d18d615e3dc63935483e9d45da7/openfunctions/README.md?plain=1#L24
"""
assert openai.__version__ == "0.28.1"

gorilla_openfunctions_test_json_url = "https://raw.githubusercontent.com/ShishirPatil/gorilla/a0b0d3174c581d18d615e3dc63935483e9d45da7/openfunctions/gorilla_openfunctions_test.json"
test_functions = requests.get(gorilla_openfunctions_test_json_url).json()

human_responses_from_start = [
    "coffee_shop.find_nearby(location=\"San Francisco\", amenities=[\"Wi-Fi\"])",
    "flight.book(origin=\"Los Angeles\", destination=\"New York\", passengers=2, date=\"2023-06-15\")",
    "restaurant.book_table(restaurant=\"Italiano's\", location=\"Manhattan, New York\", party_size=2, reservation_time=\"2023-10-23 19:00:00:000\")",
    "weather.forecast(location=\"Paris, France\", days=5)",
    "pharmacy.find_nearby(location=\"San Diego, California\", feature=[\"24/7\", \"Drive-thru\"])",
    "target.order(loc=\"Berkeley\",item=[\"potato\", \"chocolate\"], quantity=[6,8])",
    "classes.find(school='Berkeley', discipline='computer science', level=['lower division', 'upper division'], status='open')",
    "gas_station.find_nearby(location=\"current\")",
    "restaurant.find_top_rated(location=\"San Francisco\", amenities=[\"Outdoor Seating\"])",
    "timezone.get_current_time(timezone=\"America/New_York\")",
    "lyrics.find(song=\"Shape of You\", artist=\"Ed Sheeran\")",
    "news.get_headlines(source=\"CNN\")",
]
human_responses_from_end = [
    "NO ANSWER POSSIBLE",
    "NO ANSWER POSSIBLE",
    "NO ANSWER POSSIBLE",
    "NO ANSWER POSSIBLE",
    "NO ANSWER POSSIBLE",
    "NO ANSWER POSSIBLE",
    "NO ANSWER POSSIBLE",
    "NO ANSWER POSSIBLE",
    "NO ANSWER POSSIBLE",
    "plant.get_scientific_name(common_name=\"rose\")",
    "language.find_official(country=\"China\")",
    "waterfall.find_highest()",
    "currency.find(country=\"Brazil\")",
    "temperature.find_average(location=\"Antarctica\")",
    "area.calculate(location=\"Great Barrier Reef\")",
    "timezone.find(location=\"New York City\")",
]

"""
Copied directly from https://github.com/ShishirPatil/gorilla/blob/a0b0d3174c581d18d615e3dc63935483e9d45da7/openfunctions/README.md?plain=1#L30-L44
with the exception of setting a non-`EMPTY` api key.
"""

def get_gorilla_response(
    prompt='Call me an Uber ride type "Plus" in Berkeley at zipcode 94704 in 10 minutes',
    model="gorilla-openfunctions-v0",
    functions=[],
):
    openai.api_key = getenv("OPENAI_API_KEY")
    openai.api_base = "http://luigi.millennium.berkeley.edu:8000/v1"
    try:
        completion = openai.ChatCompletion.create(
            model=model,
            temperature=0.0,
            messages=[{"role": "user", "content": prompt}],
            functions=functions,
        )
        return completion.choices[0].message.content
    except Exception as e:
        print(e, model, prompt)

model = "gorilla-openfunctions-v0"
print(f"Using model `{model}`")

num_from_start = len(human_responses_from_start)
num_from_end = len(human_responses_from_end)
print(f"Inspecting {num_from_start} examples from the start and {num_from_end} from the end of the test set")
functions_to_test = test_functions[:num_from_start] + test_functions[-num_from_end:][::-1]
human_responses = human_responses_from_start + human_responses_from_end

num_model_answer_correct = 0
num_model_correct = 0
total = len(human_responses)
output_file = open("gorilla_openfunctions_output.txt", "w")
for function_dict, human_response in zip(functions_to_test, human_responses):
    question = function_dict["question"]
    functions = [function_dict["function"]]
    model_answer = function_dict["model_answer"]

    response = get_gorilla_response(prompt=question, model=model, functions=functions)

    is_model_answer_correct = model_answer == human_response
    num_model_answer_correct += is_model_answer_correct

    is_model_correct = response == human_response
    num_model_correct += is_model_correct

    output_file.write(f"""
{'-' * 80}
Model answer in `gorilla_openfunctions_test.json`:\n\t`{model_answer}`
Actual model answer:\n\t`{response}`
Correct answer by a human:\n\t`{human_response}`
Is `model_answer` property correct? {is_model_answer_correct}
Is `{model}` correct? {is_model_correct}
""")

output_file.close()

model_answer_accuracy = 100 * num_model_answer_correct / total
model_accuracy = 100 * num_model_correct / total
print(f"In total, the `model_answer` property has an accuracy of {model_answer_accuracy:.3f}%")
print(f"In total, the `{model}` model has an accuracy of {model_accuracy:.3f}%")

gorilla_openfunctions_output.txt

Fanjia-Yan commented 11 months ago

Hey Brian, nice to meet you, and thank you very much for your attention to our work and for reproducing the results!

Our training data is generated from a few-shot example of {Instruction, functions, model_answer} plus a prompt that leads GPT to generate data in the same format across various domains. This explains your observation that Gorilla OpenFunctions' answers can be better than the generated model_answer when judged against human evaluation! Our evaluation is done by eye-examining the parameters and checking whether sufficient information is extracted.
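
To make that pipeline a bit more concrete, here is a rough illustrative sketch of what such a few-shot generation prompt can look like. The example content, wording, and the build_generation_prompt helper are all made up for illustration; this is not the actual prompt we used.

import json

# Illustrative only: the actual few-shot example and prompt wording used to
# generate the data are not reproduced here.
FEW_SHOT_EXAMPLE = {
    "Instruction": "Find me the highest rated Italian restaurant in downtown Austin.",
    "functions": [{
        "name": "restaurant.find_top_rated",
        "description": "Find the highest rated restaurant for a cuisine in a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "cuisine": {"type": "string"},
                "location": {"type": "string"},
            },
            "required": ["cuisine", "location"],
        },
    }],
    "model_answer": 'restaurant.find_top_rated(cuisine="Italian", location="downtown Austin")',
}

def build_generation_prompt(domain):
    """Ask GPT to emit new {Instruction, functions, model_answer} triples in the
    same JSON format as the few-shot example, for a given domain."""
    return (
        "Here is an example of an instruction, a function definition, and the "
        "corresponding function call:\n"
        + json.dumps(FEW_SHOT_EXAMPLE, indent=2)
        + f"\n\nGenerate 5 new examples in exactly the same JSON format for the domain: {domain}."
    )

print(build_generation_prompt("travel booking"))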

With that being said, we would love to have you contribute your evaluation methodology and evaluation data to make the benchmarking more robust. Exact matching also provides some interesting results, as it demonstrates how robustly a model follows the data formatting given in the API documentation.

Thanks again for raising this issue!

bri25yu commented 11 months ago

Hi Fanjia!

Gotcha, thanks for the response! Sorry, I'm still a little confused. Where is the ground truth for the OpenFunctions test dataset if `model_answer` is not the reference? Specifically, how were the numbers in the blog post produced? There are 9 questions out of 116 in the OpenFunctions test set with no answer possible, so the maximum score is 107/116 ≈ 92%. How did GPT-4 get 95%?

Cheers, Brian

Fanjia-Yan commented 11 months ago

Hey Brian, thank you for following up on this issue.

  1. To answer your question about "no answer possible": yes, our existing training data have questions that are missing some required parameters. From our current benchmarking, we observe two behaviors: 1) chat models typically fill the required parameters with some default or placeholder value, and Gorilla OpenFunctions does this as well; 2) function-calling models typically ask a follow-up question to get user input for the required parameters and ensure a successful function call, which results in a status we call "incomplete". In our accuracy measurement, we count both "success" and "incomplete" toward accuracy (see the sketch after this list). Our rationale is that the model has successfully identified the missing parameters, and as long as we fill them in, it will produce a satisfactory result. As to how success is judged, we perform manual inspection of whether the model successfully extracts the information from the instruction as parameters.
  2. We are planning to launch a leaderboard as a next step, which will perform more exhaustive benchmarking with more data. Your feedback and evaluation are important to us!
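
As a minimal sketch of that accuracy definition, each example is first labeled by manual inspection and then both "success" and "incomplete" count toward accuracy. The status labels and counts below are illustrative, not our actual benchmark numbers.

from collections import Counter

# Hypothetical per-example statuses assigned by manual inspection.
statuses = ["success", "incomplete", "failure", "success", "incomplete"]

counts = Counter(statuses)
# Both "success" and "incomplete" count toward accuracy, per the rationale above.
accuracy = (counts["success"] + counts["incomplete"]) / len(statuses)
print(f"accuracy = {accuracy:.1%}")  # 80.0% for this illustrative list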
bri25yu commented 11 months ago

Hi Fanjia!

Gotcha, just to confirm -- I should not evaluate using test.json because the ground truths may be wrong? If so, how should I reproduce the evaluation numbers in the blog post?

Cheers, Brian

abacaj commented 11 months ago

Came here to ask the same thing. The "test" set has no ground truth labels; they do not match. Example:

[image]

another feature