bri25yu opened 11 months ago
I've examined a few more of these functions, some from the beginning and some from the end. It seems that the `model_answer` property is not very accurate? Please correct my ground truths if I'm wrong.
There were a few questions at the end that had no possible correct answer. For example, the last question is "Identify the taxonomic classification of a given organism." and the given function is `taxonomy.identify_classification`, which accepts an `organism_name` str argument. There is no correct `organism_name` that could be used as the argument given the input question, and certainly not the `model_answer` value of `taxonomy.identify_classification(organism_name="Homo sapiens")`.
Overall, the `model_answer` seems to get about 46% exact-match accuracy (just doing `str1 == str2`) and the `gorilla-openfunctions-v0` model gets around 29%. Some correct `gorilla-openfunctions-v0` calls are being marked as incorrect because the model does not provide an argument name, e.g. the `model_answer` is `currency.find(country="Brazil")` while the `gorilla-openfunctions-v0` answer is `currency.find("Brazil")`. I'm not sure whether these should be marked as incorrect or not. There are two such examples (counting them as correct bumps the model's accuracy up to 36%).
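For what it's worth, a looser comparison than exact string match could parse both calls and compare the function name plus the argument values, so that `currency.find(country="Brazil")` and `currency.find("Brazil")` count as equal. A rough sketch of what I mean (the helper names and matching rules here are my own assumptions, not anything from the repo):

import ast

def normalize_call(call_str):
    """Parse a call string into (dotted function name, positional values, keyword values)."""
    node = ast.parse(call_str, mode="eval").body
    assert isinstance(node, ast.Call)
    # Rebuild the dotted function name, e.g. "currency.find".
    parts, target = [], node.func
    while isinstance(target, ast.Attribute):
        parts.append(target.attr)
        target = target.value
    if isinstance(target, ast.Name):
        parts.append(target.id)
    name = ".".join(reversed(parts))
    positional = [ast.literal_eval(arg) for arg in node.args]
    keywords = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, positional, keywords

def calls_roughly_match(a, b):
    """Count two calls as equal if the function names agree and the combined argument
    values agree, regardless of whether they were passed positionally or by keyword."""
    try:
        name_a, pos_a, kw_a = normalize_call(a)
        name_b, pos_b, kw_b = normalize_call(b)
    except (SyntaxError, ValueError, AssertionError):
        return a == b  # fall back to exact string comparison
    values_a = sorted(map(repr, pos_a + list(kw_a.values())))
    values_b = sorted(map(repr, pos_b + list(kw_b.values())))
    return name_a == name_b and values_a == values_b

This would treat the two Brazil-style examples above as correct while still penalizing wrong argument values.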
Please see the terminal output, Python script, and output file below!
Terminal output
Using model `gorilla-openfunctions-v0`
Inspecting 12 examples from the start and 16 from the end of the test set
In total, the `model_answer` property has an accuracy of 46.429%
In total, the `gorilla-openfunctions-v0` model has an accuracy of 28.571%
Python script
from os import getenv
import requests
import openai
"""
Mandate correct `openai` library version
See https://github.com/ShishirPatil/gorilla/blob/a0b0d3174c581d18d615e3dc63935483e9d45da7/openfunctions/README.md?plain=1#L24
"""
assert openai.__version__ == "0.28.1"
gorilla_openfunctions_test_json_url = "https://raw.githubusercontent.com/ShishirPatil/gorilla/a0b0d3174c581d18d615e3dc63935483e9d45da7/openfunctions/gorilla_openfunctions_test.json"
test_functions = requests.get(gorilla_openfunctions_test_json_url).json()
human_responses_from_start = [
"coffee_shop.find_nearby(location=\"San Francisco\", amenities=[\"Wi-Fi\"])",
"flight.book(origin=\"Los Angeles\", destination=\"New York\", passengers=2, date=\"2023-06-15\")",
"restaurant.book_table(restaurant=\"Italiano's\", location=\"Manhattan, New York\", party_size=2, reservation_time=\"2023-10-23 19:00:00:000\")",
"weather.forecast(location=\"Paris, France\", days=5)",
"pharmacy.find_nearby(location=\"San Diego, California\", feature=[\"24/7\", \"Drive-thru\"])",
"target.order(loc=\"Berkeley\",item=[\"potato\", \"chocolate\"], quantity=[6,8])",
"classes.find(school='Berkeley', discipline='computer science', level=['lower division', 'upper division'], status='open')",
"gas_station.find_nearby(location=\"current\")",
"restaurant.find_top_rated(location=\"San Francisco\", amenities=[\"Outdoor Seating\"])",
"timezone.get_current_time(timezone=\"America/New_York\")",
"lyrics.find(song=\"Shape of You\", artist=\"Ed Sheeran\")",
"news.get_headlines(source=\"CNN\")",
]
human_responses_from_end = [
"NO ANSWER POSSIBLE",
"NO ANSWER POSSIBLE",
"NO ANSWER POSSIBLE",
"NO ANSWER POSSIBLE",
"NO ANSWER POSSIBLE",
"NO ANSWER POSSIBLE",
"NO ANSWER POSSIBLE",
"NO ANSWER POSSIBLE",
"NO ANSWER POSSIBLE",
"plant.get_scientific_name(common_name=\"rose\")",
"language.find_official(country=\"China\")",
"waterfall.find_highest()",
"currency.find(country=\"Brazil\")",
"temperature.find_average(location=\"Antarctica\")",
"area.calculate(location=\"Great Barrier Reef\")",
"timezone.find(location=\"New York City\")",
]
"""
Copied directly from https://github.com/ShishirPatil/gorilla/blob/a0b0d3174c581d18d615e3dc63935483e9d45da7/openfunctions/README.md?plain=1#L30-L44
with the exception of setting a non-`EMPTY` api key.
"""
def get_gorilla_response(
    prompt='Call me an Uber ride type "Plus" in Berkeley at zipcode 94704 in 10 minutes',
    model="gorilla-openfunctions-v0",
    functions=[],
):
    openai.api_key = getenv("OPENAI_API_KEY")
    openai.api_base = "http://luigi.millennium.berkeley.edu:8000/v1"
    try:
        completion = openai.ChatCompletion.create(
            model=model,
            temperature=0.0,
            messages=[{"role": "user", "content": prompt}],
            functions=functions,
        )
        return completion.choices[0].message.content
    except Exception as e:
        print(e, model, prompt)
model = "gorilla-openfunctions-v0"
print(f"Using model `{model}`")
num_from_start = len(human_responses_from_start)
num_from_end = len(human_responses_from_end)
print(f"Inspecting {num_from_start} examples from the start and {num_from_end} from the end of the test set")
functions_to_test = test_functions[:num_from_start] + test_functions[-num_from_end:][::-1]
human_responses = human_responses_from_start + human_responses_from_end
num_model_answer_correct = 0
num_model_correct = 0
total = len(human_responses)
output_file = open("gorilla_openfunctions_output.txt", "w")
for function_dict, human_response in zip(functions_to_test, human_responses):
    question = function_dict["question"]
    functions = [function_dict["function"]]
    model_answer = function_dict["model_answer"]
    response = get_gorilla_response(prompt=question, model=model, functions=functions)
    is_model_answer_correct = model_answer == human_response
    num_model_answer_correct += is_model_answer_correct
    is_model_correct = response == human_response
    num_model_correct += is_model_correct
    output_file.write(f"""
{'-' * 80}
Model answer in `gorilla_openfunctions_test.json`:\n\t`{model_answer}`
Actual model answer:\n\t`{response}`
Correct answer by a human:\n\t`{human_response}`
Is `model_answer` property correct? {is_model_answer_correct}
Is `{model}` correct? {is_model_correct}
""")
output_file.close()
model_answer_accuracy = 100 * num_model_answer_correct / total
model_accuracy = 100 * num_model_correct / total
print(f"In total, the `model_answer` property has an accuracy of {model_answer_accuracy:.3f}%")
print(f"In total, the `{model}` model has an accuracy of {model_accuracy:.3f}%")
Hey Brian, nice to meet you, and thank you very much for your attention to our work and for reproducing the results!
Our training data is generated from a few-shot example of {Instruction, functions, model_answer} and a prompt that leads GPT to generate data in the same format across various domains. This explains your observation that Gorilla OpenFunctions' answers are better than the generated `model_answer` when compared against human evaluation! Our evaluation is done by eye-examining the parameters to see whether sufficient information is extracted.
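For illustration, each record in the test JSON exposes at least `question`, `function`, and `model_answer` fields (the same fields your script reads); the concrete values and the exact shape of `function` in the sketch below are illustrative assumptions, not copied from the dataset.

example_record = {
    "question": "What is the currency used in Brazil?",
    "function": {
        "name": "currency.find",
        "description": "Find the official currency of a country.",
        "parameters": {
            "type": "object",
            "properties": {"country": {"type": "string"}},
            "required": ["country"],
        },
    },
    "model_answer": 'currency.find(country="Brazil")',
}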
With that being said, we would love to have you contribute your evaluation methodology and evaluation data to make the benchmarking more robust. Exact matching also provides some interesting results, as it demonstrates how robustly the model follows the data formatting given in the API documentation.
Thanks for raising this issue again!
Hi Fanjia!
Gotcha, thanks for the response! Sorry, I'm still a little confused. What ground truth is used for the OpenFunctions test dataset if `model_answer` is not the reference? Specifically, how were the numbers in the blog post produced? There are 9 questions out of 116 in the OpenFunctions test set with no answer possible, so the maximum score is 107/116, roughly 92%. How did GPT-4 get 95%?
Cheers, Brian
Hey Brian, thank you for bumping this issue.
Hi Fanjia!
Gotcha, just to confirm: I should not evaluate using `test.json` because the ground truths may be wrong? If so, how should I reproduce the evaluation numbers in the blog post?
Cheers, Brian
Came here to ask the same thing. The "test" set has no ground-truth labels; the `model_answer` values do not match. Example:
Hi Gorilla authors!
Congrats on the new Gorilla OpenFunctions model. Super cool stuff :)
Just a quick question about the OpenFunctions test set found at https://raw.githubusercontent.com/ShishirPatil/gorilla/a0b0d3174c581d18d615e3dc63935483e9d45da7/openfunctions/gorilla_openfunctions_test.json. It seems like some of the ground truths aren't correct, at least according to the `model_answer` property in the JSON itself. When you run the following script on the test inputs, the Gorilla model actually outputs better generations than the included `model_answer` property. See the script and output below! Can you please help resolve this test set inconsistency? Which is the ground truth answer in the JSON? Is it `model_answer`?
Setup
Output