ShishirPatil / gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
https://gorilla.cs.berkeley.edu/
Apache License 2.0

Leaderboard evaluations issues #363

Closed — danieljannai21 closed this issue 4 months ago

danieljannai21 commented 4 months ago

Hi,

First of all, I really appreciate your work!

I've been trying to run the leaderboard evaluation code and noticed several issues. I can probably fix some of them myself in a PR, but I wanted to hear your thoughts first.

  1. Specificity of argument descriptions. In several functions, I noticed that the parameter descriptions are not explicit enough, which sometimes results in the evaluated models outputting the "wrong" parameters. A few examples are below.

In this function, it isn't mentioned that interest_rate should be a decimal fraction rather than a percentage (so models may output 3.5 instead of 0.035, for example):

def mortgage_calculator(loan_amount, interest_rate, loan_period):
    """
    Calculates the monthly mortgage payment.
    Args:
        loan_amount (integer): The amount of the loan.
        interest_rate (integer): The interest rate of the loan.
        loan_period (integer): The period of the loan.
    """
    monthly_interest_rate = interest_rate / 12
    number_of_payments = loan_period * 12
    monthly_payment = (
        loan_amount
        * monthly_interest_rate
        * (1 + monthly_interest_rate) ** number_of_payments
        / ((1 + monthly_interest_rate) ** number_of_payments - 1)
    )
    return monthly_payment
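To see how costly that ambiguity is, here is a quick sanity check (the loan values are hypothetical) comparing the two plausible readings of interest_rate against the function above:

```python
def mortgage_calculator(loan_amount, interest_rate, loan_period):
    # Standard amortization formula, as in the benchmark function above.
    monthly_interest_rate = interest_rate / 12
    number_of_payments = loan_period * 12
    return (
        loan_amount
        * monthly_interest_rate
        * (1 + monthly_interest_rate) ** number_of_payments
        / ((1 + monthly_interest_rate) ** number_of_payments - 1)
    )

# A $300,000 loan at 3.5% over 30 years, under the two readings:
as_fraction = mortgage_calculator(300_000, 0.035, 30)  # rate as a decimal fraction
as_percent = mortgage_calculator(300_000, 3.5, 30)     # rate as "number of percents"

print(round(as_fraction, 2))  # a plausible monthly payment (~1347)
print(round(as_percent, 2))   # an absurd one (~87500)
```

Without an explicit description, both calls are defensible readings of the doc, yet the results differ by almost two orders of magnitude.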

In this example, it isn't mentioned that the stock_name should be a ticker symbol and not the company's name:

def get_company_name_by_stock_name(stock_name):
    """
    Finds the company name of a stock by its stock name.
    Args:
        stock_name (str): The stock name of the product.
    """
    url = "https://yahoo-finance15.p.rapidapi.com/api/v1/markets/search"

    querystring = {"search": stock_name}

    headers = {
        "X-RapidAPI-Key": api_key["RAPID-API-KEY"],
        "X-RapidAPI-Host": "yahoo-finance15.p.rapidapi.com",
    }

    response = requests.get(url, headers=headers, params=querystring)
    try:
        return response.json()["body"][0]["name"]
    except:
        return response.json()
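A more explicit parameter description, along the lines these examples call for, might look like this (a hypothetical rewording, not the benchmark's actual text):

```python
def get_company_name_by_stock_name(stock_name):
    """
    Finds the company name of a stock by its ticker symbol.

    Args:
        stock_name (str): The stock's ticker symbol, e.g. "AAPL".
            Do not pass the company's name (e.g. "Apple Inc.").
    """
    ...  # lookup logic unchanged
```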
  2. Hard typing constraints. For example, one of the questions in gorilla_openfunctions_v1_test_executable_multiple_function.json is As a data analyst, you are working on a project that requires you to organize a set of numerical data. Can you sort these numbers 34, 2, 56, 7, 9, 12 in descending order?, where the description of the relevant function is:

    {
      "name": "sort_array",
      "description": "Sorts an array of numbers.",
      "parameters": {
        "type": "dict",
        "properties": {
          "array": {
            "type": "array",
            "items": {
              "type": "float"
            },
            "description": "The array of numbers."
          },
          "reverse": {
            "type": "boolean",
            "description": "Whether to sort the array in reverse order.",
            "default": false
          }
        },
        "required": [
          "array"
        ]
      }
    }

    Since the given numbers are all integers, there shouldn't be a problem with sending them to the function as ints rather than floats. Generally speaking, I think typing constraints on floats should also accept ints, as long as their values are correct.
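The relaxation suggested here could be implemented in a checker along these lines (a sketch of the idea, not the leaderboard's actual validation code; matches_type is a hypothetical helper):

```python
def matches_type(value, expected_type: str) -> bool:
    """Check a model-produced argument against a JSON-schema-style type,
    accepting ints wherever floats are expected (every int is a valid float)."""
    # bool is a subclass of int in Python, so rule it out explicitly first.
    if isinstance(value, bool):
        return expected_type == "boolean"
    if expected_type == "float":
        return isinstance(value, (int, float))
    if expected_type == "integer":
        return isinstance(value, int)
    return False  # other types omitted in this sketch

# ints now pass where floats are required, but not the other way around:
print(matches_type(34, "float"))     # True
print(matches_type(3.5, "integer"))  # False
```

Note the check is deliberately one-directional: 34 is an exact float value, but 3.5 is not a valid integer.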

  3. Several incorrect "gold" answers. For example, in gorilla_openfunctions_v1_test_executable_simple.json, one of the questions is Book a king room of 10000 dollar from Dec.11,2023, to Aug.15,2024, with customer id 123., while the gold answer states that the value of total_price should be 1000 rather than 10000.

I also found a few cases where the list of values in "possible_answer" wasn't exhaustive enough, but I can't find a specific example right now.

Thanks in advance!

HuanzhiMao commented 4 months ago

Hi @danieljannai21,

Thank you for your attention and thanks for flagging this! Sorry for the delayed reply; we were a bit busy recently :/

Best, BFCL Team

danieljannai21 commented 4 months ago

Hi @HuanzhiMao, Thank you for your answer!

HuanzhiMao commented 4 months ago

Hi @danieljannai21 ,

Good question. Let me give you an example where type matters. Let's say the function doc asks for an integer-type argument x, and the source code of that function is defined as:

def func(x: int):
    for i in range(1, 100 // x):
        print(i)

Here, if you follow the function doc and call func(5), it works fine. But if you call func(5.0), you get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in func
TypeError: 'float' object cannot be interpreted as an integer

Since the model doesn't have access to the function's source code, it cannot safely assume that it's okay to get the type wrong. Moreover, in statically typed languages like Java, the compiler will reject a float value where an int is expected; the code simply won't compile.

Regarding your second concern: thank you for pointing out how we format the prompts. We do observe questions starting with "How do I?" or similar, in which the user asks a question rather than giving a tool-use instruction. Here are 2 scenarios:

With the above, if a model outputs a full natural-language explanation, it is either not a user-friendly function-calling model or it is not following the prompt instructions carefully. We would love to rephrase the questions with more diverse tones to better simulate real-world tool usage, but that will likely come in a future release.

danieljannai21 commented 3 months ago

Thanks, @HuanzhiMao!

Thanks again for the wonderful work you guys are doing.