ShishirPatil / gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
https://gorilla.cs.berkeley.edu/
Apache License 2.0

Leaderboard evaluations issues #363

Closed — danieljannai21 closed this issue 4 months ago

danieljannai21 commented 4 months ago

Hi,

First of all, I really appreciate your work!

I've been trying to run the leaderboard evaluation code and noticed several issues. I can probably fix some of them myself in a PR, but I wanted to hear your thoughts first.

  1. Specificity of argument descriptions. In several functions, I noticed that the parameter descriptions are not explicit enough, which sometimes results in the evaluated models outputting the "wrong" parameters. A few examples are below.

In this function, it isn't mentioned that interest_rate should be a decimal fraction rather than a percentage (so models may output 3.5 instead of 0.035, for example):

def mortgage_calculator(loan_amount, interest_rate, loan_period):
    """
    Calculates the monthly mortgage payment.
    Args:
        loan_amount (integer): The amount of the loan.
        interest_rate (integer): The interest rate of the loan.
        loan_period (integer): The period of the loan.
    """
    monthly_interest_rate = interest_rate / 12
    number_of_payments = loan_period * 12
    monthly_payment = (
        loan_amount
        * monthly_interest_rate
        * (1 + monthly_interest_rate) ** number_of_payments
        / ((1 + monthly_interest_rate) ** number_of_payments - 1)
    )
    return monthly_payment
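To see how costly that ambiguity is, here is a quick sanity check (the loan values are hypothetical) comparing the two plausible readings of interest_rate against the function above:

```python
def mortgage_calculator(loan_amount, interest_rate, loan_period):
    # Standard amortization formula, as in the benchmark function above.
    monthly_interest_rate = interest_rate / 12
    number_of_payments = loan_period * 12
    return (
        loan_amount
        * monthly_interest_rate
        * (1 + monthly_interest_rate) ** number_of_payments
        / ((1 + monthly_interest_rate) ** number_of_payments - 1)
    )

# A $300,000 loan at 3.5% over 30 years, under the two readings:
as_fraction = mortgage_calculator(300_000, 0.035, 30)  # rate as a decimal fraction
as_percent = mortgage_calculator(300_000, 3.5, 30)     # rate as "number of percents"

print(round(as_fraction, 2))  # a plausible monthly payment (~1347)
print(round(as_percent, 2))   # an absurd one (~87500)
```

Without an explicit description, both calls are defensible readings of the doc, yet the results differ by almost two orders of magnitude.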

In this example, it isn't mentioned that the stock_name should be a ticker symbol and not the company's name:

def get_company_name_by_stock_name(stock_name):
    """
    Finds the company name of a stock by its stock name.
    Args:
        stock_name (str): The stock name of the product.
    """
    url = "https://yahoo-finance15.p.rapidapi.com/api/v1/markets/search"

    querystring = {"search": stock_name}

    headers = {
        "X-RapidAPI-Key": api_key["RAPID-API-KEY"],
        "X-RapidAPI-Host": "yahoo-finance15.p.rapidapi.com",
    }

    response = requests.get(url, headers=headers, params=querystring)
    try:
        return response.json()["body"][0]["name"]
    except:
        return response.json()
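A more explicit parameter description, along the lines these examples call for, might look like this (a hypothetical rewording, not the benchmark's actual text):

```python
def get_company_name_by_stock_name(stock_name):
    """
    Finds the company name of a stock by its ticker symbol.

    Args:
        stock_name (str): The stock's ticker symbol, e.g. "AAPL".
            Do not pass the company's name (e.g. "Apple Inc.").
    """
    ...  # lookup logic unchanged
```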
  2. Hard typing constraints. For example, one of the questions in gorilla_openfunctions_v1_test_executable_multiple_function.json is As a data analyst, you are working on a project that requires you to organize a set of numerical data. Can you sort these numbers 34, 2, 56, 7, 9, 12 in descending order?, where the description of the relevant function is:

    {
      "name": "sort_array",
      "description": "Sorts an array of numbers.",
      "parameters": {
        "type": "dict",
        "properties": {
          "array": {
            "type": "array",
            "items": {
              "type": "float"
            },
            "description": "The array of numbers."
          },
          "reverse": {
            "type": "boolean",
            "description": "Whether to sort the array in reverse order.",
            "default": false
          }
        },
        "required": [
          "array"
        ]
      }
    }

    Since the given numbers are all integers, there shouldn't be a problem with sending them to the function as ints rather than floats. Generally speaking, I think typing constraints on floats should also accept ints, as long as their values are correct.
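The relaxation suggested here could be implemented in a checker along these lines (a sketch of the idea, not the leaderboard's actual validation code; matches_type is a hypothetical helper):

```python
def matches_type(value, expected_type: str) -> bool:
    """Check a model-produced argument against a JSON-schema-style type,
    accepting ints wherever floats are expected (every int is a valid float)."""
    # bool is a subclass of int in Python, so rule it out explicitly first.
    if isinstance(value, bool):
        return expected_type == "boolean"
    if expected_type == "float":
        return isinstance(value, (int, float))
    if expected_type == "integer":
        return isinstance(value, int)
    return False  # other types omitted in this sketch

# ints now pass where floats are required, but not the other way around:
print(matches_type(34, "float"))     # True
print(matches_type(3.5, "integer"))  # False
```

Note the check is deliberately one-directional: 34 is an exact float value, but 3.5 is not a valid integer.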

  3. Several incorrect "gold" answers. For example, in gorilla_openfunctions_v1_test_executable_simple.json, one of the questions is Book a king room of 10000 dollar from Dec.11,2023, to Aug.15,2024, with customer id 123., while the gold answer states that the value of total_price should be 1000 rather than 10000.

I also found a few cases where the list of values in "possible_answer" wasn't exhaustive enough, but I can't find a specific example right now.

Thanks in advance!

HuanzhiMao commented 4 months ago

Hi @danieljannai21,

Thank you for your attention and thanks for flagging this! Sorry for the delayed reply; we were a bit busy recently :/

Best, BFCL Team

danieljannai21 commented 4 months ago

Hi @HuanzhiMao, Thank you for your answer!

HuanzhiMao commented 4 months ago

Hi @danieljannai21 ,

Good question. Let me give you an example where type matters. Let's say the function doc asks for an integer-type argument x, and the source code of that function is defined as:

def func(x: int):
    for i in range(1, 100 // x):
        print(i)

Here, if you follow the function doc and call func(5), it works fine. But if you call func(5.0), you get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in func
TypeError: 'float' object cannot be interpreted as an integer

Since the model doesn't have access to the function's source code, it cannot safely assume that it's okay to get the type wrong. Moreover, in statically typed languages like Java, the compiler will reject a float value where an int is expected; the code simply won't compile.

Regarding your second concern: thank you for pointing out how we format the prompts. We do observe questions starting with "How do I?" or similar, in which the user asks a question rather than giving a tool-use instruction. Here are 2 scenarios:

With the above, if a model outputs a full natural-language explanation, it is either not a user-friendly function-calling model or it is not following the prompt instructions carefully. We would love to rephrase the questions with more diverse tones to better simulate real-world tool usage, but that will likely come in a future release.

danieljannai21 commented 3 months ago

Thanks, @HuanzhiMao!

Thanks again for the wonderful work you guys are doing.