ShishirPatil / gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
https://gorilla.cs.berkeley.edu/
Apache License 2.0

[BFCL] javascript test category fails on checking Array argument's type #776

Closed leocnj closed 1 day ago

leocnj commented 3 days ago

When using the latest BFCL v3 to evaluate the MadeAgents_Hammer2.0-1.5b model, I found that the model fails on the javascript_2 sample. However, on closer inspection, this error looks like a false alarm: the status argument needs to be an array of String items, but when checking the type of the status argument, the eval code seems to incorrectly expect a String.

{
  "id": 3,
  "model_name": "MadeAgents_Hammer2.0-1.5b",
  "test_category": "javascript",
  "valid": false,
  "error": [
    "Incorrect type for parameter 'status'. Expected type String, got list. Parameter value: ['completed', 'failed']."
  ],
  "error_type": "type_error:js",
  "prompt": {
    "id": "javascript_2",
    "question": [
      [
        {
          "role": "user",
          "content": "Help me extract the last transaction ID that has a status of 'completed' or 'failed' from a database log located at '/var/log/db.log', using 'utf-8' encoding, and process the information with a processing function?"
        }
      ]
    ],
    "function": [
      {
        "name": "extractLastTransactionId",
        "description": "This function scans a database log file for lines indicating transaction completion or failure, extracting the last transaction ID that matches the criteria. It uses a processing function `processFunction` to further handle the extracted transaction ID.",
        "parameters": {
          "type": "dict",
          "properties": {
            "filepath": {
              "type": "String",
              "description": "The path to the database log file to be examined."
            },
            "status": {
              "type": "array",
              "items": {
                "type": "String"
              },
              "description": "An array of statuses to search for within the log file, indicating the end of a transaction."
            },
            "encoding": {
              "type": "String",
              "description": "The encoding of the log file."
            },
            "processFunction": {
              "type": "any",
              "description": "A function that processes the extracted transaction ID."
            }
          },
          "required": [
            "filepath",
            "status",
            "encoding",
            "processFunction"
          ]
        }
      }
    ]
  },
  "model_result_raw": "```\n[{\"name\": \"extractLastTransactionId\", \"arguments\": {\"filepath\": \"/var/log/db.log\", \"status\": [\"completed\", \"failed\"], \"encoding\": \"utf-8\", \"processFunction\": \"processTransaction\"}}]\n```",
  "model_result_decoded": [
    {
      "extractLastTransactionId": {
        "filepath": "/var/log/db.log",
        "status": [
          "completed",
          "failed"
        ],
        "encoding": "utf-8",
        "processFunction": "processTransaction"
      }
    }
  ],
  "possible_answer": [
    {
      "extractLastTransactionId": {
        "filepath": [
          "/var/log/db.log"
        ],
        "status": [
          [
            "completed",
            "failed"
          ]
        ],
        "encoding": [
          "utf-8"
        ],
        "processFunction": [
          "processFunction"
        ]
      }
    }
  ]
}

HuanzhiMao commented 2 days ago

Hi @leocnj, Thanks for the issue. We are investigating.

HuanzhiMao commented 2 days ago

Hi @leocnj, this is not a false alarm. For the Java and JavaScript categories, we pre-process the prompt and the function document before querying the model. Specifically, at the end of the prompt, we explicitly state that the provided function is in Java 8/JavaScript syntax. We also change each parameter type to String (since String is JSON compatible) and add to the parameter description that "This is Java/Javascript" + {original_type} + " in string representation." So in your example, the model should provide the JavaScript array in its string representation, e.g. '["completed", "failed"]'.
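
For readers following along, here is a minimal Python sketch of the kind of pre-processing described above. It is not the actual BFCL code: the function name `stringify_param_types`, its `language` argument, and the exact wording of the appended sentence are assumptions made for illustration, and the nested "items" entry is left untouched because the thread does not say how it is handled.

```python
import copy


def stringify_param_types(function_doc, language="JavaScript"):
    """Return a copy of function_doc with every parameter typed as String.

    Hypothetical helper: the real BFCL pre-processing may differ in
    wording and in how it treats nested fields such as "items".
    """
    doc = copy.deepcopy(function_doc)  # do not mutate the caller's document
    for spec in doc["parameters"]["properties"].values():
        original_type = spec["type"]
        # Record the original type in the description so the model knows
        # what the string is supposed to encode.
        spec["description"] += (
            f" This is {language} {original_type} in string representation."
        )
        spec["type"] = "String"
    return doc


# Reduced version of the javascript_2 function document from this issue.
function_doc = {
    "name": "extractLastTransactionId",
    "parameters": {
        "type": "dict",
        "properties": {
            "status": {
                "type": "array",
                "items": {"type": "String"},
                "description": "An array of statuses to search for.",
            }
        },
        "required": ["status"],
    },
}

processed = stringify_param_types(function_doc)
status_spec = processed["parameters"]["properties"]["status"]
print(status_spec["type"])
print(status_spec["description"])
```

After this step the model sees status declared as a String, which is why a bare list in the model output is reported as a type error.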

leocnj commented 1 day ago

@HuanzhiMao , thanks very much for your prompt answer.

In the Hammer model handler, I found the code you referred to. Given this change to the function spec, I agree that flagging the model output (a list instead of a string) is not a false alarm.

https://github.com/ShishirPatil/gorilla/blob/12629f27ae5675adca57db79e2d0b3b3144b3e55/berkeley-function-call-leaderboard/bfcl/model_handler/oss_model/hammer.py#L145
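
To make the outcome of this thread concrete, the sketch below shows why a list is rejected once the parameter is declared as String, and what a passing value looks like. The helper `check_string_typed_param` is hypothetical and only mimics the error message quoted in the issue; it is not the BFCL checker.

```python
import json


def check_string_typed_param(name, value):
    """Return a list of error messages for a parameter declared as String.

    Hypothetical stand-in for the evaluator's type check.
    """
    if not isinstance(value, str):
        return [
            f"Incorrect type for parameter {name!r}. Expected type String, "
            f"got {type(value).__name__}. Parameter value: {value!r}."
        ]
    return []


# What the model produced: a real list, which fails the String check.
list_errors = check_string_typed_param("status", ["completed", "failed"])

# What the pre-processed spec asks for: the array in string representation.
string_value = '["completed", "failed"]'
string_errors = check_string_typed_param("status", string_value)

# The string form is valid JSON here, so it still carries the same two items.
decoded = json.loads(string_value)

print(list_errors)
print(string_errors)
print(decoded)
```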