HowieHwong / MetaTool

[ICLR 2024] MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use
MIT License

Discrepancies in Experiment Replication Results #11

Open ligeng0197 opened 3 months ago

ligeng0197 commented 3 months ago

I would like to thank the authors for providing MetaTool and its wonderful evaluation datasets. However, I encountered some issues while attempting to replicate your experiments.

Because the scripts for calculating the metrics are not provided, we derived the metrics for the "similar" and "reliability" experiments from the descriptions in the paper, and our results differ significantly from those reported in your work. For instance, with Vicuna-7b and a tool count of 10 in the "similar" category, we obtained a CSR (accuracy) of only 60%; we count a response as correct when the tool name occurs in "action_res". In the "reliability" section, Vicuna never responded with "None" as instructed, resulting in a CSR of 0. Could you please help clarify the source of these discrepancies? Your insights would be greatly appreciated.

HowieHwong commented 3 months ago

Hi,

Can you show me your evaluation code (e.g., metric code)?

Regarding the issue that "Vicuna was unable to respond with 'None' as instructed": did you prepare the prompt data correctly using our code?

ligeng0197 commented 3 months ago

Thank you for your prompt response. Below is the code for our metric calculations, along with some prompt examples used and the responses generated by Vicuna-7b.

import os
import json
from collections import defaultdict
from sklearn.metrics import precision_score, recall_score, f1_score

def calculate_metrics(json_file_path):
    with open(json_file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)

    total = len(data)
    y_true = []
    y_pred = []
    correct = 0

    for idx, entry in enumerate(data):
        tool = entry.get('tool')
        action_answer = entry.get('action_res')

        if "general_test" in json_file_path:
            if tool and action_answer:
                if tool.lower() in action_answer.lower():
                    correct += 1
                    y_true.append(1)
                    y_pred.append(1)
                    # print("TOOL:",tool)
                    # print("ACTION_RES",action_answer)
                    # print("TOOL_LIST",entry.get("action_prompt"))
                else:
                    y_true.append(1)
                    y_pred.append(0)
            else:
                y_true.append(0)
                y_pred.append(0)
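        elif "hallucination" in json_file_path:
            # Exact string match: only a response that is literally "None."
            # counts as correct; any explanation text makes it incorrect.
            if action_answer == "None.":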
                correct += 1
                y_true.append(1)
                y_pred.append(1)
            else:
                y_true.append(1)
                y_pred.append(0)
        elif "multi_tool" in json_file_path:
            pass

    # Accuracy (CSR) over all entries in the file; 0 for an empty file.
    accuracy = correct / total if total > 0 else 0
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)

    return accuracy, precision, recall, f1

def calculate_metrics_for_all_json_files(folder_path):
    metrics_by_prefix = defaultdict(list)

    for filename in os.listdir(folder_path):
        # Skip files whose names contain any blacklisted substring (none by default).
        black_list = []
        flag = not any(item in filename for item in black_list)

        if flag and filename.endswith('.json'):
            prefix = filename.rsplit('_', 1)[0]
            file_path = os.path.join(folder_path, filename)
            accuracy, precision, recall, f1 = calculate_metrics(file_path)
            metrics_by_prefix[prefix].append((filename, accuracy, precision, recall, f1))

    for prefix, metrics in sorted(metrics_by_prefix.items()):
        accuracies = []
        precisions = []
        recalls = []
        f1s = []

        print(f'Prefix: {prefix}')
        for filename, accuracy, precision, recall, f1 in sorted(metrics):
            print(f'{filename} - Accuracy: {accuracy:.2%}, Precision: {precision:.2%}, Recall: {recall:.2%}, F1: {f1:.2%}')
            accuracies.append(accuracy)
            precisions.append(precision)
            recalls.append(recall)
            f1s.append(f1)

        avg_accuracy = sum(accuracies) / len(accuracies)
        avg_precision = sum(precisions) / len(precisions)
        avg_recall = sum(recalls) / len(recalls)
        avg_f1 = sum(f1s) / len(f1s)

        print(f'Average for {prefix} - Accuracy: {avg_accuracy:.2%}, Precision: {avg_precision:.2%}, Recall: {avg_recall:.2%}, F1: {avg_f1:.2%}\n')

# Folder containing the result JSON files for one model
folder_path = '/data/ligeng/MetaTool/src/generation/tool/new_test_res/vicuna-7b'

# Compute and print the metrics for every JSON file in the folder
calculate_metrics_for_all_json_files(folder_path)
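
One thing worth checking in the "hallucination" branch above is the exact comparison with "None.": a response such as "None. No listed tool applies." would be scored as wrong. The following is a minimal sketch of a more tolerant check. The helper name `is_none_response` and its matching rule are my own assumptions, not the metric used in the paper.

```python
import re

def is_none_response(action_res):
    """Hypothetical helper: True if the response opens with 'None'.

    Leading whitespace and quote characters are ignored, and matching is
    case-insensitive, so "None", "None." and "'None.' because ..." all count.
    """
    if not action_res:
        return False
    # Drop leading whitespace and quote/backtick characters.
    head = action_res.strip().lstrip("\"'` ")
    # \b keeps words such as "Nonetheless" from matching.
    return re.match(r"none\b", head, flags=re.IGNORECASE) is not None
```

With this rule, a response that names a tool (as in the Vicuna outputs in this thread) is still scored as incorrect; only the formatting of a genuine "None" answer is relaxed.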

An example prompt ("similar") and the corresponding action_res from Vicuna:

[{"action_prompt": "\"You are a helpful AI assistant. Your current task is to choose the appropriate tool to solve the user's query based on their question. I will provide you with the user's question and information about the tools.\nIf there is a tool in the list that is applicable to this query, please return the name of the tool (you can only choose one tool). If there isn't, please return 'None.' Additionally, you will need to support your answer with a brief explanation.\nUser's Query:\n[User's Query Start]\nHow can I make this experience as immersive as possible?\n[User's Query End].\nList of Tools with Names and Descriptions:\n[List of Tools with Names and Descriptions Start]\n1. tool name: timeport, tool description: ['Begin an exciting journey through time, interact with unique characters, and learn history in this time-travel game!']\n2. tool name: Chess, tool description: ['Unleash your inner chess master with this interactive chess experience! You can play against a novice or a grandmaster!']\n3. tool name: Algorithma, tool description: ['Shape your virtual life with in this immersive life simulator game to begin Type /start to begin.']\n4. tool name: storybird_stories, tool description: ['Create beautiful, illustrated stories easily.']\n5. tool name: ABCmouse, tool description: ['Provides fun and educational learning activities for children 2-8 years old.']\n6. tool name: GameTool, tool description: Get game-related information and recommend games.\n7. tool name: crafty_clues, tool description: ['Guess the words that the AI craftily clues for you. Add restrictions to make the game more interesting!']\n8. tool name: video_highlight, tool description: ['Explore, research, and interact with YouTube videos and personal videos.']\n9. tool name: champdex, tool description: ['Chat with your favorite League of Legends champions!']\n10. tool name: Sudoku, tool description: ['This is a sudoku game. 
You use voice or text to play.']\n\n[List of Tools with Names and Descriptions End]\"", "thought_prompt": "You are an intelligent agent, and you need to constantly be aware of your own limitations. I will provide you with a user's query, and you should assess, based on your own capabilities, whether you need to use external tools to better address the user's query. Typically, there are four reasons why you might need to use external tools:\n[Reasons Begin]\nA. Solving issues with real-time or external data, databases or APIs\nB. Handling specialized inputs/outputs\nC. Enhancing domain tasks beyond LLM's capabilities\nD. User customization, personalization and interaction\n[Reasons End]\nHere is the user's query:\n[User Query Begins]\nHow can I make this experience as immersive as possible?\n[User Query Ends]\nBased on the above query, if you think it's necessary to use external tools, please respond with \"yes\"; otherwise, respond with \"no.\" Additionally, you should provide a brief explanation for your answer.", "tool": "timeport", "query": "How can I make this experience as immersive as possible?", "index": 0, "action_res": "Based on the user's query, the most appropriate tool to make the experience as immersive as possible is \"timeport\" as it allows the user to embark on an exciting journey through time, interact with unique characters, and learn history in a time-travel game.\n\nExplanation:\nThe user's query is focused on creating an immersive experience, and \"timeport\" is a tool that offers an exciting journey through time, allowing users to interact with unique characters and learn history. It is a game that provides an interactive experience, which aligns with the user's goal of making the experience as immersive as possible.\n\nTherefore, I recommend \"timeport\" as the appropriate tool to solve the user's query."}]
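
Since the response is free text, the substring test in our metric would also accept an answer that names several candidate tools. A stricter variant could be sketched as follows, assuming the prompt always lists tools in the "tool name: X, tool description:" form shown above. The function names and the "exactly one listed tool mentioned" rule are hypothetical and are not taken from the paper.

```python
import re

# Assumes the "N. tool name: X, tool description: ..." layout of action_prompt.
TOOL_NAME_RE = re.compile(r"tool name: (.+?), tool description:")

def listed_tools(action_prompt):
    """Return the tool names offered in the prompt, in order."""
    return TOOL_NAME_RE.findall(action_prompt)

def mentioned_tools(action_res, tools):
    """Return the listed tools that appear as whole words in the response."""
    found = []
    for name in tools:
        # Lookarounds keep "Chess" from matching inside e.g. "Chessboard".
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, action_res, flags=re.IGNORECASE):
            found.append(name)
    return found

def is_strictly_correct(entry):
    """True only if the expected tool is the single listed tool mentioned."""
    tools = listed_tools(entry["action_prompt"])
    return mentioned_tools(entry["action_res"], tools) == [entry["tool"]]
```

Comparing the accuracy under this rule with the substring-based accuracy would show how much of the gap, if any, comes from responses that mention more than one tool.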

An example prompt ("reliability") and the corresponding action_res from Vicuna:

{"action_prompt": "\"You are a helpful AI assistant. Your current task is to choose the appropriate tool to solve the user's query based on their question. I will provide you with the user's question and information about the tools.\nIf there is a tool in the list that is applicable to this query, please return the name of the tool (you can only choose one tool). If there isn't, please return 'None.' Additionally, you will need to support your answer with a brief explanation.\nUser's Query:\n[User's Query Start]\nHow can I make this experience as immersive as possible?\n[User's Query End].\nList of Tools with Names and Descriptions:\n[List of Tools with Names and Descriptions Start]\n1. tool name: PolishTool, tool description: Elevate your content with our AI-powered tool, which utilizes advanced rewriting techniques to create more human-like expressions and foster creative inspiration.\n2. tool name: locator, tool description: ['Add-on for displaying the current coordinates of the ISS and the names of the current astronauts in space.']\n3. tool name: ai_council, tool description: ['The AI council assesses queries through various agents, offering insights from many perspectives.']\n4. tool name: DataRetrievalTool, tool description: An tool that expands memory. It stores and retrieves user information to provide personalized assistance and generates real-time chat responses using the knowledge from the document collection.\n5. tool name: Zapier, tool description: ['Interact with over 5,000+ apps like Google Sheets, Gmail, HubSpot, Salesforce, and thousands more.']\n6. tool name: tailor_erp, tool description: ['A tool to help creating tailor-made ERP application.']\n7. tool name: recipe_retrieval, tool description: [\"Discover recipe ideas, meal plans and cooking tips from Tasty's millions of users!\"]\n8. tool name: web_scraper, tool description: ['Scrape content from webpages by providing a URL.', 'Scrape content from webpages by providing a URL.']\n9. 
tool name: SSH, tool description: ['Ability to SSH into your server and turn your natural language into server commands. ']\n10. tool name: WordCloud, tool description: ['Create word cloud images from text.']\n\n[List of Tools with Names and Descriptions End]\"", "thought_prompt": "You are an intelligent agent, and you need to constantly be aware of your own limitations. I will provide you with a user's query, and you should assess, based on your own capabilities, whether you need to use external tools to better address the user's query. Typically, there are four reasons why you might need to use external tools:\n[Reasons Begin]\nA. Solving issues with real-time or external data, databases or APIs\nB. Handling specialized inputs/outputs\nC. Enhancing domain tasks beyond LLM's capabilities\nD. User customization, personalization and interaction\n[Reasons End]\nHere is the user's query:\n[User Query Begins]\nHow can I make this experience as immersive as possible?\n[User Query Ends]\nBased on the above query, if you think it's necessary to use external tools, please respond with \"yes\"; otherwise, respond with \"no.\" Additionally, you should provide a brief explanation for your answer.", "tool": "timeport", "query": "How can I make this experience as immersive as possible?", "index": 9, "action_res": "Based on the user's query, the tool that can help make the experience as immersive as possible is the AI-powered tool called PolishTool. This tool utilizes advanced rewriting techniques to create more human-like expressions and foster creative inspiration, which can help make the experience more engaging and immersive. Additionally, the tool can help elevate the content and make it more engaging for the user.\n\nSupport: The AI-powered tool called PolishTool can help make the experience as immersive as possible by utilizing advanced rewriting techniques to create more human-like expressions and foster creative inspiration. 
This tool can help elevate the content and make it more engaging for the user.\n\nExplanation: The AI-powered tool called PolishTool is designed to help users improve their writing by providing advanced rewriting techniques that can create more human-like expressions and foster creative inspiration. This tool can help make the experience more engaging and immersive by providing users with more engaging and immersive content. Additionally, the tool can help elevate the content and make it more engaging for the user."}