YangLing0818 / buffer-of-thought-llm

Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models
https://arxiv.org/abs/2406.04271
424 stars · 40 forks

Large discrepancy in checkmate results #3

Closed: BillChou5555 closed this issue 1 month ago

BillChou5555 commented 1 month ago

Hello authors, I ran the checkmate benchmark with the Meta-Llama-3-8B-Instruct model and got an accuracy of 4%, far below the 56.7% reported in your paper. What might have gone wrong? I made only minor modifications to the code in run_benchmark; here is my code.

```python
# run_benchmarks_modify.py

import argparse
import json

from bot_pipeline_modify import BoT

parser = argparse.ArgumentParser()
parser.add_argument('--task_name', type=str, default='gameof24',
                    choices=['gameof24', 'checkmate', 'wordsorting'])
parser.add_argument('--api_key', type=str, help='input your api key here')
parser.add_argument('--model_id', type=str, default='gpt-4o',
                    help='Input model id here; if using a local model, input the path to the local model')

GameOf24 = """
Let's play a game called 24. You'll be given four integers, and your objective is to use each number only once, combined with any of the four arithmetic operations (addition, subtraction, multiplication, and division) and parentheses, to achieve a total of 24. For example, if the input is 4, 7, 8, and 8, the output could be 7 * 8 - 4 * 8 = 24. You only need to find one feasible solution!
Input:
"""
CheckmateInOne = """
Given a series of chess moves written in Standard Algebraic Notation (SAN), determine the next move that will result in a checkmate.
Input:
"""
WordSorting = """
Sort a list of words alphabetically, placing them in a single line of text separated by spaces.
Input:
"""

if __name__ == "__main__":
    args = parser.parse_args()
    task = args.task_name
    api_key = args.api_key
    model_id = args.model_id

    # Per-task prompt template, benchmark file, and meta-buffer index.
    benchmark_dict = {
        'gameof24': GameOf24,
        'checkmate': CheckmateInOne,
        'wordsorting': WordSorting,
    }
    path_dict = {
        'gameof24': 'benchmarks/gameof24.jsonl',
        'checkmate': 'benchmarks/CheckmateInOne.jsonl',
        'wordsorting': 'benchmarks/word_sorting.jsonl',
    }
    buffer_dict = {
        'gameof24': 0,
        'checkmate': 1,
        'wordsorting': 2,
    }

    user_prompt = benchmark_dict[task]
    path = path_dict[task]
    problem_id = buffer_dict[task]

    test_bot = BoT(
        user_input=None,
        problem_id=problem_id,
        api_key=api_key,
        model_id=model_id,
    )

    with open(path, encoding='utf-8') as f:
        for line in f:
            problem_input = json.loads(line)['input']
            user_input = user_prompt + problem_input

            # Modification: sets self.problem_id = problem_id
            test_bot.set_problem_id(problem_id=problem_id)
            test_bot.set_user_input(user_input=user_input)

            result = test_bot.bot_run()
            record = {'input': problem_input, 'result': result}
            with open(f'test_results/BoT_{task}_modify.jsonl', 'a+', encoding='utf-8') as file:
                file.write(json.dumps(record) + '\n')
```
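For reference, accuracy over the saved results file can be computed with a short post-hoc script like the one below. This is only a sketch: the `target` field name and the substring check are assumptions, since the exact schema of `CheckmateInOne.jsonl` and the exact format of `bot_run()`'s output are not shown in this thread.

```python
import json


def score_results(results_path, benchmark_path):
    """Compare saved BoT results against ground-truth moves.

    Assumes each benchmark line holds an 'input' and a ground-truth
    'target' move, and counts a result correct if the target move
    appears as a substring of the model's output.
    """
    with open(results_path, encoding='utf-8') as f:
        results = [json.loads(line) for line in f]

    truth = {}
    with open(benchmark_path, encoding='utf-8') as f:
        for line in f:
            rec = json.loads(line)
            truth[rec['input']] = rec['target']

    correct = 0
    for r in results:
        gold = truth.get(r['input'])
        if gold and gold.strip() in str(r['result']):
            correct += 1
    return correct / len(results) if results else 0.0
```

A stricter scorer would parse the predicted SAN move exactly rather than substring-match, but this is enough to sanity-check the reported accuracy.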

The run results are attached: [Uploading testResult.zip…]()

BitCodingWalkin commented 1 month ago

Thank you for your attention to our work. Currently, our BoT v1 exhibits high accuracy when deployed on powerful language models such as GPT-4 and Llama3-70B. The lower accuracy results you have encountered are attributed to the occasional difficulty of Llama3-8B-Instruct in effectively instantiating thought templates when faced with complex reasoning tasks. This issue will be addressed in our upcoming updates as we strive to support the usage of smaller models. Please stay tuned for our imminent improvements.

YangLing0818 commented 1 month ago


Thanks for your attention! We will provide code supporting small LLMs (e.g., Llama3-8B) for reproduction within two weeks.

BillChou5555 commented 1 month ago

Thank you for your timely reply; I look forward to your updates.