**Closed** — BillChou5555 closed this issue 1 month ago
Thank you for your attention to our work. Currently, BoT v1 achieves high accuracy when deployed on powerful language models such as GPT-4 and Llama3-70B. The lower accuracy you encountered stems from Llama3-8B-Instruct occasionally failing to instantiate thought templates effectively on complex reasoning tasks. We will address this in upcoming updates as we work to support smaller models. Please stay tuned for these improvements.
Hello authors, I ran the checkmate benchmark with the Metallama-8B-Instruct model and got an accuracy of 4%, which is far below the 56.7% reported in your paper. What might have gone wrong? I made minor modifications to the code in run_benchmark; my code is below.
```python
# run_benchmarks_modify.py
import json
import argparse

from bot_pipeline_modify import BoT

parser = argparse.ArgumentParser()
parser.add_argument('--task_name', type=str, default='gameof24',
                    choices=['gameof24', 'checkmate', 'wordsorting'])
parser.add_argument('--api_key', type=str, help='input your api key here')
parser.add_argument('--model_id', type=str, default='gpt-4o',
                    help='Input model id here; if using a local model, input the path to the local model')

GameOf24 = """
Let's play a game called 24. You'll be given four integers, and your objective is to use each number only once, combined with any of the four arithmetic operations (addition, subtraction, multiplication, and division) and parentheses, to achieve a total of 24. For example, if the input is 4, 7, 8, and 8, the output could be 7 * 8 - 4 * 8 = 24. You only need to find one feasible solution!
Input:
"""
CheckmateInOne = """
Given a series of chess moves written in Standard Algebraic Notation (SAN), determine the next move that will result in a checkmate.
Input:
"""
WordSorting = """
Sort a list of words alphabetically, placing them in a single line of text separated by spaces.
Input:
"""

if __name__ == "__main__":
    args = parser.parse_args()
    task = args.task_name
    api_key = args.api_key
    model_id = args.model_id
    benchmark_dict = {
        'gameof24': GameOf24,
        'checkmate': CheckmateInOne,
        'wordsorting': WordSorting
    }
    path_dict = {
        'gameof24': 'benchmarks/gameof24.jsonl',
        'checkmate': 'benchmarks/CheckmateInOne.jsonl',
        'wordsorting': 'benchmarks/word_sorting.jsonl'
    }
    buffer_dict = {
        'gameof24': 0,
        'checkmate': 1,
        'wordsorting': 2
    }
    user_prompt = benchmark_dict[task]
    path = path_dict[task]
    problem_id = buffer_dict[task]
    test_bot = BoT(
        user_input=None,
        problem_id=problem_id,
        api_key=api_key,
        model_id=model_id
    )
    with open(path) as f:
        for line in f:
            input = json.loads(line)['input']
            user_input = user_prompt + input
            test_bot.set_problem_id(problem_id=problem_id)  # modification: sets self.problem_id = problem_id
            test_bot.set_user_input(user_input=user_input)
            result = test_bot.bot_run()
            tmp = {'input': input, 'result': result}
            with open(f'test_results/BoT_{task}_modify.jsonl', 'a+', encoding='utf-8') as file:
                json_str = json.dumps(tmp)
                file.write(json_str + '\n')
```
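In case it helps narrow down the 4% vs. 56.7% gap, here is a minimal scoring sketch for the checkmate results. It assumes the benchmark jsonl carries a `target` field with the checkmating move (as in the BIG-bench Hard format) and that results were written by the script above as `{'input': ..., 'result': ...}`; a result is counted correct if the target move string appears anywhere in the model's answer, which is only a rough proxy for the paper's exact evaluation.

```python
# score_checkmate.py — rough accuracy check for the BoT checkmate results.
# Assumes: benchmark lines look like {"input": "...moves...", "target": "Qxf7#"}
# and result lines look like {"input": "...moves...", "result": "...answer..."}.
import json

def score_results(benchmark_path, results_path):
    """Return (correct, total), counting a result as correct when the
    target move string appears anywhere in the model's answer text."""
    # Map each benchmark input to its expected checkmating move.
    targets = {}
    with open(benchmark_path, encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            targets[record['input']] = record['target']
    correct = total = 0
    with open(results_path, encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            total += 1
            target = targets.get(record['input'])
            if target is not None and target in record['result']:
                correct += 1
    return correct, total

# Example usage (paths match the script above):
# correct, total = score_results('benchmarks/CheckmateInOne.jsonl',
#                                'test_results/BoT_checkmate_modify.jsonl')
# print(f'accuracy: {correct / total:.1%}')
```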
The run results are in the attachment: Uploading testResult.zip…
Thanks for your attention! We will provide the code for small LLMs (e.g., Llama3-8B) for reproduction within two weeks.
Thank you for your timely reply; I look forward to your updates.