OpenBMB / ToolBench

[ICLR'24 spotlight] An open platform for training, serving, and evaluating large language models for tool learning.
https://openbmb.github.io/ToolBench/
Apache License 2.0
4.62k stars 397 forks

ToolLLaMA outputs inappropriate responses during candidate ranking. #225

Open TakuyaHiraoka opened 6 months ago

TakuyaHiraoka commented 6 months ago

Great work!

I'm currently trying to use ToolLLaMA with DFSDT, but ToolLLaMA seems to output inappropriate responses during the candidate-ranking step of DFSDT.

I run ToolLLaMA with the following command:

python toolbench/inference/qa_pipeline.py --tool_root_dir data/toolenv/tools/ --backbone_model toolllama --model_path ToolBench/ToolLLaMA-2-7b-v2 --max_observation_length 1024 --observ_compress_method truncate --method DFS_woFiler_w2 --input_query_file data/instruction/inference_query_demo.json --output_answer_file ./toolllama_dfs_inference_result --rapidapi_key My_RAPID_API_KEY --use_rapidapi_key

Candidate ranking is performed in toolbench/inference/LLM_rank/rank_candidate.py:

def rank2_subfix(llm_interface,LLM_rank_args, cand1,cand2):
    anscestor_interesction = tree_node.find_ancestor_intersection(cand1,cand2)
    assert anscestor_interesction != None
    intersect_trice = anscestor_interesction.get_former_trice_from_this_node(end_node=None)
    trice_1 = cand1.get_former_trice_from_this_node(end_node=anscestor_interesction)
    trice_2 = cand2.get_former_trice_from_this_node(end_node=anscestor_interesction)

    system_message = LLM_PAIRWISE_RANK_SUBFIX_SYSTEM_PROMPT
    system_message = system_message.replace("{task_description}", LLM_rank_args["task_description"])
    system_message = system_message.replace("{intersect_trice}", intersect_trice)
    system_message = system_message.replace("{candidate_A}",trice_1)
    system_message = system_message.replace("{candidate_B}",trice_2)
    llm_interface.change_messages([{"role":"system","content":system_message},
                                   {"role":"user","content":LLM_PAIRWISE_RANK_USER_PROMPT},
                                   ])
    output,error_code, total_tokens = llm_interface.parse(functions=[],function_call="none",process_id=LLM_rank_args["process_id"])

    if output["content"].strip().lower()[-1] == "a":  # the reply is expected to end with "A" or "B"
        return 1, 1, total_tokens
    else:
        return 0, 1, total_tokens

Now, suppose the inputs (system_message, LLM_PAIRWISE_RANK_USER_PROMPT, and LLM_rank_args) passed to ToolLLaMA (llm_interface) are:

system_message :
You are value-GPT, which is an expert of defining which trail is better, which trail is more close to solving the task.
All candidate tries to solve this task with some funciton calls:
*******************************
{{TASK_DESCRIPTION}}
You should use functions to help handle the real time user querys. Remember:
1.ALWAYS call "Finish" function at the end of the task. And the final answer should contain enough information to show to the user,If you can't handle the task, or you find that function calls always fail(the function is not valid now), use function Finish->give_up_and_restart.
2.Do not use origin tool names, use only subfunctions' names.
You have access of the following tools:
1.olato_quotes: Olato Quote APIs gives you random quotes about motivation, sucess and love quotes.

{{END_TASK_DESCRIPTION}}
*******************************
First, all candidate do the following things:
None
After that, there are two candidates A and B, they do different things:
*******************************
{{CANDIDATE_A_START}}
step_1: Thought: I am calling the "love_quote_for_olato_quotes" function from the "olato_quotes" tool. This function will provide me with random quotes about love. By using this function, I can fulfill the user's request for meaningful quotes to include in the party decorations. The quotes will celebrate love and add a special touch to the surprise party for the user's best friend.

step_2: Action: love_quote_for_olato_quotes

step_3: Action Input: {}

step_4: Observation: {"error": "", "response": "{'quote': 'I have waited for this opportunity for more than half a century, to repeat to you once again my vow of eternal fidelity and everlasting love. -Gabriel Garcia Marquez'}"}

{{CANDIDATE_A_END}}
*******************************
{{CANDIDATE_B_START}}
step_1: Thought: Based on the previous action, it seems that I tried to retrieve a random love quote using the "love_quote_for_olato_quotes" function. However, the function call failed and returned an error.

To generate different actions, I can try using the "success_quote_for_olato_quotes" function to retrieve random quotes about success. This will provide me with a variety of quotes that can celebrate love, success, and inspire everyone at the party.

I will now call the "success_quote_for_olato_quotes" function to retrieve random quotes about success.

step_2: Action: success_quote_for_olato_quotes

step_3: Action Input: {}

step_4: Observation: {"error": "", "response": "{'quote': 'Success is walking from failure to failure with no loss of enthusiasm. -Winston Churchill'}"}

{{CANDIDATE_B_END}}
Which try do you think is more helpful to solving the task?
LLM_PAIRWISE_RANK_USER_PROMPT :
Tell me which candidate is better in ONE Word: "A" or "B":
LLM_rank_args : 
{'functions': [{'name': 'love_quote_for_olato_quotes', 'description': 'This is the subfunction for tool "olato_quotes", you can use this tool.The description of this function is: "It shows random quotes"', 'parameters': {'type': 'object', 'properties': {}, 'required': [], 'optional': []}}, {'name': 'success_quote_for_olato_quotes', 'description': 'This is the subfunction for tool "olato_quotes", you can use this tool.The description of this function is: "It shows random quotes"', 'parameters': {'type': 'object', 'properties': {}, 'required': [], 'optional': []}}, {'name': 'motivation_quote_for_olato_quotes', 'description': 'This is the subfunction for tool "olato_quotes", you can use this tool.The description of this function is: "It shows random quotes"', 'parameters': {'type': 'object', 'properties': {}, 'required': [], 'optional': []}}, {'name': 'Finish', 'description': 'If you believe that you have obtained a result that can answer the task, please call this function to provide the final answer. Alternatively, if you recognize that you are unable to proceed with the task in the current state, call this function to restart. Remember: you must ALWAYS call this function at the end of your attempt, and the only part that will be shown to the user is the final answer, so it should contain sufficient information.', 'parameters': {'type': 'object', 'properties': {'return_type': {'type': 'string', 'enum': ['give_answer', 'give_up_and_restart']}, 'final_answer': {'type': 'string', 'description': 'The final answer you want to give the user. You should have this field if "return_type"=="give_answer"'}}, 'required': ['return_type']}}], 'process_id': 0, 'task_description': 'You should use functions to help handle the real time user querys. Remember:\n1.ALWAYS call "Finish" function at the end of the task. 
And the final answer should contain enough information to show to the user,If you can\'t handle the task, or you find that function calls always fail(the function is not valid now), use function Finish->give_up_and_restart.\n2.Do not use origin tool names, use only subfunctions\' names.\nYou have access of the following tools:\n1.olato_quotes: Olato Quote APIs gives you random quotes about motivation, sucess and love quotes.\n', 'rank_func': <function rank2_subfix at XXXX>}

Then, ToolLLaMA's response is:

output : {'role': 'assistant', 'content': 'Based on the previous actions, it seems that both candidates tried to retrieve random quotes about different topics, but the function calls failed.\n\nTo generate a different action, I can try using the "motivation_quote_for_olato_quotes" function to retrieve random quotes about motivation. This will provide me with a new set of quotes that can inspire and motivate everyone at the party.\n\nI will now call the "motivation_quote_for_olato_quotes" function to retrieve random quotes about motivation.\n\nstep_2: Action: motivation_quote_for_olato_quotes\n\nstep_3: Action Input: {', 'function_call': {'name': 'motivation_quote_for_olato_quotes\n\nstep_3: Action Input: {', 'arguments': '{}'}}

However, the candidate ranking expects a response ending with "A" or "B", so the response above cannot be interpreted correctly.
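Tracing the check by hand confirms this (a minimal reproduction; the long reply is abbreviated, since only its tail matters for the check):

```python
# Reproduce the last-character check from rank2_subfix on the reply above.
content = (
    "Based on the previous actions, it seems that both candidates tried ...\n\n"
    "step_2: Action: motivation_quote_for_olato_quotes\n\n"
    "step_3: Action Input: {"
)

last_char = content.strip().lower()[-1]
print(last_char)  # "{" -- not "a", so the else branch runs and candidate B scores higher
```

So whenever ToolLLaMA emits a tool call instead of a one-word verdict, the function silently counts it as a vote for candidate B.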

def rank2_subfix(llm_interface,LLM_rank_args, cand1,cand2):
    ...
    if output["content"].strip().lower()[-1] == "a":  # the reply is expected to end with "A" or "B"
        return 1, 1, total_tokens
    else:
        return 0, 1, total_tokens

Is this inappropriate response caused by my incorrect use of ToolLLaMA, or is it an ignorable corner case?
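For what it's worth, a possible stopgap (a sketch I put together, not code from this repository) is to parse the verdict more defensively: accept a bare "A"/"B" reply, otherwise take the last "Candidate A"/"Candidate B" mention, and return an explicit default when no verdict is found at all (e.g. when the model emits a tool call instead):

```python
import re

def parse_rank_choice(content: str, default: int = 0) -> int:
    """Hypothetical hardened parser for the pairwise-rank reply.

    Returns 1 if the model picked candidate A, 0 if it picked B,
    and `default` when no clear verdict can be found.
    """
    text = content.strip().lower()
    # Case 1: the reply is literally "A" / "B" (possibly quoted or punctuated),
    # which is what LLM_PAIRWISE_RANK_USER_PROMPT asks for.
    m = re.fullmatch(r'["\']?([ab])["\'.!]?', text)
    if m:
        return 1 if m.group(1) == "a" else 0
    # Case 2: the reply mentions "candidate a" / "candidate b"; take the last mention.
    mentions = re.findall(r"candidate\s+([ab])\b", text)
    if mentions:
        return 1 if mentions[-1] == "a" else 0
    # Case 3: no parsable verdict (e.g. the model emitted a tool call);
    # fall back instead of silently preferring one candidate.
    return default
```

With something like this, the unparsable reply above would hit the explicit fallback rather than being counted as a vote for candidate B.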

code-wanderlust commented 3 months ago

I am also suffering from this issue. Is there a solution?