glm4-9b-chat和LoRA微调模型merge之后，使用vLLM推理，工具调用功能报错。

System Info / 系統信息

vllm 0.5.3
transformer 4.44.0
torch 2.3.1

Who can help? / 谁可以帮助到您？

@sixsixcoder @zr

Information / 问题信息

[X] The official example scripts / 官方的示例脚本
[X] My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

1. LoRA微调

使用LLaMA-Factory，自定义数据集、yaml文件，llamafactory-cli train进行LoRA微调。

2.glm4-9b-chat和LoRA进行merge

使用LLaMA-Factory，llamafactory-cli export 得到merge模型。

3.工具调用测试代码

使用官方提供的openai_api_server.py, vLLM推理。

工具测试部分代码：

tools = {
    "weather": weather,
}

# 绑定工具
llm_with_tools = llm.bind_tools(list(tools.values()))

context = []
def process_query(query):

    global context
    # 将用户的查询添加到上下文中
    context.append({"role": "user", "content": query})

    # 调用 LLM
    response = llm_with_tools.invoke(context)
    print(response)

    if response.tool_calls:
        # 如果有工具调用，则执行工具调用
        tool_call = response.tool_calls[0]
        tool_name = tool_call["name"]
        tool = tools[tool_name]

        # 获取工具调用的参数并解包传递给工具函数
        tool_arguments = tool_call["args"]
        tool_result = tool(**tool_arguments) 

        # 将工具结果添加到上下文
        context.append({"role": "system", "content": f"你可以通过工具得到实时的天气信息，工具得到的结果是：\n\n{tool_result}\n\n，这个结果绝对准确，你可以直接使用该结果进行表述。"})

        # 工具调用后的上下文继续传递给 LLM，以生成最终响应
        response = llm.invoke(context)

    # 将 LLM 的响应添加到上下文中
    context.append({"role": "assistant", "content": response.content})

    return response.content

#测试
query_1 = "今天深圳的天气怎么样"
response_1 = process_query(query_1)
print(response_1)

4.模型测试

测试如下，用base和两种不同的lora调用方式测试

glm4-9b-chat模型

merge模型

glm4-9b-chat模型+lora_request参数

glm4-9b-chat模型输出结果：


content='' additional_kwargs={'tool_calls': [{'id': 'call_M7oa0Lip6JIV3yZ17D3ZDSiU', 'function': {'arguments': '{"city": "深圳"}', 'name': 'weather'}, 'type': 'function', 'index': 0}], 'refusal': None} response_metadata={'token_usage': {'completion_tokens': 9, 'prompt_tokens': 402, 'total_tokens': 411}, 'model_name': 'glm4-9b-chat', 'system_fingerprint': 'fp_jg9prlr6x', 'finish_reason': 'tool_calls', 'logprobs': None} id='run-14ea30ff-3a50-4996-931b-c38f19ee04bc-0' tool_calls=[{'name': 'weather', 'args': {'city': '深圳'}, 'id': 'call_M7oa0Lip6JIV3yZ17D3ZDSiU', 'type': 'tool_call'}] usage_metadata={'input_tokens': 402, 'output_tokens': 9, 'total_tokens': 411}

tools result:

深圳目前时刻的天气是多云，温度为26.0℃，湿度为38.0%，风向为东北，风力为≤3级

LLM_response: 今天深圳的天气是多云，气温约为26摄氏度，相对湿度为38%，风向为东北风，风力较弱，风速不超过3级。

- LoRA merge模型

输出结果： INFO: 172.16.21.155:36244 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error ERROR: Exception in ASGI application Traceback (most recent call last): File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi result = await app( # type: ignore[func-returns-value] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in call return await self.app(scope, receive, send) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in call await super().call(scope, receive, send) File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/starlette/applications.py", line 113, in call await self.middleware_stack(scope, receive, send) File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/starlette/middleware/errors.py", line 187, in call raise exc File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/starlette/middleware/errors.py", line 165, in call await self.app(scope, receive, _send) File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in call await self.app(scope, receive, send) File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in call await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send) File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app raise exc File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app await app(scope, receive, sender) File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/starlette/routing.py", line 715, in call await self.middleware_stack(scope, receive, send) File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/starlette/routing.py", line 735, in app await route.handle(scope, receive, send) File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/starlette/routing.py", line 288, in handle await self.app(scope, receive, send) File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/starlette/routing.py", line 76, in app await wrap_app_handling_exceptions(app, request)(scope, receive, send) File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app raise exc File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app await app(scope, receive, sender) File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/starlette/routing.py", line 73, in app response = await f(request) ^^^^^^^^^^^^^^^^ File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/fastapi/routing.py", line 301, in app raw_response = await run_endpoint_function( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/fastapi/routing.py", line 212, in run_endpoint_function return await dependant.call(**values) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/ljm/ChatGLM4/GLM-4/api_server/openai_api_server.py", line 389, in create_chat_completion async for response in generate_stream_glm4(gen_params): File "/root/ljm/ChatGLM4/GLM-4/api_server/openai_api_server.py", line 205, in generate_stream_glm4 inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1844, in apply_chat_template rendered_chat = compiled_template.render( ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/jinja2/environment.py", line 1304, in render self.environment.handle_exception() File "/root/anaconda3/envs/glm4_9b-chat-128k_vLLM/lib/python3.11/site-packages/jinja2/environment.py", line 939, in handle_exception raise rewrite_traceback_stack(source=source) File "

THUDM / GLM-4