bingoohe opened this issue 1 month ago
@zuxin666 Could you take a look at this issue with xLAM-7b-fc-r?
Hi @bingoohe, we haven't tried deploying the GGUF versions of the xLAM models with vLLM. As the official vLLM docs note, GGUF support may still be experimental. We would therefore suggest using our non-quantized versions when deploying with vLLM, or following the deployment instructions here for the quantized models.
Hi! This is great work. I tried deploying the model with vLLM. The vLLM service starts normally, but the following error occurs when the service is invoked:

```
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': 'As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.', 'type': 'BadRequestError', 'param': None, 'code': 400}
```
The service startup command is as follows:

```shell
CUDA_VISIBLE_DEVICES=0 vllm serve ./xLAM-7b-fc/xLAM-7b-fc-r.Q5_K_M.gguf \
  --trust-remote-code \
  --served-model-name xlam-7b-fc \
  --port 4040 \
  --api-key agent-model \
  --gpu-memory-utilization 0.5
```
Versions: vllm 0.6.0, transformers 4.43.2
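Client code:

```python
from openai import OpenAI

# Point the client at the local vLLM server (port and API key taken from the serve command).
client = OpenAI(base_url="http://localhost:4040/v1", api_key="agent-model")
model_name = "xlam-7b-fc"  # matches --served-model-name

messages = []
messages.append({"role": "user", "content": "你好"})  # "你好" means "Hello"
result = client.chat.completions.create(messages=messages, model=model_name, temperature=0)
```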
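A possible workaround, sketched under assumptions: the error says a chat template must be provided when the tokenizer defines none, and `vllm serve` accepts a `--chat-template` option that can point at a Jinja template file. The Python sketch below writes a *hypothetical*, generic template (its `<|role|>` tags are placeholders, not xLAM's official prompt format) and rebuilds the startup command from this issue with `--chat-template` appended. The file name `xlam_chat_template.jinja` and the helper names `write_template` and `build_serve_command` are illustrative only.

```python
from pathlib import Path
import shlex

# HYPOTHETICAL generic chat template. The <|role|> tags are placeholders and
# are NOT xLAM's official prompt format; replace them with the real format.
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "<|{{ message['role'] }}|>\n{{ message['content'] }}\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>\n{% endif %}"
)


def write_template(path: Path) -> Path:
    """Write the Jinja chat template to disk so the server can load it."""
    path.write_text(CHAT_TEMPLATE, encoding="utf-8")
    return path


def build_serve_command(model_path: str, template_path: Path) -> list[str]:
    """Rebuild the issue's `vllm serve` command, adding --chat-template."""
    return [
        "vllm", "serve", model_path,
        "--trust-remote-code",
        "--served-model-name", "xlam-7b-fc",
        "--port", "4040",
        "--api-key", "agent-model",
        "--gpu-memory-utilization", "0.5",
        "--chat-template", str(template_path),
    ]


if __name__ == "__main__":
    # Hypothetical file name; any readable path works.
    template = write_template(Path("xlam_chat_template.jinja"))
    cmd = build_serve_command("./xLAM-7b-fc/xLAM-7b-fc-r.Q5_K_M.gguf", template)
    # Print a shell-ready command line with the same GPU selection as the issue.
    print("CUDA_VISIBLE_DEVICES=0 " + shlex.join(cmd))
```

If the non-quantized xLAM-7b-fc-r release ships its own chat template, using that template is likely preferable to a generic one, since a mismatched prompt format can degrade function-calling output.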