Please be advised that this is the Qwen(1.0) repository, and it appears you are using Qwen1.5 models.
Firstly, it's crucial to distinguish between base models (e.g., Qwen-7B) and chat models (e.g., Qwen-7B-Chat), as they are distinct model types. Base models only support text continuation, whereas chat models conduct conversation through a specific template; in the case of Qwen, the ChatML format is adopted for its chat models. Generally, for chat models, you need to apply this template to your input at some point during the process.
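For reference, a single-turn conversation rendered in ChatML looks like the following (the system prompt and user message here are only placeholders):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Give me a short introduction to large language models.<|im_end|>
<|im_start|>assistant
```

The trailing `<|im_start|>assistant` header is what cues the model to generate the assistant's reply.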
Secondly, there are notable differences between Qwen 1.0 and Qwen 1.5, and each employs a different method to apply the chat template.
Qwen 1.0 models relied on custom code, which necessitated `trust_remote_code=True`, and the `QWenTokenizer` in this version did not support the new `apply_chat_template` method. For chat models in Qwen 1.0, input token IDs were manually structured according to the template and then passed into either `model.generate` or `llm.generate`. This approach also ensured that control token injection was avoided.
As the `transformers` library and its ecosystem have evolved, a de facto standard has emerged within the community (including vLLM, FastChat, and others). This standard involves constructing chat model inputs as text first, then encoding the text into token IDs. The `transformers` library now includes an `apply_chat_template` method in its tokenizer classes to accommodate this practice. Therefore, `Qwen2Tokenizer` adheres to this trend, which explains why the line `tokenizer.apply_chat_template` appears frequently in current implementations.
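For example, with a Qwen1.5 chat model (a minimal sketch; the model name and messages are placeholders):

```python
from transformers import AutoTokenizer

# Qwen1.5 tokenizers ship the ChatML template, so no custom code is needed
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]

# Step 1: render the conversation into ChatML text, ending with the
# assistant header so the model continues as the assistant
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Step 2: encode the text into token IDs, ready for model.generate
model_inputs = tokenizer([text], return_tensors="pt")
```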
In all, as of the current date, you should use `apply_chat_template` for the latest chat models to enjoy the benefits of broad community support.
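In particular, when running offline inference through vLLM, `llm.generate` consumes plain text and applies no chat template itself, so render the ChatML prompt with `apply_chat_template` first. A minimal sketch, assuming the `Qwen/Qwen1.5-7B-Chat` checkpoint and placeholder sampling parameters:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")
llm = LLM(model="Qwen/Qwen1.5-7B-Chat")

messages = [{"role": "user", "content": "Tell me something about large language models."}]

# vLLM does not apply any chat template on its own: render the ChatML
# prompt first, then hand the plain text to llm.generate
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512))
print(outputs[0].outputs[0].text)
```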
Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior
In the documentation at https://qwen.readthedocs.io/en/latest/deployment/vllm.html and in the qwen-wrapper code, the input is wrapped with the ChatML template. However, in this [ModelScope article](https://developer.aliyun.com/article/1380325) and in all of the examples under https://github.com/vllm-project/vllm/tree/main/examples, inference is performed by calling vLLM's `llm.generate` directly, without applying the model's chat template.

How do I correctly use vLLM with Qwen models?
Expected Behavior
No response
Steps To Reproduce
No response
Environment

Anything else?
No response