InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[RFC] Refactor chat template and remove model name from engine config #1065

Open AllentDan opened 7 months ago

AllentDan commented 7 months ago

Motivation

Major features

How to use

For api_server, to use an extra template, the command could be:

lmdeploy serve api_server $MODEL_PATH --chat-template $JINJA

For APIs like pipeline, we are going to provide documentation showing how to add a chat template in Python or Jinja. The code will look like:

chat_template = PythonTemplate()  # or a function, a Jinja string, or a file path
input_ids = tokenizer.apply_chat_template(messages, chat_template=chat_template)
pipeline(input_ids)
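
For reference, Hugging Face tokenizers already accept a Jinja string through apply_chat_template, so the Jinja path of the proposal can be sketched as follows (the model name and template text are only illustrative, not part of this RFC):

from transformers import AutoTokenizer

# Illustrative ChatML-style template passed as a plain Jinja string.
chatml_template = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n{{ message['content'] }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}")

tokenizer = AutoTokenizer.from_pretrained('internlm/internlm2-chat-7b',
                                          trust_remote_code=True)
messages = [{'role': 'user', 'content': 'Hello!'}]

# chat_template= overrides whatever tokenizer_config.json ships with.
input_ids = tokenizer.apply_chat_template(messages,
                                          chat_template=chatml_template,
                                          add_generation_prompt=True)
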
lvhan028 commented 7 months ago
  1. It's better to deprecate model_name instead of removing it directly. PR #1022 is working on it for TurbomindEngineConfig
  2. For the pipeline case, are you suggesting changing prompts to input_ids?
AllentDan commented 7 months ago
> 1. It's better to deprecate model_name instead of removing it directly. PR Remove model name when loading hf model #1022 is working on it for TurbomindEngineConfig
> 2. For the pipeline case, are you suggesting changing prompts to input_ids?
  1. Removing model_name refers to PytorchEngineConfig then. model_name still exists in ChatTemplateConfig, which is not going to be deprecated.
  2. Not implemented yet. Maybe I will provide a function for users to set chat_template on the tokenizer in pipeline.
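
Since model_name stays in ChatTemplateConfig, a minimal sketch of selecting a built-in template explicitly through the pipeline API might look like this (assuming a chat_template_config argument along the lines of today's lmdeploy; the model path and template name are illustrative):

from lmdeploy import pipeline, ChatTemplateConfig

# Assumed usage: pick a built-in LMDeploy chat template by name instead of
# relying on fuzzy matching against the model path.
pipe = pipeline('internlm/internlm2-chat-7b',
                chat_template_config=ChatTemplateConfig(model_name='internlm2'))
print(pipe(['Hello, who are you?']))
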
irexyc commented 7 months ago

Should we consider making the chat template or model_name a required argument? The current matching strategy is quite prone to failing or picking the wrong template.

AllentDan commented 7 months ago

I prefer exact matching, like FastChat does.
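
Purely for illustration, exact matching amounts to a plain dictionary lookup that fails loudly instead of guessing; the registry and class below are hypothetical, not lmdeploy internals:

from dataclasses import dataclass

@dataclass
class ChatTemplate:
    # Hypothetical minimal template description: text placed around the user turn.
    prefix: str
    suffix: str

# Hypothetical registry keyed by the exact model_name.
CHAT_TEMPLATES = {
    'internlm2': ChatTemplate('<|im_start|>user\n', '<|im_end|>\n<|im_start|>assistant\n'),
    'llama2': ChatTemplate('[INST] ', ' [/INST]'),
}

def get_chat_template(model_name: str) -> ChatTemplate:
    # Exact lookup: an unknown name is a hard error rather than a fuzzy guess.
    try:
        return CHAT_TEMPLATES[model_name]
    except KeyError:
        raise ValueError(f'unknown model_name {model_name!r}; '
                         f'expected one of {sorted(CHAT_TEMPLATES)}') from None
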

AllentDan commented 7 months ago

The /v1/completions interface does not involve chat templates at all: it only accepts an input string, which is tokenized directly and then inferred. Therefore, only /v1/chat/completions and /v1/chat/interactive are discussed (a resolution sketch follows the list):

  1. No model_name, no Jinja template:
    • /v1/chat/completions first reads the Jinja template from tokenizer_config.json; if none is available, it uses the one defined by LMDeploy.
    • /v1/chat/interactive uses the one defined by LMDeploy.
  2. model_name provided, no Jinja template:
    • The chat template registered in LMDeploy under the specified model_name is used.
    • Both interfaces use the chat template defined by LMDeploy.
  3. No model_name provided, Jinja template provided:
    • Only /v1/chat/completions is guaranteed to use the Jinja template.
    • The behavior of /v1/chat/interactive is not guaranteed; it usually treats the request as plain text completion.
  4. model_name provided, Jinja template provided:
    • If model_name matches a registered chat template, all interfaces use the LMDeploy chat template.
    • If model_name does not match any chat template, an error is reported directly and the server fails to start.
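
A condensed sketch of the resolution order described above (function and variable names are illustrative, not lmdeploy's):

from typing import Optional

LMDEPLOY_BUILTIN = '<chat template LMDeploy defines for this model>'
REGISTERED = {'internlm2': '<internlm2 template>', 'llama2': '<llama2 template>'}

def resolve_chat_template(model_name: Optional[str],
                          jinja_template: Optional[str],
                          hf_template: Optional[str],
                          endpoint: str) -> Optional[str]:
    """Pick a template for /v1/chat/completions or /v1/chat/interactive."""
    if model_name is not None:
        # Cases 2 and 4: an explicit model_name wins; an unknown name aborts startup.
        if model_name not in REGISTERED:
            raise RuntimeError(f'no chat template registered for {model_name!r}')
        return REGISTERED[model_name]
    if jinja_template is not None:
        # Case 3: only /v1/chat/completions is guaranteed to honor the Jinja template;
        # /v1/chat/interactive degrades to plain text completion.
        return jinja_template if endpoint == '/v1/chat/completions' else None
    # Case 1: /v1/chat/completions prefers the template shipped in
    # tokenizer_config.json; everything else falls back to LMDeploy's own.
    if endpoint == '/v1/chat/completions' and hf_template is not None:
        return hf_template
    return LMDEPLOY_BUILTIN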