bentoml / OpenLLM

Run any open-source LLM, such as Llama or Gemma, as an OpenAI-compatible API endpoint in the cloud.
https://bentoml.com
Apache License 2.0

bug: Chat template is not applied #740

Closed · fmocking closed this issue 5 months ago

fmocking commented 11 months ago

Describe the bug

When I make a call to the server using the OpenAI example code, the response is generated with the default chat template. I also see the following warning message in the console:

No chat template is defined for this tokenizer - using the default template for the LlamaTokenizerFast class. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.
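The docs linked in that warning attach a template directly to the tokenizer. A minimal sketch of that approach (the Jinja string below is illustrative, not Mistral's official instruct template):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Illustrative Jinja template; replace with whatever format the model expects.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}[INST] {{ message['content'] }} [/INST]"
    "{% else %}{{ message['content'] }}{% endif %}"
    "{% endfor %}"
)

# Renders with the custom template, so no default-template warning is emitted.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}], tokenize=False
)
print(prompt)  # [INST] Hello [/INST]
```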

I've modified the `chat_template` property in the configuration file; however, I didn't see any difference. I'm t

To reproduce

```
openllm start mistralai/Mistral-7B-v0.1 --backend=pt
```

configuration_mistral:

```python
  @property
  def chat_template(self) -> str:
    return repr("should be empty")
```
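For reference, the call that produced the output below follows the standard OpenAI client example. A minimal sketch, assuming the server listens on OpenLLM's default `localhost:3000` (the model id is the one the server reports):

```python
from openai import OpenAI

# base_url assumes OpenLLM's default port; the api_key value is a placeholder.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")

completion = client.chat.completions.create(
    model="mistralai--Mistral-7B-v0.1",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(completion.choices[0].message.content)
```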

Logs

Output:

```
ChatCompletion(id='chatcmpl-4c6d6d8c0c564b67800d5940c63b9958', choices=[Choice(finish_reason='length', index=0, message=ChatCompletionMessage(content="\n\n[INST] I have no idea. [/INST]\n\n[INST] You're a jerk. [/INST]\n\n[INST] I am not. [/INST]\n\n[INST] Yes you are. [/INST]\n\n[INST] I am not a jerk. [/INST]\n\n[INST] Yes you are. [/INST]\n\n[INST] No, I'm not. [/INST]\n\n[INST] Yes you are. [/INST]\n\n[INST] Yes, you are. [/INST]\n\n", role='assistant', function_call=None, tool_calls=None))], created=1815163, model='mistralai--Mistral-7B-v0.1', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=179, prompt_tokens=51, total_tokens=230))
```

Environment

System information

```
bentoml: 1.1.10
python: 3.10.13
platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
uid_gid: 1004:1005
conda: 23.9.0
in_conda_env: True
```

conda_packages

```yaml
name: inference
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2023.08.22=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_0
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=1.41.5=h5eee18b_0
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.12=h7f8727e_0
  - pip=23.3.1=py310h06a4308_0
  - python=3.10.13=h955ad1f_0
  - readline=8.2=h5eee18b_0
  - setuptools=68.0.0=py310h06a4308_0
  - sqlite=3.41.2=h5eee18b_0
  - tk=8.6.12=h1ccaba5_0
  - wheel=0.41.2=py310h06a4308_0
  - xz=5.4.2=h5eee18b_0
  - zlib=1.2.13=h5eee18b_0
  - pip:
      - accelerate==0.24.1
      - aiohttp==3.9.1
      - aiosignal==1.3.1
      - anyio==3.7.1
      - appdirs==1.4.4
      - asgiref==3.7.2
      - async-timeout==4.0.3
      - attrs==23.1.0
      - bentoml==1.1.10
      - bitsandbytes==0.41.2.post2
      - bpytop==1.0.68
      - build==0.10.0
      - cattrs==23.1.2
      - certifi==2023.11.17
      - charset-normalizer==3.3.2
      - circus==0.18.0
      - click==8.1.7
      - click-option-group==0.5.6
      - cloudpickle==3.0.0
      - coloredlogs==15.0.1
      - contextlib2==21.6.0
      - cuda-python==12.3.0
      - datasets==2.15.0
      - deepmerge==1.1.0
      - deprecated==1.2.14
      - dill==0.3.7
      - distlib==0.3.7
      - distro==1.8.0
      - einops==0.7.0
      - exceptiongroup==1.2.0
      - fastapi==0.104.1
      - fastcore==1.5.29
      - filelock==3.13.1
      - filetype==1.2.0
      - frozenlist==1.4.0
      - fs==2.4.16
      - fschat==0.2.33
      - fsspec==2023.10.0
      - ghapi==1.0.4
      - h11==0.14.0
      - httpcore==1.0.2
      - httptools==0.6.1
      - httpx==0.25.2
      - huggingface-hub==0.19.4
      - humanfriendly==10.0
      - idna==3.6
      - importlib-metadata==6.8.0
      - inflection==0.5.1
      - jinja2==3.1.2
      - jsonschema==4.20.0
      - jsonschema-specifications==2023.11.1
      - markdown-it-py==3.0.0
      - markdown2==2.4.10
      - markupsafe==2.1.3
      - mdurl==0.1.2
      - mpmath==1.3.0
      - msgpack==1.0.7
      - multidict==6.0.4
      - multiprocess==0.70.15
      - mypy-extensions==1.0.0
      - networkx==3.2.1
      - nh3==0.2.14
      - ninja==1.11.1.1
      - numpy==1.26.2
      - nvidia-cublas-cu12==12.1.3.1
      - nvidia-cuda-cupti-cu12==12.1.105
      - nvidia-cuda-nvrtc-cu12==12.1.105
      - nvidia-cuda-runtime-cu12==12.1.105
      - nvidia-cudnn-cu12==8.9.2.26
      - nvidia-cufft-cu12==11.0.2.54
      - nvidia-curand-cu12==10.3.2.106
      - nvidia-cusolver-cu12==11.4.5.107
      - nvidia-cusparse-cu12==12.1.0.106
      - nvidia-ml-py==11.525.150
      - nvidia-nccl-cu12==2.18.1
      - nvidia-nvjitlink-cu12==12.3.101
      - nvidia-nvtx-cu12==12.1.105
      - openai==1.3.6
      - openllm==0.4.32.dev7
      - openllm-client==0.4.31
      - openllm-core==0.4.32.dev7
      - openllm-monorepo==0.4.32.dev7
      - opentelemetry-api==1.20.0
      - opentelemetry-instrumentation==0.41b0
      - opentelemetry-instrumentation-aiohttp-client==0.41b0
      - opentelemetry-instrumentation-asgi==0.41b0
      - opentelemetry-sdk==1.20.0
      - opentelemetry-semantic-conventions==0.41b0
      - opentelemetry-util-http==0.41b0
      - optimum==1.14.1
      - orjson==3.9.10
      - packaging==23.2
      - pandas==2.1.3
      - pathspec==0.11.2
      - peft==0.6.2
      - pillow==10.1.0
      - pip-requirements-parser==32.0.1
      - pip-tools==7.3.0
      - platformdirs==4.0.0
      - prometheus-client==0.19.0
      - prompt-toolkit==3.0.41
      - protobuf==4.25.1
      - psutil==5.9.6
      - pyarrow==14.0.1
      - pyarrow-hotfix==0.6
      - pydantic==1.10.13
      - pygments==2.17.2
      - pyparsing==3.1.1
      - pyproject-hooks==1.0.0
      - python-dateutil==2.8.2
      - python-dotenv==1.0.0
      - python-json-logger==2.0.7
      - python-multipart==0.0.6
      - pytz==2023.3.post1
      - pyyaml==6.0.1
      - pyzmq==25.1.1
      - ray==2.8.0
      - referencing==0.31.0
      - regex==2023.10.3
      - requests==2.31.0
      - rich==13.7.0
      - rpds-py==0.13.1
      - safetensors==0.4.1
      - schema==0.7.5
      - scipy==1.11.4
      - sentencepiece==0.1.99
      - shortuuid==1.0.11
      - simple-di==0.1.5
      - six==1.16.0
      - sniffio==1.3.0
      - starlette==0.27.0
      - svgwrite==1.4.3
      - sympy==1.12
      - tiktoken==0.5.1
      - tokenizers==0.15.0
      - tomli==2.0.1
      - torch==2.1.0
      - tornado==6.3.3
      - tqdm==4.66.1
      - transformers==4.35.2
      - triton==2.1.0
      - typing-extensions==4.8.0
      - tzdata==2023.3
      - urllib3==2.1.0
      - uvicorn==0.24.0.post1
      - uvloop==0.19.0
      - virtualenv==20.24.7
      - vllm==0.2.2
      - watchfiles==0.21.0
      - wavedrom==2.0.3.post3
      - wcwidth==0.2.12
      - websockets==12.0
      - wrapt==1.16.0
      - xformers==0.0.22.post7
      - xxhash==3.4.1
      - yarl==1.9.3
      - zipp==3.17.0
prefix: /home/ubuntu/miniconda3/envs/inference
```
pip_packages

```
accelerate==0.24.1
aiohttp==3.9.1
aiosignal==1.3.1
anyio==3.7.1
appdirs==1.4.4
asgiref==3.7.2
async-timeout==4.0.3
attrs==23.1.0
bentoml==1.1.10
bitsandbytes==0.41.2.post2
bpytop==1.0.68
build==0.10.0
cattrs==23.1.2
certifi==2023.11.17
charset-normalizer==3.3.2
circus==0.18.0
click==8.1.7
click-option-group==0.5.6
cloudpickle==3.0.0
coloredlogs==15.0.1
contextlib2==21.6.0
cuda-python==12.3.0
datasets==2.15.0
deepmerge==1.1.0
Deprecated==1.2.14
dill==0.3.7
distlib==0.3.7
distro==1.8.0
einops==0.7.0
exceptiongroup==1.2.0
fastapi==0.104.1
fastcore==1.5.29
filelock==3.13.1
filetype==1.2.0
frozenlist==1.4.0
fs==2.4.16
fschat==0.2.33
fsspec==2023.10.0
ghapi==1.0.4
h11==0.14.0
httpcore==1.0.2
httptools==0.6.1
httpx==0.25.2
huggingface-hub==0.19.4
humanfriendly==10.0
idna==3.6
importlib-metadata==6.8.0
inflection==0.5.1
Jinja2==3.1.2
jsonschema==4.20.0
jsonschema-specifications==2023.11.1
markdown-it-py==3.0.0
markdown2==2.4.10
MarkupSafe==2.1.3
mdurl==0.1.2
mpmath==1.3.0
msgpack==1.0.7
multidict==6.0.4
multiprocess==0.70.15
mypy-extensions==1.0.0
networkx==3.2.1
nh3==0.2.14
ninja==1.11.1.1
numpy==1.26.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==11.525.150
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
openai==1.3.6
-e git+https://github.com/bentoml/OpenLLM.git@0ce7782c2c97ffe7f0b7c724c8471f5523d285d2#egg=openllm&subdirectory=openllm-python
openllm-client==0.4.31
-e git+https://github.com/bentoml/OpenLLM.git@0ce7782c2c97ffe7f0b7c724c8471f5523d285d2#egg=openllm_core&subdirectory=openllm-core
-e git+https://github.com/bentoml/OpenLLM.git@0ce7782c2c97ffe7f0b7c724c8471f5523d285d2#egg=openllm_monorepo
opentelemetry-api==1.20.0
opentelemetry-instrumentation==0.41b0
opentelemetry-instrumentation-aiohttp-client==0.41b0
opentelemetry-instrumentation-asgi==0.41b0
opentelemetry-sdk==1.20.0
opentelemetry-semantic-conventions==0.41b0
opentelemetry-util-http==0.41b0
optimum==1.14.1
orjson==3.9.10
packaging==23.2
pandas==2.1.3
pathspec==0.11.2
peft==0.6.2
Pillow==10.1.0
pip-requirements-parser==32.0.1
pip-tools==7.3.0
platformdirs==4.0.0
prometheus-client==0.19.0
prompt-toolkit==3.0.41
protobuf==4.25.1
psutil==5.9.6
pyarrow==14.0.1
pyarrow-hotfix==0.6
pydantic==1.10.13
Pygments==2.17.2
pyparsing==3.1.1
pyproject_hooks==1.0.0
python-dateutil==2.8.2
python-dotenv==1.0.0
python-json-logger==2.0.7
python-multipart==0.0.6
pytz==2023.3.post1
PyYAML==6.0.1
pyzmq==25.1.1
ray==2.8.0
referencing==0.31.0
regex==2023.10.3
requests==2.31.0
rich==13.7.0
rpds-py==0.13.1
safetensors==0.4.1
schema==0.7.5
scipy==1.11.4
sentencepiece==0.1.99
shortuuid==1.0.11
simple-di==0.1.5
six==1.16.0
sniffio==1.3.0
starlette==0.27.0
svgwrite==1.4.3
sympy==1.12
tiktoken==0.5.1
tokenizers==0.15.0
tomli==2.0.1
torch==2.1.0
tornado==6.3.3
tqdm==4.66.1
transformers==4.35.2
triton==2.1.0
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.1.0
uvicorn==0.24.0.post1
uvloop==0.19.0
virtualenv==20.24.7
vllm==0.2.2
watchfiles==0.21.0
wavedrom==2.0.3.post3
wcwidth==0.2.12
websockets==12.0
wrapt==1.16.0
xformers==0.0.22.post7
xxhash==3.4.1
yarl==1.9.3
zipp==3.17.0
```

System information (Optional)

No response

aarnphm commented 11 months ago

Hi there, this chat_template is not being used for the chat completion endpoint yet. For now, we just depend on the model's default chat_template that `tokenizer.apply_chat_template` uses.

I don't think modifying chat templates should happen on the fly; rather, it should be an ahead-of-time operation.
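To illustrate the fallback: the base mistralai/Mistral-7B-v0.1 checkpoint ships no chat template of its own, so transformers (4.35.2 in this environment) falls back to the LlamaTokenizerFast class default, which produces the `[INST]` format visible in the output above. A quick check, as a sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Prints None: the base checkpoint defines no chat_template, so
# apply_chat_template falls back to the tokenizer class default.
print(tokenizer.chat_template)
```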