bentoml / OpenLLM

Run any open-source LLM, such as Llama or Mistral, as an OpenAI-compatible API endpoint in the cloud.
https://bentoml.com
Apache License 2.0
10.11k stars · 640 forks

bug: llama3.1:8B maximum context length #1083

Open GaetanBaert opened 2 months ago

GaetanBaert commented 2 months ago

Describe the bug

When I try to serve Llama 3.1 8B (4-bit) with OpenLLM, requests fail with "This model's maximum context length is 2048 tokens". According to https://huggingface.co/meta-llama/Meta-Llama-3.1-8B, the model's maximum context length is 128k tokens.

Why this difference?

To reproduce

openllm serve llama3.1:8b-4bit

In a Python console with the openai client installed:

from openai import OpenAI
openai_client = OpenAI(api_key="test", base_url="http://localhost:3000/v1")
openai_client.chat.completions.create(
            model='hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4',
            messages=[{"role":"user", "content": "This is a test"}],
            presence_penalty=0.,
            frequency_penalty=0.,
            stream=False,
            temperature=0.,
            max_tokens=2048
        )

Logs

On client side : 

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_utils\_utils.py", line 277, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\resources\chat\completions.py", line 590, in create
    return self._post(
           ^^^^^^^^^^^
  File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_base_client.py", line 1240, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_base_client.py", line 921, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_base_client.py", line 1020, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 2048 tokens. However, you requested 2087 tokens (39 in the messages, 2048 in the completion). Please reduce the length of the messages or completion.", 'type': 'BadRequestError', 'param': None, 'code': 400}
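The 400 error above does the arithmetic explicitly: the 39 prompt tokens plus the requested 2048 completion tokens exceed the 2048-token window. Until the limit is configurable, one client-side workaround is to cap `max_tokens` so that prompt plus completion fits. A minimal sketch (the helper name `clamp_max_tokens` is hypothetical, and it assumes you already know the prompt's token count, e.g. from a previous error message or a tokenizer):

```python
def clamp_max_tokens(requested: int, prompt_tokens: int, context_length: int = 2048) -> int:
    """Return the largest completion budget that still fits in the context window."""
    available = context_length - prompt_tokens
    if available <= 0:
        raise ValueError("prompt alone exceeds the model's context length")
    return min(requested, available)

# The failing request used 39 prompt tokens and asked for 2048 completion tokens:
print(clamp_max_tokens(2048, 39))  # 2009 -> 39 + 2009 == 2048, which the server accepts
```

Passing the clamped value as `max_tokens` in `openai_client.chat.completions.create(...)` avoids the `BadRequestError`, at the cost of shorter completions.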

Environment

System information

bentoml: 1.3.5
python: 3.11.8
platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.37
uid_gid: 1000:1000
conda: 24.3.0
in_conda_env: True

conda_packages
```yaml name: pytorch channels: - conda-forge - defaults dependencies: - _libgcc_mutex=0.1=conda_forge - _openmp_mutex=4.5=2_gnu - aom=3.9.1=hac33072_0 - bzip2=1.0.8=h5eee18b_5 - ca-certificates=2024.8.30=hbcca054_0 - cairo=1.18.0=hebfffa5_3 - dav1d=1.2.1=hd590300_0 - expat=2.6.3=h5888daf_0 - font-ttf-dejavu-sans-mono=2.37=hab24e00_0 - font-ttf-inconsolata=3.000=h77eed37_0 - font-ttf-source-code-pro=2.038=h77eed37_0 - font-ttf-ubuntu=0.83=h77eed37_2 - fontconfig=2.14.2=h14ed4e7_0 - fonts-conda-ecosystem=1=0 - fonts-conda-forge=1=0 - freetype=2.12.1=h267a509_2 - fribidi=1.0.10=h36c2ea0_0 - gettext=0.22.5=he02047a_3 - gettext-tools=0.22.5=he02047a_3 - gmp=6.3.0=hac33072_2 - gnutls=3.8.7=h32866dd_0 - graphite2=1.3.13=h59595ed_1003 - harfbuzz=9.0.0=hda332d3_1 - icu=75.1=he02047a_0 - lame=3.100=h166bdaf_1003 - ld_impl_linux-64=2.38=h1181459_1 - libabseil=20240116.2=cxx17_he02047a_1 - libasprintf=0.22.5=he8f35ee_3 - libasprintf-devel=0.22.5=he8f35ee_3 - libass=0.17.3=h1dc1e6a_0 - libdrm=2.4.123=hb9d3cd8_0 - libexpat=2.6.3=h5888daf_0 - libffi=3.4.4=h6a678d5_0 - libgcc=14.1.0=h77fa898_1 - libgcc-ng=14.1.0=h69a702a_1 - libgettextpo=0.22.5=he02047a_3 - libgettextpo-devel=0.22.5=he02047a_3 - libglib=2.80.3=h315aac3_2 - libgomp=14.1.0=h77fa898_1 - libhwloc=2.11.1=default_hecaa2ac_1000 - libiconv=1.17=hd590300_2 - libidn2=2.3.7=hd590300_0 - libnsl=2.0.1=hd590300_0 - libopenvino=2024.3.0=h2da1b83_0 - libopenvino-auto-batch-plugin=2024.3.0=hb045406_0 - libopenvino-auto-plugin=2024.3.0=hb045406_0 - libopenvino-hetero-plugin=2024.3.0=h5c03a75_0 - libopenvino-intel-cpu-plugin=2024.3.0=h2da1b83_0 - libopenvino-intel-gpu-plugin=2024.3.0=h2da1b83_0 - libopenvino-intel-npu-plugin=2024.3.0=h2da1b83_0 - libopenvino-ir-frontend=2024.3.0=h5c03a75_0 - libopenvino-onnx-frontend=2024.3.0=h07e8aee_0 - libopenvino-paddle-frontend=2024.3.0=h07e8aee_0 - libopenvino-pytorch-frontend=2024.3.0=he02047a_0 - libopenvino-tensorflow-frontend=2024.3.0=h39126c6_0 - 
libopenvino-tensorflow-lite-frontend=2024.3.0=he02047a_0 - libopus=1.3.1=h7f98852_1 - libpciaccess=0.18=hd590300_0 - libpng=1.6.44=hadc24fc_0 - libprotobuf=4.25.3=h08a7969_0 - libsqlite=3.45.2=h2797004_0 - libstdcxx=14.1.0=hc0a3c3a_1 - libstdcxx-ng=14.1.0=h4852527_1 - libtasn1=4.19.0=h166bdaf_0 - libunistring=0.9.10=h7f98852_0 - libuuid=2.38.1=h0b41bf4_0 - libva=2.22.0=hb711507_0 - libvpx=1.14.1=hac33072_0 - libxcb=1.16=hb9d3cd8_1 - libxcrypt=4.4.36=hd590300_1 - libxml2=2.12.7=he7c6b58_4 - libzlib=1.3.1=h4ab18f5_1 - ncurses=6.4=h6a678d5_0 - nettle=3.9.1=h7ab15ed_0 - ocl-icd=2.3.2=hd590300_1 - openh264=2.4.1=h59595ed_0 - openssl=3.3.2=hb9d3cd8_0 - p11-kit=0.24.1=hc5aa10d_0 - pcre2=10.44=hba22ea6_2 - pip=23.3.1=py311h06a4308_0 - pixman=0.43.2=h59595ed_0 - pthread-stubs=0.4=h36c2ea0_1001 - pugixml=1.14=h59595ed_0 - python=3.11.8=hab00c5b_0_cpython - readline=8.2=h5eee18b_0 - setuptools=68.2.2=py311h06a4308_0 - snappy=1.2.1=ha2e4443_0 - sqlite=3.45.2=h2c6b66d_0 - svt-av1=2.2.1=h5888daf_0 - tbb=2021.13.0=h84d6215_0 - tk=8.6.13=noxft_h4845f30_101 - wayland=1.23.1=h3e06ad9_0 - wayland-protocols=1.37=hd8ed1ab_0 - wheel=0.41.2=py311h06a4308_0 - x264=1!164.3095=h166bdaf_2 - x265=3.5=h924138e_3 - xorg-fixesproto=5.0=h7f98852_1002 - xorg-kbproto=1.0.7=h7f98852_1002 - xorg-libice=1.1.1=hd590300_0 - xorg-libsm=1.2.4=h7391055_0 - xorg-libx11=1.8.9=hb711507_1 - xorg-libxau=1.0.11=hd590300_0 - xorg-libxdmcp=1.1.3=h7f98852_0 - xorg-libxext=1.3.4=h0b41bf4_2 - xorg-libxfixes=5.0.3=h7f98852_1004 - xorg-libxrender=0.9.11=hd590300_0 - xorg-renderproto=0.11.1=h7f98852_1002 - xorg-xextproto=7.3.0=h0b41bf4_1003 - xorg-xproto=7.0.31=h7f98852_1007 - xz=5.4.6=h5eee18b_0 - zlib=1.3.1=h4ab18f5_1 - pip: - accelerate==0.34.1 - aiohappyeyeballs==2.4.0 - aiohttp==3.10.5 - aiosignal==1.3.1 - aiosqlite==0.20.0 - aniso8601==9.0.1 - annotated-types==0.7.0 - ansi2html==1.9.1 - anyio==4.4.0 - appdirs==1.4.4 - arrow==1.3.0 - asgiref==3.8.1 - attrs==24.2.0 - bentoml==1.3.5 - bitsandbytes==0.43.3 - 
blinker==1.7.0 - cattrs==23.1.2 - certifi==2024.2.2 - charset-normalizer==3.3.2 - circus==0.18.0 - click==8.1.7 - click-option-group==0.5.6 - cloudpickle==3.0.0 - ctranslate2==4.1.0 - cuda-python==12.6.0 - datasets==3.0.0 - deepmerge==2.0 - deprecated==1.2.14 - dill==0.3.8 - diskcache==5.6.3 - distro==1.9.0 - dulwich==0.22.1 - einops==0.8.0 - enum-compat==0.0.3 - fastapi==0.115.0 - fastcore==1.7.8 - ffmpeg==1.4 - filelock==3.13.4 - flask==3.0.3 - flask-restful==0.3.10 - frozenlist==1.4.1 - fs==2.4.16 - fsspec==2024.3.1 - gguf==0.9.1 - ghapi==1.0.6 - h11==0.14.0 - httpcore==1.0.5 - httptools==0.6.1 - httpx==0.27.2 - httpx-ws==0.6.0 - huggingface-hub==0.24.6 - idna==3.7 - importlib-metadata==6.11.0 - inflection==0.5.1 - inquirerpy==0.3.4 - interegular==0.3.3 - itsdangerous==2.2.0 - jinja2==3.1.2 - jiter==0.5.0 - jsonschema==4.23.0 - jsonschema-specifications==2023.12.1 - lark==1.2.2 - llvmlite==0.43.0 - lm-format-enforcer==0.10.6 - markdown-it-py==3.0.0 - markupsafe==2.1.3 - mdurl==0.1.2 - mistral-common==1.4.1 - mpmath==1.3.0 - msgpack==1.1.0 - msgspec==0.18.6 - multidict==6.1.0 - multiprocess==0.70.16 - mypy-extensions==1.0.0 - nest-asyncio==1.6.0 - networkx==3.2.1 - ninja==1.11.1.1 - numba==0.60.0 - numpy==1.26.4 - nvgpu==0.10.0 - nvidia-cublas-cu12==12.1.3.1 - nvidia-cuda-cupti-cu12==12.1.105 - nvidia-cuda-nvrtc-cu12==12.1.105 - nvidia-cuda-runtime-cu12==12.1.105 - nvidia-cudnn-cu12==9.1.0.70 - nvidia-cufft-cu12==11.0.2.54 - nvidia-curand-cu12==10.3.2.106 - nvidia-cusolver-cu12==11.4.5.107 - nvidia-cusparse-cu12==12.1.0.106 - nvidia-ml-py==11.525.150 - nvidia-nccl-cu12==2.20.5 - nvidia-nvjitlink-cu12==12.1.105 - nvidia-nvtx-cu12==12.1.105 - openai==1.41.0 - opencv-python-headless==4.10.0.84 - openllm==0.6.10 - openllm-client==0.5.7 - openllm-core==0.5.7 - opentelemetry-api==1.20.0 - opentelemetry-instrumentation==0.41b0 - opentelemetry-instrumentation-aiohttp-client==0.41b0 - opentelemetry-instrumentation-asgi==0.41b0 - opentelemetry-sdk==1.20.0 - 
opentelemetry-semantic-conventions==0.41b0 - opentelemetry-util-http==0.41b0 - orjson==3.10.7 - outlines==0.0.46 - packaging==24.0 - pandas==2.2.2 - partial-json-parser==0.2.1.1.post4 - pathlib==1.0.1 - pathspec==0.12.1 - pfzy==0.3.4 - pillow==10.4.0 - pip-requirements-parser==32.0.1 - prometheus-client==0.20.0 - prometheus-fastapi-instrumentator==7.0.0 - prompt-toolkit==3.0.36 - protobuf==5.28.1 - psutil==5.9.8 - py-cpuinfo==9.0.0 - pyairports==2.1.1 - pyaml==24.7.0 - pyarrow==17.0.0 - pycountry==24.6.1 - pydantic==2.9.2 - pydantic-core==2.23.4 - pygments==2.18.0 - pynvml==11.5.0 - pyparsing==3.1.4 - python-dateutil==2.9.0.post0 - python-dotenv==1.0.1 - python-json-logger==2.0.7 - python-multipart==0.0.9 - pytz==2024.1 - pyyaml==6.0.1 - pyzmq==26.2.0 - questionary==2.0.1 - ray==2.36.0 - referencing==0.35.1 - regex==2024.4.16 - requests==2.32.3 - rich==13.8.1 - rpds-py==0.20.0 - safetensors==0.4.3 - schema==0.7.7 - scipy==1.14.1 - sentencepiece==0.2.0 - shellingham==1.5.4 - simple-di==0.1.5 - six==1.16.0 - sniffio==1.3.1 - starlette==0.38.5 - sympy==1.12 - tabulate==0.9.0 - termcolor==2.4.0 - tiktoken==0.7.0 - tokenizers==0.19.1 - tomli-w==1.0.0 - torch==2.4.1 - torch-model-archiver==0.10.0 - torchaudio==2.4.1 - torchserve==0.11.1 - torchvision==0.19.0 - tornado==6.4.1 - tqdm==4.66.5 - transformers==4.44.2 - triton==3.0.0 - typer==0.12.5 - types-python-dateutil==2.9.0.20240316 - typing-extensions==4.11.0 - tzdata==2024.1 - urllib3==2.2.1 - uv==0.4.11 - uvicorn==0.30.6 - uvloop==0.20.0 - vllm==0.6.1.post2 - vllm-flash-attn==2.6.1 - watchfiles==0.24.0 - wcwidth==0.2.13 - websockets==13.0.1 - werkzeug==3.0.2 - wrapt==1.16.0 - wsproto==1.2.0 - xformers==0.0.27.post2 - xxhash==3.5.0 - yarl==1.11.1 - zipp==3.20.2 prefix: /home/ubuntu/miniconda3/envs/pytorch ```
pip_packages
``` accelerate==0.34.1 aiohappyeyeballs==2.4.0 aiohttp==3.10.5 aiosignal==1.3.1 aiosqlite==0.20.0 aniso8601==9.0.1 annotated-types==0.7.0 ansi2html==1.9.1 anyio==4.4.0 appdirs==1.4.4 arrow==1.3.0 asgiref==3.8.1 attrs==24.2.0 bentoml==1.3.5 bitsandbytes==0.43.3 blinker==1.7.0 cattrs==23.1.2 certifi==2024.2.2 charset-normalizer==3.3.2 circus==0.18.0 click==8.1.7 click-option-group==0.5.6 cloudpickle==3.0.0 ctranslate2==4.1.0 cuda-python==12.6.0 datasets==3.0.0 deepmerge==2.0 deprecated==1.2.14 dill==0.3.8 diskcache==5.6.3 distro==1.9.0 dulwich==0.22.1 einops==0.8.0 enum-compat==0.0.3 fastapi==0.115.0 fastcore==1.7.8 ffmpeg==1.4 filelock==3.13.4 flask==3.0.3 flask-restful==0.3.10 frozenlist==1.4.1 fs==2.4.16 fsspec==2024.3.1 gguf==0.9.1 ghapi==1.0.6 h11==0.14.0 httpcore==1.0.5 httptools==0.6.1 httpx==0.27.2 httpx-ws==0.6.0 huggingface-hub==0.24.6 idna==3.7 importlib-metadata==6.11.0 inflection==0.5.1 inquirerpy==0.3.4 interegular==0.3.3 itsdangerous==2.2.0 jinja2==3.1.2 jiter==0.5.0 jsonschema==4.23.0 jsonschema-specifications==2023.12.1 lark==1.2.2 llvmlite==0.43.0 lm-format-enforcer==0.10.6 markdown-it-py==3.0.0 markupsafe==2.1.3 mdurl==0.1.2 mistral-common==1.4.1 mpmath==1.3.0 msgpack==1.1.0 msgspec==0.18.6 multidict==6.1.0 multiprocess==0.70.16 mypy-extensions==1.0.0 nest-asyncio==1.6.0 networkx==3.2.1 ninja==1.11.1.1 numba==0.60.0 numpy==1.26.4 nvgpu==0.10.0 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==9.1.0.70 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-ml-py==11.525.150 nvidia-nccl-cu12==2.20.5 nvidia-nvjitlink-cu12==12.1.105 nvidia-nvtx-cu12==12.1.105 openai==1.41.0 opencv-python-headless==4.10.0.84 openllm==0.6.10 openllm-client==0.5.7 openllm-core==0.5.7 opentelemetry-api==1.20.0 opentelemetry-instrumentation==0.41b0 opentelemetry-instrumentation-aiohttp-client==0.41b0 
opentelemetry-instrumentation-asgi==0.41b0 opentelemetry-sdk==1.20.0 opentelemetry-semantic-conventions==0.41b0 opentelemetry-util-http==0.41b0 orjson==3.10.7 outlines==0.0.46 packaging==24.0 pandas==2.2.2 partial-json-parser==0.2.1.1.post4 pathlib==1.0.1 pathspec==0.12.1 pfzy==0.3.4 pillow==10.4.0 pip==23.3.1 pip-requirements-parser==32.0.1 prometheus-client==0.20.0 prometheus-fastapi-instrumentator==7.0.0 prompt-toolkit==3.0.36 protobuf==5.28.1 psutil==5.9.8 py-cpuinfo==9.0.0 pyairports==2.1.1 pyaml==24.7.0 pyarrow==17.0.0 pycountry==24.6.1 pydantic==2.9.2 pydantic-core==2.23.4 pygments==2.18.0 pynvml==11.5.0 pyparsing==3.1.4 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 python-json-logger==2.0.7 python-multipart==0.0.9 pytz==2024.1 pyyaml==6.0.1 pyzmq==26.2.0 questionary==2.0.1 ray==2.36.0 referencing==0.35.1 regex==2024.4.16 requests==2.32.3 rich==13.8.1 rpds-py==0.20.0 safetensors==0.4.3 schema==0.7.7 scipy==1.14.1 sentencepiece==0.2.0 setuptools==68.2.2 shellingham==1.5.4 simple-di==0.1.5 six==1.16.0 sniffio==1.3.1 starlette==0.38.5 sympy==1.12 tabulate==0.9.0 termcolor==2.4.0 tiktoken==0.7.0 tokenizers==0.19.1 tomli-w==1.0.0 torch==2.4.1 torch-model-archiver==0.10.0 torchaudio==2.4.1 torchserve==0.11.1 torchvision==0.19.0 tornado==6.4.1 tqdm==4.66.5 transformers==4.44.2 triton==3.0.0 typer==0.12.5 types-python-dateutil==2.9.0.20240316 typing-extensions==4.11.0 tzdata==2024.1 urllib3==2.2.1 uv==0.4.11 uvicorn==0.30.6 uvloop==0.20.0 vllm==0.6.1.post2 vllm-flash-attn==2.6.1 watchfiles==0.24.0 wcwidth==0.2.13 websockets==13.0.1 werkzeug==3.0.2 wheel==0.41.2 wrapt==1.16.0 wsproto==1.2.0 xformers==0.0.27.post2 xxhash==3.5.0 yarl==1.11.1 zipp==3.20.2 ```

System information (Optional)

No response

bojiang commented 1 month ago

Hi. The default max tokens is set low to minimize GPU memory usage. We are working on a parameterization feature, so with the next minor version you should be able to run openllm serve llama3.1:8B --arg vllm.engine.max_tokens=131072.
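Until that parameterization feature lands, a possible workaround (not part of OpenLLM itself, and untested here) is to serve the same AWQ checkpoint with vLLM's own OpenAI-compatible server, which exposes the context window via `--max-model-len`. A sketch, assuming vLLM 0.6.x and enough GPU memory for the larger KV cache:

```shell
# Hypothetical workaround: bypass OpenLLM and serve the checkpoint with vLLM
# directly, raising the context window. Long contexts need proportionally
# more GPU memory for the KV cache, so start smaller (e.g. 16384) if it OOMs.
vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
    --quantization awq \
    --max-model-len 131072 \
    --port 3000
```

The OpenAI client snippet from the reproduction section should then work unchanged, since vLLM serves the same /v1 API on the same port.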

GaetanBaert commented 3 weeks ago

Hello,

Do you have any ETA for this? I really need to go beyond 2048 tokens, since I want to do some RAG.