Open GaetanBaert opened 2 months ago
Hi. The default max tokens is set low to keep GPU memory usage minimal. We are working on a parameterization feature, so with the next minor version you might be able to run:
openllm serve llama3.1:8B --arg vllm.engine.max_tokens=131072
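To give a rough sense of why a small default context saves GPU memory, here is a back-of-the-envelope sketch of the KV cache size for a single sequence. It assumes the published Llama 3.1 8B attention shape (32 layers, 8 key/value heads, head dimension 128) and a 16-bit KV cache. It ignores model weights, activations, and how vLLM actually allocates cache blocks, so treat the numbers as an order-of-magnitude estimate and not as what OpenLLM computes internally. The names `AttentionShape` and `kv_cache_bytes` are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Attention dimensions that determine KV cache size (illustrative helper)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


# Assumption: values taken from the public Llama 3.1 8B config
# (num_hidden_layers=32, num_key_value_heads=8, hidden_size 4096 / 32 heads = 128).
LLAMA_3_1_8B = AttentionShape(num_layers=32, num_kv_heads=8, head_dim=128)


def kv_cache_bytes(shape: AttentionShape, context_tokens: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache bytes for one sequence of `context_tokens` tokens.

    Assumes a 16-bit cache by default (bytes_per_value=2); 4-bit weight
    quantization does not by itself shrink the KV cache.
    """
    # Factor 2: each layer stores one key tensor and one value tensor per token.
    per_token = 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * bytes_per_value
    return per_token * context_tokens


if __name__ == "__main__":
    for tokens in (2048, 131072):
        gib = kv_cache_bytes(LLAMA_3_1_8B, tokens) / 2**30
        print(f"{tokens:>7} tokens -> {gib:.2f} GiB of KV cache per sequence")
```

Under these assumptions a 2048-token context needs about 0.25 GiB of KV cache per sequence, while the full 131072-token context needs about 16 GiB, on top of the model weights.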
Hello,
Do you have an ETA for this? I really need to go beyond 2048 tokens, since I want to do some RAG.
Describe the bug
When I try to serve Llama 3.1 8B (4-bit) with openllm, it says "This model's maximum context length is 2048 tokens". However, https://huggingface.co/meta-llama/Meta-Llama-3.1-8B states that the maximum context length is 128k tokens.
Why is there a difference?
To reproduce
openllm serve llama3.1:8b-4bit
In a Python console with the openai client installed:
Logs
Environment
System information
bentoml: 1.3.5
python: 3.11.8
platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.37
uid_gid: 1000:1000
conda: 24.3.0
in_conda_env: True
conda_packages
```yaml
name: pytorch
channels:
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - aom=3.9.1=hac33072_0
  - bzip2=1.0.8=h5eee18b_5
  - ca-certificates=2024.8.30=hbcca054_0
  - cairo=1.18.0=hebfffa5_3
  - dav1d=1.2.1=hd590300_0
  - expat=2.6.3=h5888daf_0
  - font-ttf-dejavu-sans-mono=2.37=hab24e00_0
  - font-ttf-inconsolata=3.000=h77eed37_0
  - font-ttf-source-code-pro=2.038=h77eed37_0
  - font-ttf-ubuntu=0.83=h77eed37_2
  - fontconfig=2.14.2=h14ed4e7_0
  - fonts-conda-ecosystem=1=0
  - fonts-conda-forge=1=0
  - freetype=2.12.1=h267a509_2
  - fribidi=1.0.10=h36c2ea0_0
  - gettext=0.22.5=he02047a_3
  - gettext-tools=0.22.5=he02047a_3
  - gmp=6.3.0=hac33072_2
  - gnutls=3.8.7=h32866dd_0
  - graphite2=1.3.13=h59595ed_1003
  - harfbuzz=9.0.0=hda332d3_1
  - icu=75.1=he02047a_0
  - lame=3.100=h166bdaf_1003
  - ld_impl_linux-64=2.38=h1181459_1
  - libabseil=20240116.2=cxx17_he02047a_1
  - libasprintf=0.22.5=he8f35ee_3
  - libasprintf-devel=0.22.5=he8f35ee_3
  - libass=0.17.3=h1dc1e6a_0
  - libdrm=2.4.123=hb9d3cd8_0
  - libexpat=2.6.3=h5888daf_0
  - libffi=3.4.4=h6a678d5_0
  - libgcc=14.1.0=h77fa898_1
  - libgcc-ng=14.1.0=h69a702a_1
  - libgettextpo=0.22.5=he02047a_3
  - libgettextpo-devel=0.22.5=he02047a_3
  - libglib=2.80.3=h315aac3_2
  - libgomp=14.1.0=h77fa898_1
  - libhwloc=2.11.1=default_hecaa2ac_1000
  - libiconv=1.17=hd590300_2
  - libidn2=2.3.7=hd590300_0
  - libnsl=2.0.1=hd590300_0
  - libopenvino=2024.3.0=h2da1b83_0
  - libopenvino-auto-batch-plugin=2024.3.0=hb045406_0
  - libopenvino-auto-plugin=2024.3.0=hb045406_0
  - libopenvino-hetero-plugin=2024.3.0=h5c03a75_0
  - libopenvino-intel-cpu-plugin=2024.3.0=h2da1b83_0
  - libopenvino-intel-gpu-plugin=2024.3.0=h2da1b83_0
  - libopenvino-intel-npu-plugin=2024.3.0=h2da1b83_0
  - libopenvino-ir-frontend=2024.3.0=h5c03a75_0
  - libopenvino-onnx-frontend=2024.3.0=h07e8aee_0
  - libopenvino-paddle-frontend=2024.3.0=h07e8aee_0
  - libopenvino-pytorch-frontend=2024.3.0=he02047a_0
  - libopenvino-tensorflow-frontend=2024.3.0=h39126c6_0
  - libopenvino-tensorflow-lite-frontend=2024.3.0=he02047a_0
  - libopus=1.3.1=h7f98852_1
  - libpciaccess=0.18=hd590300_0
  - libpng=1.6.44=hadc24fc_0
  - libprotobuf=4.25.3=h08a7969_0
  - libsqlite=3.45.2=h2797004_0
  - libstdcxx=14.1.0=hc0a3c3a_1
  - libstdcxx-ng=14.1.0=h4852527_1
  - libtasn1=4.19.0=h166bdaf_0
  - libunistring=0.9.10=h7f98852_0
  - libuuid=2.38.1=h0b41bf4_0
  - libva=2.22.0=hb711507_0
  - libvpx=1.14.1=hac33072_0
  - libxcb=1.16=hb9d3cd8_1
  - libxcrypt=4.4.36=hd590300_1
  - libxml2=2.12.7=he7c6b58_4
  - libzlib=1.3.1=h4ab18f5_1
  - ncurses=6.4=h6a678d5_0
  - nettle=3.9.1=h7ab15ed_0
  - ocl-icd=2.3.2=hd590300_1
  - openh264=2.4.1=h59595ed_0
  - openssl=3.3.2=hb9d3cd8_0
  - p11-kit=0.24.1=hc5aa10d_0
  - pcre2=10.44=hba22ea6_2
  - pip=23.3.1=py311h06a4308_0
  - pixman=0.43.2=h59595ed_0
  - pthread-stubs=0.4=h36c2ea0_1001
  - pugixml=1.14=h59595ed_0
  - python=3.11.8=hab00c5b_0_cpython
  - readline=8.2=h5eee18b_0
  - setuptools=68.2.2=py311h06a4308_0
  - snappy=1.2.1=ha2e4443_0
  - sqlite=3.45.2=h2c6b66d_0
  - svt-av1=2.2.1=h5888daf_0
  - tbb=2021.13.0=h84d6215_0
  - tk=8.6.13=noxft_h4845f30_101
  - wayland=1.23.1=h3e06ad9_0
  - wayland-protocols=1.37=hd8ed1ab_0
  - wheel=0.41.2=py311h06a4308_0
  - x264=1!164.3095=h166bdaf_2
  - x265=3.5=h924138e_3
  - xorg-fixesproto=5.0=h7f98852_1002
  - xorg-kbproto=1.0.7=h7f98852_1002
  - xorg-libice=1.1.1=hd590300_0
  - xorg-libsm=1.2.4=h7391055_0
  - xorg-libx11=1.8.9=hb711507_1
  - xorg-libxau=1.0.11=hd590300_0
  - xorg-libxdmcp=1.1.3=h7f98852_0
  - xorg-libxext=1.3.4=h0b41bf4_2
  - xorg-libxfixes=5.0.3=h7f98852_1004
  - xorg-libxrender=0.9.11=hd590300_0
  - xorg-renderproto=0.11.1=h7f98852_1002
  - xorg-xextproto=7.3.0=h0b41bf4_1003
  - xorg-xproto=7.0.31=h7f98852_1007
  - xz=5.4.6=h5eee18b_0
  - zlib=1.3.1=h4ab18f5_1
  - pip:
    - accelerate==0.34.1
    - aiohappyeyeballs==2.4.0
    - aiohttp==3.10.5
    - aiosignal==1.3.1
    - aiosqlite==0.20.0
    - aniso8601==9.0.1
    - annotated-types==0.7.0
    - ansi2html==1.9.1
    - anyio==4.4.0
    - appdirs==1.4.4
    - arrow==1.3.0
    - asgiref==3.8.1
    - attrs==24.2.0
    - bentoml==1.3.5
    - bitsandbytes==0.43.3
    - blinker==1.7.0
    - cattrs==23.1.2
    - certifi==2024.2.2
    - charset-normalizer==3.3.2
    - circus==0.18.0
    - click==8.1.7
    - click-option-group==0.5.6
    - cloudpickle==3.0.0
    - ctranslate2==4.1.0
    - cuda-python==12.6.0
    - datasets==3.0.0
    - deepmerge==2.0
    - deprecated==1.2.14
    - dill==0.3.8
    - diskcache==5.6.3
    - distro==1.9.0
    - dulwich==0.22.1
    - einops==0.8.0
    - enum-compat==0.0.3
    - fastapi==0.115.0
    - fastcore==1.7.8
    - ffmpeg==1.4
    - filelock==3.13.4
    - flask==3.0.3
    - flask-restful==0.3.10
    - frozenlist==1.4.1
    - fs==2.4.16
    - fsspec==2024.3.1
    - gguf==0.9.1
    - ghapi==1.0.6
    - h11==0.14.0
    - httpcore==1.0.5
    - httptools==0.6.1
    - httpx==0.27.2
    - httpx-ws==0.6.0
    - huggingface-hub==0.24.6
    - idna==3.7
    - importlib-metadata==6.11.0
    - inflection==0.5.1
    - inquirerpy==0.3.4
    - interegular==0.3.3
    - itsdangerous==2.2.0
    - jinja2==3.1.2
    - jiter==0.5.0
    - jsonschema==4.23.0
    - jsonschema-specifications==2023.12.1
    - lark==1.2.2
    - llvmlite==0.43.0
    - lm-format-enforcer==0.10.6
    - markdown-it-py==3.0.0
    - markupsafe==2.1.3
    - mdurl==0.1.2
    - mistral-common==1.4.1
    - mpmath==1.3.0
    - msgpack==1.1.0
    - msgspec==0.18.6
    - multidict==6.1.0
    - multiprocess==0.70.16
    - mypy-extensions==1.0.0
    - nest-asyncio==1.6.0
    - networkx==3.2.1
    - ninja==1.11.1.1
    - numba==0.60.0
    - numpy==1.26.4
    - nvgpu==0.10.0
    - nvidia-cublas-cu12==12.1.3.1
    - nvidia-cuda-cupti-cu12==12.1.105
    - nvidia-cuda-nvrtc-cu12==12.1.105
    - nvidia-cuda-runtime-cu12==12.1.105
    - nvidia-cudnn-cu12==9.1.0.70
    - nvidia-cufft-cu12==11.0.2.54
    - nvidia-curand-cu12==10.3.2.106
    - nvidia-cusolver-cu12==11.4.5.107
    - nvidia-cusparse-cu12==12.1.0.106
    - nvidia-ml-py==11.525.150
    - nvidia-nccl-cu12==2.20.5
    - nvidia-nvjitlink-cu12==12.1.105
    - nvidia-nvtx-cu12==12.1.105
    - openai==1.41.0
    - opencv-python-headless==4.10.0.84
    - openllm==0.6.10
    - openllm-client==0.5.7
    - openllm-core==0.5.7
    - opentelemetry-api==1.20.0
    - opentelemetry-instrumentation==0.41b0
    - opentelemetry-instrumentation-aiohttp-client==0.41b0
    - opentelemetry-instrumentation-asgi==0.41b0
    - opentelemetry-sdk==1.20.0
    - opentelemetry-semantic-conventions==0.41b0
    - opentelemetry-util-http==0.41b0
    - orjson==3.10.7
    - outlines==0.0.46
    - packaging==24.0
    - pandas==2.2.2
    - partial-json-parser==0.2.1.1.post4
    - pathlib==1.0.1
    - pathspec==0.12.1
    - pfzy==0.3.4
    - pillow==10.4.0
    - pip-requirements-parser==32.0.1
    - prometheus-client==0.20.0
    - prometheus-fastapi-instrumentator==7.0.0
    - prompt-toolkit==3.0.36
    - protobuf==5.28.1
    - psutil==5.9.8
    - py-cpuinfo==9.0.0
    - pyairports==2.1.1
    - pyaml==24.7.0
    - pyarrow==17.0.0
    - pycountry==24.6.1
    - pydantic==2.9.2
    - pydantic-core==2.23.4
    - pygments==2.18.0
    - pynvml==11.5.0
    - pyparsing==3.1.4
    - python-dateutil==2.9.0.post0
    - python-dotenv==1.0.1
    - python-json-logger==2.0.7
    - python-multipart==0.0.9
    - pytz==2024.1
    - pyyaml==6.0.1
    - pyzmq==26.2.0
    - questionary==2.0.1
    - ray==2.36.0
    - referencing==0.35.1
    - regex==2024.4.16
    - requests==2.32.3
    - rich==13.8.1
    - rpds-py==0.20.0
    - safetensors==0.4.3
    - schema==0.7.7
    - scipy==1.14.1
    - sentencepiece==0.2.0
    - shellingham==1.5.4
    - simple-di==0.1.5
    - six==1.16.0
    - sniffio==1.3.1
    - starlette==0.38.5
    - sympy==1.12
    - tabulate==0.9.0
    - termcolor==2.4.0
    - tiktoken==0.7.0
    - tokenizers==0.19.1
    - tomli-w==1.0.0
    - torch==2.4.1
    - torch-model-archiver==0.10.0
    - torchaudio==2.4.1
    - torchserve==0.11.1
    - torchvision==0.19.0
    - tornado==6.4.1
    - tqdm==4.66.5
    - transformers==4.44.2
    - triton==3.0.0
    - typer==0.12.5
    - types-python-dateutil==2.9.0.20240316
    - typing-extensions==4.11.0
    - tzdata==2024.1
    - urllib3==2.2.1
    - uv==0.4.11
    - uvicorn==0.30.6
    - uvloop==0.20.0
    - vllm==0.6.1.post2
    - vllm-flash-attn==2.6.1
    - watchfiles==0.24.0
    - wcwidth==0.2.13
    - websockets==13.0.1
    - werkzeug==3.0.2
    - wrapt==1.16.0
    - wsproto==1.2.0
    - xformers==0.0.27.post2
    - xxhash==3.5.0
    - yarl==1.11.1
    - zipp==3.20.2
prefix: /home/ubuntu/miniconda3/envs/pytorch
```
pip_packages
```
accelerate==0.34.1
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
aiosqlite==0.20.0
aniso8601==9.0.1
annotated-types==0.7.0
ansi2html==1.9.1
anyio==4.4.0
appdirs==1.4.4
arrow==1.3.0
asgiref==3.8.1
attrs==24.2.0
bentoml==1.3.5
bitsandbytes==0.43.3
blinker==1.7.0
cattrs==23.1.2
certifi==2024.2.2
charset-normalizer==3.3.2
circus==0.18.0
click==8.1.7
click-option-group==0.5.6
cloudpickle==3.0.0
ctranslate2==4.1.0
cuda-python==12.6.0
datasets==3.0.0
deepmerge==2.0
deprecated==1.2.14
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
dulwich==0.22.1
einops==0.8.0
enum-compat==0.0.3
fastapi==0.115.0
fastcore==1.7.8
ffmpeg==1.4
filelock==3.13.4
flask==3.0.3
flask-restful==0.3.10
frozenlist==1.4.1
fs==2.4.16
fsspec==2024.3.1
gguf==0.9.1
ghapi==1.0.6
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.2
httpx-ws==0.6.0
huggingface-hub==0.24.6
idna==3.7
importlib-metadata==6.11.0
inflection==0.5.1
inquirerpy==0.3.4
interegular==0.3.3
itsdangerous==2.2.0
jinja2==3.1.2
jiter==0.5.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
lark==1.2.2
llvmlite==0.43.0
lm-format-enforcer==0.10.6
markdown-it-py==3.0.0
markupsafe==2.1.3
mdurl==0.1.2
mistral-common==1.4.1
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.18.6
multidict==6.1.0
multiprocess==0.70.16
mypy-extensions==1.0.0
nest-asyncio==1.6.0
networkx==3.2.1
ninja==1.11.1.1
numba==0.60.0
numpy==1.26.4
nvgpu==0.10.0
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==11.525.150
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.1.105
nvidia-nvtx-cu12==12.1.105
openai==1.41.0
opencv-python-headless==4.10.0.84
openllm==0.6.10
openllm-client==0.5.7
openllm-core==0.5.7
opentelemetry-api==1.20.0
opentelemetry-instrumentation==0.41b0
opentelemetry-instrumentation-aiohttp-client==0.41b0
opentelemetry-instrumentation-asgi==0.41b0
opentelemetry-sdk==1.20.0
opentelemetry-semantic-conventions==0.41b0
opentelemetry-util-http==0.41b0
orjson==3.10.7
outlines==0.0.46
packaging==24.0
pandas==2.2.2
partial-json-parser==0.2.1.1.post4
pathlib==1.0.1
pathspec==0.12.1
pfzy==0.3.4
pillow==10.4.0
pip==23.3.1
pip-requirements-parser==32.0.1
prometheus-client==0.20.0
prometheus-fastapi-instrumentator==7.0.0
prompt-toolkit==3.0.36
protobuf==5.28.1
psutil==5.9.8
py-cpuinfo==9.0.0
pyairports==2.1.1
pyaml==24.7.0
pyarrow==17.0.0
pycountry==24.6.1
pydantic==2.9.2
pydantic-core==2.23.4
pygments==2.18.0
pynvml==11.5.0
pyparsing==3.1.4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-json-logger==2.0.7
python-multipart==0.0.9
pytz==2024.1
pyyaml==6.0.1
pyzmq==26.2.0
questionary==2.0.1
ray==2.36.0
referencing==0.35.1
regex==2024.4.16
requests==2.32.3
rich==13.8.1
rpds-py==0.20.0
safetensors==0.4.3
schema==0.7.7
scipy==1.14.1
sentencepiece==0.2.0
setuptools==68.2.2
shellingham==1.5.4
simple-di==0.1.5
six==1.16.0
sniffio==1.3.1
starlette==0.38.5
sympy==1.12
tabulate==0.9.0
termcolor==2.4.0
tiktoken==0.7.0
tokenizers==0.19.1
tomli-w==1.0.0
torch==2.4.1
torch-model-archiver==0.10.0
torchaudio==2.4.1
torchserve==0.11.1
torchvision==0.19.0
tornado==6.4.1
tqdm==4.66.5
transformers==4.44.2
triton==3.0.0
typer==0.12.5
types-python-dateutil==2.9.0.20240316
typing-extensions==4.11.0
tzdata==2024.1
urllib3==2.2.1
uv==0.4.11
uvicorn==0.30.6
uvloop==0.20.0
vllm==0.6.1.post2
vllm-flash-attn==2.6.1
watchfiles==0.24.0
wcwidth==0.2.13
websockets==13.0.1
werkzeug==3.0.2
wheel==0.41.2
wrapt==1.16.0
wsproto==1.2.0
xformers==0.0.27.post2
xxhash==3.5.0
yarl==1.11.1
zipp==3.20.2
```
transformers version: 4.44.2
System information (Optional)
No response