bentoml / OpenLLM

Run any open-source LLM, such as Llama 3.1 or Gemma, as an OpenAI-compatible API endpoint in the cloud.
https://bentoml.com
Apache License 2.0

bug: Cannot Run an OpenLLM server regardless of where I try to get it from or what model I use #1009

Closed · Said-Ikki closed this 1 month ago

Said-Ikki commented 1 month ago

Describe the bug

I recently tried using OpenLLM to connect to Llama, and it gave me some BentoML config errors. I'm not sure if it's because I don't have a GPU, but I didn't see any evidence online that that's the case.

To reproduce

  1. Run `openllm start microsoft/Phi-3-mini-4k-instruct --trust-remote-code` (but it could be any of the other models)
  2. Observe the following error:

```
/home/ssikki/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
Serialisation format is not specified. Defaulting to 'safetensors'. Your model might not work with this format. Make sure to explicitly specify the serialisation format.
Traceback (most recent call last):
  File "/home/ssikki/.local/bin/openllm", line 8, in <module>
    sys.exit(cli())
  File "/home/ssikki/.local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ssikki/.local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ssikki/.local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ssikki/.local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ssikki/.local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ssikki/.local/lib/python3.10/site-packages/_openllm_tiny/_entrypoint.py", line 284, in start_command
    load('.', working_dir=working_dir).inject_config()
  File "/home/ssikki/.local/lib/python3.10/site-packages/_bentoml_sdk/service/factory.py", line 277, in inject_config
    load_config(override_defaults=override_defaults, use_version=2)
  File "/home/ssikki/.local/lib/python3.10/site-packages/bentoml/_internal/configuration/__init__.py", line 191, in load_config
    BentoMLConfiguration(
  File "/home/ssikki/.local/lib/python3.10/site-packages/bentoml/_internal/configuration/containers.py", line 140, in __init__
    raise BentoMLConfigException(
bentoml.exceptions.BentoMLConfigException: Invalid configuration file was given:
Key 'services' error:
Key 'llm-phi-service' error:
Key 'resources' error:
Or({Optional('cpu'): <class 'str'>, Optional('memory'): <class 'str'>, Optional('gpu'): And(<class 'numbers.Real'>, <function ensure_larger_than.<locals>.v at 0x7f74122f8550>), Optional('gpu_type'): <class 'str'>, Optional('tpu_type'): <class 'str'>}, None) did not validate {'gpu': 0}
Key 'gpu' error:
v(0) should evaluate to True
None does not match {'gpu': 0}
```
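For readers hitting the same validation error: OpenLLM detected no usable GPU and generated `{'gpu': 0}`, which fails the schema's greater-than-zero check. A hedged sketch (not a confirmed fix) of one way to supply an explicit value through `BENTOML_CONFIG` on a machine that does have a GPU; the key names below come straight from the error message, and the value is illustrative:

```bash
# Illustration only: write a minimal BentoML config that satisfies the
# schema shown in the error (resources.gpu must be a positive number),
# then point BENTOML_CONFIG at it before starting the server.
cat > bentoml_config.yaml <<'EOF'
services:
  llm-phi-service:
    resources:
      gpu: 1
EOF
export BENTOML_CONFIG="$PWD/bentoml_config.yaml"
openllm start microsoft/Phi-3-mini-4k-instruct --trust-remote-code
```

This only addresses the config-validation step; on a machine without a GPU, vLLM will still fail to find a device later on.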

Logs

No response

Environment

bentoml env

Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.2.17
python: 3.10.12
platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
uid_gid: 1000:1000

pip_packages
```
aiohttp==3.9.5 aiosignal==1.3.1 annotated-types==0.7.0 antlr4-python3-runtime==4.9.3 anyio==4.4.0 appdirs==1.4.4 asgiref==3.8.1 async-timeout==4.0.3 attrs==23.2.0 backoff==2.2.1 bcrypt==4.1.3 beautifulsoup4==4.12.3 bentoml==1.2.17 bidict==0.23.1 blinker==1.8.2 bs4==0.0.2 build==1.2.1 cachetools==5.3.3 cattrs==23.1.2 certifi==2024.6.2 cffi==1.16.0 chardet==5.2.0 charset-normalizer==3.3.2 chroma-hnswlib==0.7.3 chromadb==0.5.0 circus==0.18.0 click==8.1.7 click-option-group==0.5.6 cloudpickle==3.0.0 cmake==3.29.3 coloredlogs==15.0.1 command-not-found==0.3 contourpy==1.2.1 cryptography==42.0.8 cuda-python==12.5.0 cycler==0.12.1 dataclasses-json==0.6.6 dbus-python==1.2.18 deepdiff==7.0.1 deepmerge==1.1.1 Deprecated==1.2.14 dirtyjson==1.0.8 diskcache==5.6.3 distro==1.7.0 distro-info==1.1+ubuntu0.1 dnspython==2.6.1 effdet==0.4.1 einops==0.8.0 email_validator==2.1.1 emoji==2.12.1 et-xmlfile==1.1.0 exceptiongroup==1.2.1 fastapi==0.111.0 fastapi-cli==0.0.4 fastcore==1.5.44 filelock==3.14.0 filetype==1.2.0 Flask==3.0.3 Flask-SocketIO==5.3.6 flatbuffers==24.3.25 fonttools==4.53.0 frozenlist==1.4.1 fs==2.4.16 fsspec==2024.6.0 ghapi==1.0.5 google-api-core==2.19.0 google-auth==2.29.0 google-cloud-vision==3.7.2 googleapis-common-protos==1.63.1 greenlet==3.0.3 grpcio==1.64.1 grpcio-status==1.62.2 h11==0.14.0 httpcore==1.0.5 httplib2==0.20.2 httptools==0.6.1 httpx==0.27.0 huggingface-hub==0.23.3 humanfriendly==10.0 idna==3.7 importlib-metadata==6.11.0 importlib_resources==6.4.0 inflection==0.5.1 interegular==0.3.3 iopath==0.1.10 itsdangerous==2.2.0 jeepney==0.7.1 Jinja2==3.1.4 joblib==1.4.2 jsonpatch==1.33 jsonpath-python==1.0.6 jsonpointer==2.4 jsonschema==4.22.0 jsonschema-specifications==2023.12.1 keyring==23.5.0 kiwisolver==1.4.5 kubernetes==29.0.0 langchain==0.2.2 langchain-community==0.2.2 langchain-core==0.2.4 langchain-text-splitters==0.2.1 langdetect==1.0.9 langsmith==0.1.72 lark==1.1.9 launchpadlib==1.10.16 layoutparser==0.3.4 lazr.restfulclient==0.14.4 lazr.uri==1.0.6 llama-index-core==0.10.43 llama-index-llms-openllm==0.1.5 llamaindex-py-client==0.1.19 llvmlite==0.42.0 lm-format-enforcer==0.10.1 lxml==5.2.2 Markdown==3.6 markdown-it-py==3.0.0 MarkupSafe==2.1.5 marshmallow==3.21.2 matplotlib==3.9.0 mdurl==0.1.2 mmh3==4.1.0 monotonic==1.6 more-itertools==8.10.0 mpmath==1.3.0 msg-parser==1.2.0 msgpack==1.0.8 multidict==6.0.5 mypy-extensions==1.0.0 nest-asyncio==1.6.0 netifaces==0.11.0 networkx==3.3 ninja==1.11.1.1 nltk==3.8.1 numba==0.59.1 numpy==1.26.4 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-ml-py==11.525.150 nvidia-nccl-cu12==2.20.5 nvidia-nvjitlink-cu12==12.5.40 nvidia-nvtx-cu12==12.1.105 oauthlib==3.2.2 olefile==0.47 omegaconf==2.3.0 onnx==1.16.1 onnxruntime==1.18.0 openai==1.31.1 opencv-python==4.10.0.82 openllm==0.5.5 openllm-client==0.5.5 openllm-core==0.5.5 openpyxl==3.1.3 opentelemetry-api==1.20.0 opentelemetry-exporter-otlp-proto-common==1.25.0 opentelemetry-exporter-otlp-proto-grpc==1.25.0 opentelemetry-instrumentation==0.41b0 opentelemetry-instrumentation-aiohttp-client==0.41b0 opentelemetry-instrumentation-asgi==0.41b0 opentelemetry-instrumentation-fastapi==0.46b0 opentelemetry-proto==1.25.0 opentelemetry-sdk==1.20.0 opentelemetry-semantic-conventions==0.41b0 opentelemetry-util-http==0.41b0 ordered-set==4.1.0 orjson==3.10.3
outlines==0.0.34 overrides==7.7.0 packaging==23.2 pandas==2.2.2 pathspec==0.12.1 pdf2image==1.17.0 pdfminer.six==20231228 pdfplumber==0.11.0 pikepdf==9.0.0 pillow==10.3.0 pillow_heif==0.16.0 pip-requirements-parser==32.0.1 pip-tools==7.4.1 portalocker==2.8.2 posthog==3.5.0 prometheus-fastapi-instrumentator==7.0.0 prometheus_client==0.20.0 proto-plus==1.23.0 protobuf==4.25.3 psutil==5.9.8 py-cpuinfo==9.0.0 pyarrow==16.1.0 pyasn1==0.6.0 pyasn1_modules==0.4.0 pycocotools==2.0.7 pycparser==2.22 pydantic==2.7.3 pydantic_core==2.18.4 Pygments==2.18.0 PyGObject==3.42.1 PyJWT==2.3.0 pypandoc==1.13 pyparsing==2.4.7 pypdf==4.2.0 pypdfium2==4.30.0 PyPika==0.48.9 pyproject_hooks==1.1.0 pytesseract==0.3.10 python-apt==2.4.0+ubuntu2 python-dateutil==2.9.0.post0 python-docx==1.1.2 python-dotenv==1.0.1 python-engineio==4.9.1 python-iso639==2024.4.27 python-json-logger==2.0.7 python-magic==0.4.27 python-multipart==0.0.9 python-pptx==0.6.23 python-socketio==5.11.2 pytz==2024.1 PyYAML==6.0.1 pyzmq==26.0.3 rapidfuzz==3.9.3 ray==2.23.0 referencing==0.35.1 regex==2024.5.15 requests==2.32.3 requests-oauthlib==2.0.0 requests-toolbelt==1.0.0 rich==13.7.1 rpds-py==0.18.1 rsa==4.9 safetensors==0.4.3 schema==0.7.7 scipy==1.13.1 SecretStorage==3.3.1 sentencepiece==0.2.0 shellingham==1.5.4 simple-di==0.1.5 simple-websocket==1.0.0 six==1.16.0 sniffio==1.3.1 soupsieve==2.5 SQLAlchemy==2.0.30 starlette==0.37.2 sympy==1.12.1 systemd-python==234 tabulate==0.9.0 tenacity==8.3.0 tiktoken==0.7.0 timm==1.0.3 tokenizers==0.19.1 tomli==2.0.1 tomli_w==1.0.0 torch==2.3.0 torchvision==0.18.0 tornado==6.4 tqdm==4.66.4 transformers==4.41.2 triton==2.3.0 typer==0.12.3 typing-inspect==0.9.0 typing_extensions==4.12.1 tzdata==2024.1 ubuntu-advantage-tools==8001 ufw==0.36.1 ujson==5.10.0 unattended-upgrades==0.1 unstructured==0.14.4 unstructured-client==0.23.0 unstructured-inference==0.7.33 unstructured.pytesseract==0.3.12 urllib3==2.2.1 uvicorn==0.30.1 uvloop==0.19.0 vllm==0.4.3 vllm-flash-attn==2.5.8.post2 wadllib==1.3.6 watchfiles==0.22.0 websocket-client==1.8.0 websockets==12.0 Werkzeug==3.0.3 wrapt==1.16.0 wsproto==1.2.0 xformers==0.0.26.post1 xlrd==2.0.1 XlsxWriter==3.2.0 yarl==1.9.4 zipp==1.0.0
```

transformers-cli env


System information (Optional)

CPU: AMD Ryzen 5 5500U with Radeon Graphics
GPU: (not in use) AMD Radeon(TM) Graphics
RAM: 8GB
Platform: WSL Ubuntu. The Python interpreter is already set to WSL.

aarnphm commented 1 month ago

At the moment, openllm >0.5 requires a GPU. I wonder if your AMD GPU is being picked up correctly?

Do you see usage on your GPU?
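Not part of the original exchange, but a quick way to check this from the command line (a sketch; `rocm-smi` is only present when AMD's ROCm drivers are installed, which may not be the case under WSL):

```bash
# NVIDIA GPUs: refresh utilization and memory usage every second.
watch -n 1 nvidia-smi

# AMD GPUs (requires ROCm drivers): show GPU utilization.
rocm-smi --showuse

# Check what PyTorch can see; "False 0" means no usable CUDA device.
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```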

Said-Ikki commented 1 month ago

> At the moment, openllm >0.5 requires a GPU. I wonder if your AMD GPU is being picked up correctly?
>
> Do you see usage on your GPU?

I don't think there were any drivers for this GPU. In any case, I figured a GPU was necessary, so I used my 4060 and ran into some issues with vLLM. I also tried it on an Ubuntu VM without a GPU, and that seemed to work the 'most correct' before it realized I didn't have a GPU. I'll share more info about the 4060 in a sec, but I think WSL makes things kind of wonky.

Said-Ikki commented 1 month ago

After I run `openllm start microsoft/Phi-3-mini-4k-instruct --trust-remote-code`, I get this error. What's interesting is that it doesn't just stop; it keeps retrying and doesn't work regardless.

```
/home/ssikki/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 931/931 [00:00<00:00, 11.0MB/s]
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
```

aarnphm commented 1 month ago

This is a usage problem. You are running a model with a 4k context length, which means the amount of GPU memory required for the KV cache is ~4GB.

microsoft/Phi-3-mini-4k-instruct requires at least 8GB just to load as fp16, so on a 4060 that leaves not a lot of memory for the KV cache. Check out `--gpu-memory-utilization` from vLLM to configure this.

I would suggest running on a larger GPU, at least an L4, for a 4k context.

You can also try a quantized version.
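Not from the thread itself, but a sketch of how those knobs are commonly applied. Whether `openllm start` forwards them may depend on the version; with the vLLM 0.4.x listed above, they can be passed to vLLM's own OpenAI-compatible server directly:

```bash
# Sketch: serve the model with vLLM directly, capping the fraction of GPU
# memory vLLM may claim and shrinking the context window so the KV cache
# fits. A rough fp16 KV-cache size per sequence is
#   2 (K and V) x num_layers x hidden_size x context_len x 2 bytes.
python -m vllm.entrypoints.openai.api_server \
  --model microsoft/Phi-3-mini-4k-instruct \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 2048
```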

Said-Ikki commented 1 month ago

Dumb question: how can I run the quantized version? And while I'm here, will I need to clear the cache out to make space, or will it be fine?

aarnphm commented 1 month ago

Check out the Hugging Face Hub for pre-quantized models. vLLM currently only supports pre-quantized models.
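As an illustration of what serving a pre-quantized checkpoint can look like (not from the thread; the model ID below is a hypothetical placeholder, and exact flag support depends on the vLLM version):

```bash
# Hypothetical example: serve an AWQ-quantized model from the Hugging Face
# Hub. "some-org/Phi-3-mini-4k-instruct-AWQ" is a placeholder; substitute a
# real pre-quantized repo found on huggingface.co.
python -m vllm.entrypoints.openai.api_server \
  --model some-org/Phi-3-mini-4k-instruct-AWQ \
  --quantization awq \
  --max-model-len 2048
```

On the cache question: downloaded weights live in the Hugging Face cache (`~/.cache/huggingface` by default), so clearing old models there frees disk space, but it is not required for the quantized weights to load as long as there is room on disk.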