Closed: Said-Ikki closed this 4 months ago
At the moment openllm >0.5 requires a GPU. I wonder if your AMD GPU is getting picked up correctly?
Do you see usage on your GPU?
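A quick way to check whether the GPU is visible at all from the same environment (a minimal sketch, assuming the Python interpreter that runs openllm has torch installed):

```python
# If this prints False / lists no devices, vLLM's CUDA device_config cannot work.
import torch

print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"device {i}:", torch.cuda.get_device_name(i))
```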
I don't think there were any drivers for this GPU. In any case, I figured a GPU was necessary, so I used my 4060 and ran into some issues with vLLM. I also tried it on an Ubuntu VM without a GPU, and that seemed to work the 'most correct' before it realized I didn't have a GPU. I will share more info about the 4060 in a sec, but I think WSL makes things kinda wonky.
After I run the following: `openllm start microsoft/Phi-3-mini-4k-instruct --trust-remote-code`, I get this error. What's interesting is that it sorta just doesn't stop; it keeps retrying and doesn't work regardless.
/home/ssikki/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 931/931 [00:00<00:00, 11.0MB/s]
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
`resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/home/ssikki/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
INFO 06-11 11:19:13 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='microsoft/Phi-3-mini-4k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-mini-4k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=microsoft/Phi-3-mini-4k-instruct)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-11 11:19:14 utils.py:451] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 06-11 11:19:14 selector.py:139] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-11 11:19:14 selector.py:51] Using XFormers backend.
INFO 06-11 11:19:15 selector.py:139] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-11 11:19:15 selector.py:51] Using XFormers backend.
INFO 06-11 11:19:15 weight_utils.py:207] Using model weights format ['.safetensors']
INFO 06-11 11:19:22 model_runner.py:146] Loading model weights took 7.1183 GB
INFO 06-11 11:19:33 gpu_executor.py:83] # GPU blocks: 50, # CPU blocks: 682
Traceback (most recent call last):
File "/home/ssikki/.local/lib/python3.10/site-packages/_openllm_tiny/_llm.py", line 176, in _model
return engine_cls.from_engine_args(self._get_engine_args(self._mode))
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
engine = cls(
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 340, in init
self.engine = self._init_engine(args, kwargs)
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
return engine_class(*args, *kwargs)
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 235, in init
self._initialize_kv_caches()
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in _initialize_kv_caches
self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 86, in initialize_cache
self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 187, in initialize_cache
raise_if_cache_size_invalid(num_gpu_blocks,
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 375, in raise_if_cache_size_invalid
raise ValueError(
ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (800). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
2024-06-11T11:19:33-0400 [ERROR] [entry_service:llm-phi-service:1] Initializing service error
Traceback (most recent call last):
File "/home/ssikki/.local/lib/python3.10/site-packages/_openllm_tiny/_llm.py", line 176, in _model
return engine_cls.from_engine_args(self._get_engine_args(self._mode))
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
engine = cls(
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 340, in init
self.engine = self._init_engine(args, kwargs)
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
return engine_class(*args, **kwargs)
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 235, in init
self._initialize_kv_caches()
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in _initialize_kv_caches
self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 86, in initialize_cache
self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 187, in initialize_cache
raise_if_cache_size_invalid(num_gpu_blocks,
File "/home/ssikki/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 375, in raise_if_cache_size_invalid
raise ValueError(
ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (800). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

This is a usage problem. You are running a model with a 4k context length, which means the GPU memory required for the KV cache is ~4 GB.
microsoft/Phi-3-mini-4k-instruct requires at least 8 GB to load as fp16, so on a 4060 that doesn't leave much memory for the KV cache. Check out --gpu-memory-utilization from vLLM to configure this.
I would suggest running on a larger GPU, at least an L4, for a 4k context.
You can also try a quantized version.
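As a rough illustration of those two knobs (using vLLM's Python API directly rather than openllm, since the exact flags openllm forwards may differ): the log above reports 50 GPU blocks, which at vLLM's default block size of 16 tokens is exactly the 800-token KV cache limit in the error, so either freeing more memory or shrinking the context should get past it. A minimal sketch, assuming vLLM 0.4.x is installed:

```python
# Sketch only: reduce the context window and raise the fraction of GPU memory
# vLLM is allowed to use, so the KV cache fits alongside the fp16 weights.
from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,  # default is 0.9; a higher value leaves more room for the cache
    max_model_len=2048,           # halve the 4k context so 2048 tokens of KV cache are enough
)
print(llm.generate("Hello, my name is"))
```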
Dumb question: how can I run the quantized version? And while I'm here, will I need to clear out the cache to make space, or will it be fine?
Check out the Hugging Face Hub for pre-quantized models. vLLM currently only supports pre-quantized models.
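For example, loading a pre-quantized AWQ checkpoint from the Hub with vLLM looks roughly like this; the repo name below is only an illustrative AWQ build, not something specific to this issue, so substitute whichever pre-quantized model you actually want:

```python
# Sketch only: "TheBloke/Mistral-7B-Instruct-v0.2-AWQ" is just an example of a
# pre-quantized Hub repo; pick an AWQ/GPTQ build of the model you need.
from vllm import LLM

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",   # must match how the checkpoint was quantized (awq, gptq, ...)
    max_model_len=2048,   # keep the context modest on an 8 GB card
)
```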
Describe the bug
I recently tried using openllm to connect to llama and it would give me some bentoml config errors. I'm not sure if it's because I don't have a GPU, but I didn't see any evidence online for that being the case.
To reproduce
`resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Serialisation format is not specified. Defaulting to 'safetensors'. Your model might not work with this format. Make sure to explicitly specify the serialisation format.
Traceback (most recent call last):
  File "/home/ssikki/.local/bin/openllm", line 8, in
Logs
No response
Environment
bentoml env
Environment variable
System information
bentoml: 1.2.17
python: 3.10.12
platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
uid_gid: 1000:1000
pip_packages:
``` aiohttp==3.9.5 aiosignal==1.3.1 annotated-types==0.7.0 antlr4-python3-runtime==4.9.3 anyio==4.4.0 appdirs==1.4.4 asgiref==3.8.1 async-timeout==4.0.3 attrs==23.2.0 backoff==2.2.1 bcrypt==4.1.3 beautifulsoup4==4.12.3 bentoml==1.2.17 bidict==0.23.1 blinker==1.8.2 bs4==0.0.2 build==1.2.1 cachetools==5.3.3 cattrs==23.1.2 certifi==2024.6.2 cffi==1.16.0 chardet==5.2.0 charset-normalizer==3.3.2 chroma-hnswlib==0.7.3 chromadb==0.5.0 circus==0.18.0 click==8.1.7 click-option-group==0.5.6 cloudpickle==3.0.0 cmake==3.29.3 coloredlogs==15.0.1 command-not-found==0.3 contourpy==1.2.1 cryptography==42.0.8 cuda-python==12.5.0 cycler==0.12.1 dataclasses-json==0.6.6 dbus-python==1.2.18 deepdiff==7.0.1 deepmerge==1.1.1 Deprecated==1.2.14 dirtyjson==1.0.8 diskcache==5.6.3 distro==1.7.0 distro-info==1.1+ubuntu0.1 dnspython==2.6.1 effdet==0.4.1 einops==0.8.0 email_validator==2.1.1 emoji==2.12.1 et-xmlfile==1.1.0 exceptiongroup==1.2.1 fastapi==0.111.0 fastapi-cli==0.0.4 fastcore==1.5.44 filelock==3.14.0 filetype==1.2.0 Flask==3.0.3 Flask-SocketIO==5.3.6 flatbuffers==24.3.25 fonttools==4.53.0 frozenlist==1.4.1 fs==2.4.16 fsspec==2024.6.0 ghapi==1.0.5 google-api-core==2.19.0 google-auth==2.29.0 google-cloud-vision==3.7.2 googleapis-common-protos==1.63.1 greenlet==3.0.3 grpcio==1.64.1 grpcio-status==1.62.2 h11==0.14.0 httpcore==1.0.5 httplib2==0.20.2 httptools==0.6.1 httpx==0.27.0 huggingface-hub==0.23.3 humanfriendly==10.0 idna==3.7 importlib-metadata==6.11.0 importlib_resources==6.4.0 inflection==0.5.1 interegular==0.3.3 iopath==0.1.10 itsdangerous==2.2.0 jeepney==0.7.1 Jinja2==3.1.4 joblib==1.4.2 jsonpatch==1.33 jsonpath-python==1.0.6 jsonpointer==2.4 jsonschema==4.22.0 jsonschema-specifications==2023.12.1 keyring==23.5.0 kiwisolver==1.4.5 kubernetes==29.0.0 langchain==0.2.2 langchain-community==0.2.2 langchain-core==0.2.4 langchain-text-splitters==0.2.1 langdetect==1.0.9 langsmith==0.1.72 lark==1.1.9 launchpadlib==1.10.16 layoutparser==0.3.4 lazr.restfulclient==0.14.4 lazr.uri==1.0.6 llama-index-core==0.10.43 llama-index-llms-openllm==0.1.5 llamaindex-py-client==0.1.19 llvmlite==0.42.0 lm-format-enforcer==0.10.1 lxml==5.2.2 Markdown==3.6 markdown-it-py==3.0.0 MarkupSafe==2.1.5 marshmallow==3.21.2 matplotlib==3.9.0 mdurl==0.1.2 mmh3==4.1.0 monotonic==1.6 more-itertools==8.10.0 mpmath==1.3.0 msg-parser==1.2.0 msgpack==1.0.8 multidict==6.0.5 mypy-extensions==1.0.0 nest-asyncio==1.6.0 netifaces==0.11.0 networkx==3.3 ninja==1.11.1.1 nltk==3.8.1 numba==0.59.1 numpy==1.26.4 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-ml-py==11.525.150 nvidia-nccl-cu12==2.20.5 nvidia-nvjitlink-cu12==12.5.40 nvidia-nvtx-cu12==12.1.105 oauthlib==3.2.2 olefile==0.47 omegaconf==2.3.0 onnx==1.16.1 onnxruntime==1.18.0 openai==1.31.1 opencv-python==4.10.0.82 openllm==0.5.5 openllm-client==0.5.5 openllm-core==0.5.5 openpyxl==3.1.3 opentelemetry-api==1.20.0 opentelemetry-exporter-otlp-proto-common==1.25.0 opentelemetry-exporter-otlp-proto-grpc==1.25.0 opentelemetry-instrumentation==0.41b0 opentelemetry-instrumentation-aiohttp-client==0.41b0 opentelemetry-instrumentation-asgi==0.41b0 opentelemetry-instrumentation-fastapi==0.46b0 opentelemetry-proto==1.25.0 opentelemetry-sdk==1.20.0 opentelemetry-semantic-conventions==0.41b0 opentelemetry-util-http==0.41b0 ordered-set==4.1.0 orjson==3.10.3 
outlines==0.0.34 overrides==7.7.0 packaging==23.2 pandas==2.2.2 pathspec==0.12.1 pdf2image==1.17.0 pdfminer.six==20231228 pdfplumber==0.11.0 pikepdf==9.0.0 pillow==10.3.0 pillow_heif==0.16.0 pip-requirements-parser==32.0.1 pip-tools==7.4.1 portalocker==2.8.2 posthog==3.5.0 prometheus-fastapi-instrumentator==7.0.0 prometheus_client==0.20.0 proto-plus==1.23.0 protobuf==4.25.3 psutil==5.9.8 py-cpuinfo==9.0.0 pyarrow==16.1.0 pyasn1==0.6.0 pyasn1_modules==0.4.0 pycocotools==2.0.7 pycparser==2.22 pydantic==2.7.3 pydantic_core==2.18.4 Pygments==2.18.0 PyGObject==3.42.1 PyJWT==2.3.0 pypandoc==1.13 pyparsing==2.4.7 pypdf==4.2.0 pypdfium2==4.30.0 PyPika==0.48.9 pyproject_hooks==1.1.0 pytesseract==0.3.10 python-apt==2.4.0+ubuntu2 python-dateutil==2.9.0.post0 python-docx==1.1.2 python-dotenv==1.0.1 python-engineio==4.9.1 python-iso639==2024.4.27 python-json-logger==2.0.7 python-magic==0.4.27 python-multipart==0.0.9 python-pptx==0.6.23 python-socketio==5.11.2 pytz==2024.1 PyYAML==6.0.1 pyzmq==26.0.3 rapidfuzz==3.9.3 ray==2.23.0 referencing==0.35.1 regex==2024.5.15 requests==2.32.3 requests-oauthlib==2.0.0 requests-toolbelt==1.0.0 rich==13.7.1 rpds-py==0.18.1 rsa==4.9 safetensors==0.4.3 schema==0.7.7 scipy==1.13.1 SecretStorage==3.3.1 sentencepiece==0.2.0 shellingham==1.5.4 simple-di==0.1.5 simple-websocket==1.0.0 six==1.16.0 sniffio==1.3.1 soupsieve==2.5 SQLAlchemy==2.0.30 starlette==0.37.2 sympy==1.12.1 systemd-python==234 tabulate==0.9.0 tenacity==8.3.0 tiktoken==0.7.0 timm==1.0.3 tokenizers==0.19.1 tomli==2.0.1 tomli_w==1.0.0 torch==2.3.0 torchvision==0.18.0 tornado==6.4 tqdm==4.66.4 transformers==4.41.2 triton==2.3.0 typer==0.12.3 typing-inspect==0.9.0 typing_extensions==4.12.1 tzdata==2024.1 ubuntu-advantage-tools==8001 ufw==0.36.1 ujson==5.10.0 unattended-upgrades==0.1 unstructured==0.14.4 unstructured-client==0.23.0 unstructured-inference==0.7.33 unstructured.pytesseract==0.3.12 urllib3==2.2.1 uvicorn==0.30.1 uvloop==0.19.0 vllm==0.4.3 vllm-flash-attn==2.5.8.post2 wadllib==1.3.6 watchfiles==0.22.0 websocket-client==1.8.0 websockets==12.0 Werkzeug==3.0.3 wrapt==1.16.0 wsproto==1.2.0 xformers==0.0.26.post1 xlrd==2.0.1 XlsxWriter==3.2.0 yarl==1.9.4 zipp==1.0.0 ```
transformers-cli env
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
transformers version: 4.41.2
System information (Optional)
CPU: AMD Ryzen 5 5500U with Radeon Graphics
GPU: (not in use) AMD Radeon(TM) Graphics
RAM: 8GB
Platform: WSL Ubuntu. The python interpreter is set to WSL already.