bug: Cannot Run an OpenLLM server regardless of where I try to get it from or what model I use #1009

Closed Said-Ikki closed 1 month ago

Said-Ikki commented 1 month ago

Describe the bug

I recently tried using openllm to connect to llama, and it gave me some bentoml config errors. I'm not sure if it's because I don't have a GPU, but I didn't see any evidence online for that being the case.

To reproduce

  1. openllm start microsoft/Phi-3-mini-4k-instruct --trust-remote-code (but it could be any of the other models)
  2. errors

     /home/ssikki/.local/lib/python3.10/site-packages/huggingface_hub/ FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
       warnings.warn(
     Serialisation format is not specified. Defaulting to 'safetensors'. Your model might not work with this format. Make sure to explicitly specify the serialisation format.
     Traceback (most recent call last):
       File "/home/ssikki/.local/bin/openllm", line 8, in <module>
         sys.exit(cli())
       File "/home/ssikki/.local/lib/python3.10/site-packages/click/", line 1157, in __call__
         return self.main(*args, **kwargs)
       File "/home/ssikki/.local/lib/python3.10/site-packages/click/", line 1078, in main
         rv = self.invoke(ctx)
       File "/home/ssikki/.local/lib/python3.10/site-packages/click/", line 1688, in invoke
         return _process_result(sub_ctx.command.invoke(sub_ctx))
       File "/home/ssikki/.local/lib/python3.10/site-packages/click/", line 1434, in invoke
         return ctx.invoke(self.callback, **ctx.params)
       File "/home/ssikki/.local/lib/python3.10/site-packages/click/", line 783, in invoke
         return __callback(*args, **kwargs)
       File "/home/ssikki/.local/lib/python3.10/site-packages/_openllm_tiny/", line 284, in start_command
         load('.', working_dir=working_dir).inject_config()
       File "/home/ssikki/.local/lib/python3.10/site-packages/_bentoml_sdk/service/", line 277, in inject_config
         load_config(override_defaults=override_defaults, use_version=2)
       File "/home/ssikki/.local/lib/python3.10/site-packages/bentoml/_internal/configuration/", line 191, in load_config
         BentoMLConfiguration(
       File "/home/ssikki/.local/lib/python3.10/site-packages/bentoml/_internal/configuration/", line 140, in __init__
         raise BentoMLConfigException(
     bentoml.exceptions.BentoMLConfigException: Invalid configuration file was given:
     Key 'services' error:
     Key 'llm-phi-service' error:
     Key 'resources' error:
     Or({Optional('cpu'): <class 'str'>, Optional('memory'): <class 'str'>, Optional('gpu'): And(<class 'numbers.Real'>, <function ensure_larger_than.<locals>.v at 0x7f74122f8550>), Optional('gpu_type'): <class 'str'>, Optional('tpu_type'): <class 'str'>}, None) did not validate {'gpu': 0}
     Key 'gpu' error:
     v(0) should evaluate to True
     None does not match {'gpu': 0}
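
For readers trying to decode the last lines of that traceback: the config schema only accepts a 'gpu' value that is a real number and passes the ensure_larger_than check, and on this machine the generated value was 0. The snippet below is a rough stdlib-only sketch of that kind of check, not BentoML's actual code. The name ensure_larger_than comes from the traceback; validate_resources and its message format are hypothetical and exist only for illustration.

```python
from numbers import Real


def ensure_larger_than(target):
    """Return a validator that accepts only values strictly greater than target."""
    def v(value):
        return value > target
    return v


def validate_resources(resources):
    """Hypothetical helper: list the problems in a 'resources' mapping.

    Mirrors the shape of the schema in the error message: 'resources' may be
    None, and an optional 'gpu' entry must be a real number larger than 0.
    """
    if resources is None:
        return []
    problems = []
    gpu = resources.get("gpu")
    if gpu is not None:
        if not isinstance(gpu, Real) or not ensure_larger_than(0)(gpu):
            problems.append(f"Key 'gpu' error: v({gpu!r}) should evaluate to True")
    return problems


# A machine with no usable GPU ends up with gpu = 0, which fails the check.
print(validate_resources({"gpu": 0}))
# One or more GPUs passes.
print(validate_resources({"gpu": 1}))
```

Under these assumptions, {'gpu': 0} is rejected because 0 is not larger than 0, which matches the "v(0) should evaluate to True" line in the report.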


bentoml env

Environment variable


System information

bentoml: 1.2.17
python: 3.10.12
platform: Linux-
uid_gid: 1000:1000

transformers-cli env

System information (Optional)

CPU: AMD Ryzen 5 5500U with Radeon Graphics
GPU: (not in use) AMD Radeon(TM) Graphics
RAM: 8GB
Platform: WSL Ubuntu. The python interpreter is set to WSL already

aarnphm commented 1 month ago

at the moment openllm >0.5 requires GPU. I wonder if your AMD GPU is getting picked up correctly?

do you see usage on your GPU?

Said-Ikki commented 1 month ago

> at the moment openllm >0.5 requires GPU. I wonder if your AMD GPU is getting picked up correctly?
>
> do you see usage on your GPU?

I don't think there were any drivers for this GPU. In any case, I figured a GPU was necessary, so I used my 4060 and ran into some issues with vLLM. I also tried it on an Ubuntu VM without a GPU, and that seemed to work the 'most correct' before it realized I didn't have a GPU. I will share more info about the 4060 in a sec, but I think WSL makes things kinda wonky.

Said-Ikki commented 1 month ago

After I run the following: openllm start microsoft/Phi-3-mini-4k-instruct --trust-remote-code I get this error. What's interesting is that it sorta just doesn't stop; it keeps retrying and doesn't work regardless.

/home/ssikki/.local/lib/python3.10/site-packages/huggingface_hub/ FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 931/931 [00:00<00:00, 11.0MB/s]
A new version of the following files was downloaded from

aarnphm commented 1 month ago

This is a usage problem. You are running a model with a 4k context length. This means the amount of GPU memory required for the KV cache is ~4GB.

microsoft/Phi-3-mini-4k-instruct requires at least 8GB to load as fp16. So on a 4060 this leaves very little memory for the KV cache. Check out --gpu-memory-utilization from vllm to configure this.

I would suggest running on a larger GPU, at least an L4, for 4k context.

You can also try a quantized version.
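
The memory argument above can be checked with a back-of-the-envelope calculation. This is only a rough sketch under stated assumptions, not how vLLM does its accounting: roughly 3.8 billion parameters for Phi-3-mini, 2 bytes per parameter for fp16, 8 GB of VRAM on the 4060, 24 GB on an L4, and a memory utilization fraction of 0.9 (vLLM's default for --gpu-memory-utilization). The function names are made up for illustration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights in GB (fp16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9


def kv_cache_budget_gb(total_vram_gb: float, utilization: float, weights_gb: float) -> float:
    """Memory left for the KV cache after the weights are loaded.

    utilization is the fraction of VRAM the engine is allowed to use.
    A negative result means the weights alone do not fit in the budget.
    """
    return total_vram_gb * utilization - weights_gb


# Assumption: Phi-3-mini has about 3.8e9 parameters.
weights_gb = weight_memory_gb(3.8e9)

# Assumption: 8 GB card (4060) with a 0.9 utilization fraction.
budget = kv_cache_budget_gb(8.0, 0.9, weights_gb)

# Assumption: 24 GB card (L4) with the same utilization fraction.
budget_l4 = kv_cache_budget_gb(24.0, 0.9, weights_gb)

print(f"weights: {weights_gb:.1f} GB")
print(f"KV cache budget on 8 GB: {budget:.1f} GB")
print(f"KV cache budget on 24 GB: {budget_l4:.1f} GB")
```

With these numbers the fp16 weights alone exceed the 8 GB budget, so raising the utilization fraction helps only marginally; a larger card or a quantized checkpoint changes the picture much more.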

Said-Ikki commented 1 month ago

Dumb question: how can I run the quantized version? And while I'm here, will I need to clear the cache out to make space, or will it be fine?

aarnphm commented 1 month ago

Check out the Hugging Face Hub for pre-quantized models. vLLM currently only supports pre-quantized models.