Closed: Lukikay closed this issue 1 year ago
One side note: only CUDA 11.8 is fully supported.
Recently our scheduling strategy changed so that we only create one runner instance regardless of how many devices are available. This behaviour is controlled via --workers-per-resource. --device indicates which GPUs are available to the Runner; --workers-per-resource is probably what you want here.
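For example, to keep a single runner worker but let it see two GPUs, something like the following should work (a sketch: the 0.5 assumes --workers-per-resource follows BentoML's convention where a fractional value shares one worker across several GPUs, so adjust it to your setup):

openllm start baichuan --model-id baichuan-inc/Baichuan-13B-Chat --device 0,1 --workers-per-resource 0.5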
Hi, thanks for your suggestion. I tried on another machine (an RTX 4090 with CUDA 11.8); unfortunately, I got the same error: Error: [bentoml-cli] "serve" failed: Model 'pt-baichuan-inc-baichuan-13b-chat:a4a558127068f2ce965aa56aeb826bf501a68970' is not found in BentoML store <osfs '/root/bentoml/models'>, you may need to run "bentoml models pull" first
Can you show the whole stack trace?
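In the meantime, it may help to check what the local store actually contains; bentoml models list and bentoml models get are standard BentoML CLI commands (the tag below is copied from your error message):

bentoml models list
bentoml models get pt-baichuan-inc-baichuan-13b-chat:a4a558127068f2ce965aa56aeb826bf501a68970

If the tag does not show up even though files exist under /root/bentoml/models, the store metadata and the files on disk are probably out of sync.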
Describe the bug
Hi there, thanks for providing this brilliant work!
I cannot run the Baichuan-13B-Chat model successfully; it says the model is not found in BentoML store <osfs '/root/bentoml/models'>, you may need to run `bentoml models pull` first.
However, I found that the safetensors files were already generated in /root/bentoml/models.
Thanks in advance.
To reproduce
openllm start baichuan --model-id baichuan-inc/Baichuan-13B-Chat --device 0,1 --debug
Logs
Environment
Python: 3.10.12
CUDA: 11.2.2
openllm: 0.2.25
bentoml: 1.1.1
System information (Optional)
RAM: 256G
GPU: 4 × RTX 3090 (running in a Docker container)