bentoml / BentoML

The easiest way to serve AI apps and models - Build reliable Inference APIs, LLM apps, Multi-model chains, RAG service, and much more!
https://bentoml.com
Apache License 2.0
7.06k stars 781 forks

bug: fail to build bento on non-GPU env because NvidiaGpuResource.from_system throws #4357

Open Epsilon314 opened 9 months ago

Epsilon314 commented 9 months ago

Describe the bug

Building a bento fails on a machine that has the CUDA toolkit installed but no Nvidia GPU. The build fails because the method NvidiaGpuResource.from_system throws at:

try:
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()
    return list(range(device_count))
except (pynvml.NVMLError_LibraryNotFound, OSError):
    logger.debug("GPU not detected. Unable to initialize pynvml lib.")
    return []

The exception pynvml.NVMLError_DriverNotLoaded may also need to be caught, for the case where NVML is present but no GPU is.

To reproduce

No response

Expected behavior

No response

Environment

Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.1.10
python: 3.8.18
platform: Linux-5.4.143.bsk.8-amd64-x86_64-with-glibc2.28
uid_gid: 1001:1001

pip_packages
```
accelerate==0.25.0
aiohttp==3.9.1
aiosignal==1.3.1
anyio==4.1.0
appdirs==1.4.4
asgiref==3.7.2
async-timeout==4.0.3
attrs==23.1.0
bentoml==1.1.10
bitsandbytes==0.41.3.post2
build==0.10.0
cattrs==23.1.2
certifi==2023.11.17
charset-normalizer==3.3.2
circus==0.18.0
click==8.1.7
click-option-group==0.5.6
cloudpickle==3.0.0
coloredlogs==15.0.1
contextlib2==21.6.0
cuda-python==12.3.0
datasets==2.15.0
deepmerge==1.1.0
Deprecated==1.2.14
diffusers==0.24.0
dill==0.3.7
distlib==0.3.8
distro==1.8.0
einops==0.7.0
exceptiongroup==1.2.0
fastcore==1.5.29
filelock==3.13.1
filetype==1.2.0
frozenlist==1.4.0
fs==2.4.16
fsspec==2023.10.0
ghapi==1.0.4
h11==0.14.0
httpcore==1.0.2
httpx==0.25.2
huggingface-hub==0.19.4
humanfriendly==10.0
idna==3.6
importlib-metadata==6.11.0
inflection==0.5.1
Jinja2==3.1.2
markdown-it-py==3.0.0
MarkupSafe==2.1.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
mypy-extensions==1.0.0
networkx==3.1
numpy==1.24.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==11.525.150
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
onediffusion==0.0.3
openllm==0.4.36
openllm-client==0.4.36
openllm-core==0.4.36
opentelemetry-api==1.20.0
opentelemetry-instrumentation==0.41b0
opentelemetry-instrumentation-aiohttp-client==0.41b0
opentelemetry-instrumentation-asgi==0.41b0
opentelemetry-sdk==1.20.0
opentelemetry-semantic-conventions==0.41b0
opentelemetry-util-http==0.41b0
optimum==1.15.0
orjson==3.9.10
packaging==23.2
pandas==2.0.3
pathspec==0.12.1
Pillow==10.1.0
pip-requirements-parser==32.0.1
pip-tools==7.3.0
platformdirs==4.1.0
prometheus-client==0.19.0
protobuf==4.25.1
psutil==5.9.6
pyarrow==14.0.1
pyarrow-hotfix==0.6
pydantic==1.10.13
Pygments==2.17.2
pyparsing==3.1.1
pyproject_hooks==1.0.0
python-dateutil==2.8.2
python-json-logger==2.0.7
python-multipart==0.0.6
pytz==2023.3.post1
PyYAML==6.0.1
pyzmq==25.1.2
regex==2023.10.3
requests==2.31.0
rich==13.7.0
safetensors==0.4.1
schema==0.7.5
scipy==1.10.1
sentencepiece==0.1.99
simple-di==0.1.5
six==1.16.0
sniffio==1.3.0
starlette==0.33.0
sympy==1.12
tabulate==0.9.0
tokenizers==0.15.0
tomli==2.0.1
torch==2.1.1
tornado==6.4
tqdm==4.66.1
transformers==4.36.0
triton==2.1.0
typing_extensions==4.9.0
tzdata==2023.3
urllib3==2.1.0
uvicorn==0.24.0.post1
virtualenv==20.25.0
watchfiles==0.21.0
wcwidth==0.2.12
wrapt==1.16.0
xxhash==3.4.1
yarl==1.9.4
zipp==3.17.0
```

moshemalawach commented 8 months ago

Just adding "pynvml.NVMLError_DriverNotLoaded" to the list of exceptions fixed the bug.

Line after modification: except (pynvml.NVMLError_LibraryNotFound, pynvml.NVMLError_DriverNotLoaded, OSError):
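
For anyone who wants to try the change in isolation, here is a minimal sketch of the detection logic with the extra exception in place. The function name `detect_nvidia_gpus` and its `nvml` parameter are hypothetical, added only so the logic can be exercised without a GPU or driver; this is not BentoML's actual `NvidiaGpuResource.from_system` source, just the same try/except pattern under those assumptions.

```python
import logging
from typing import Any, List, Optional

logger = logging.getLogger(__name__)


def detect_nvidia_gpus(nvml: Optional[Any] = None) -> List[int]:
    """Return the indices of visible Nvidia GPUs, or [] if none are usable.

    `nvml` is a hypothetical injection point: pass an object exposing the
    pynvml names used below, or leave it as None to import pynvml itself.
    """
    if nvml is None:
        try:
            import pynvml as nvml  # shipped by the nvidia-ml-py package
        except ImportError:
            logger.debug("pynvml is not installed. Assuming no GPU.")
            return []

    try:
        nvml.nvmlInit()
        device_count = nvml.nvmlDeviceGetCount()
        return list(range(device_count))
    except (
        nvml.NVMLError_LibraryNotFound,  # NVML shared library is missing
        nvml.NVMLError_DriverNotLoaded,  # library present, driver not loaded
        OSError,
    ):
        logger.debug("GPU not detected. Unable to initialize pynvml lib.")
        return []
```

Called with no arguments on a machine that has the CUDA toolkit but no loaded Nvidia driver, this should take the `except` branch and return an empty list instead of raising.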