NVIDIA / nim-deploy

A collection of YAML files, Helm Charts, Operator code, and guides to act as an example reference implementation for NVIDIA NIM deployment.
https://build.nvidia.com/
Apache License 2.0

NVML error when deploying NIM to AWS SageMaker #98

Open nkumaraws opened 1 month ago

nkumaraws commented 1 month ago

I am trying to deploy the Llama 3.1-8B Instruct NIM on SageMaker as an endpoint, following this notebook: https://github.com/NVIDIA/nim-deploy/blob/main/cloud-service-providers/aws/sagemaker/nim_llama3.ipynb

I am using a p4d instance and I am running into this error:

Traceback (most recent call last):
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
    func = self.__getitem__(name)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMemoryInfo_v2

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/nim/llm/vllm_nvext/entrypoints/launch.py", line 99, in <module>
    main()
  File "/opt/nim/llm/vllm_nvext/entrypoints/launch.py", line 42, in main
    inference_env = prepare_environment()
  File "/opt/nim/llm/vllm_nvext/entrypoints/args.py", line 143, in prepare_environment
    engine_args, extracted_name = inject_ngc_hub(engine_args)
  File "/opt/nim/llm/vllm_nvext/hub/ngc_injector.py", line 164, in inject_ngc_hub
    system = get_hardware_spec()
  File "/opt/nim/llm/vllm_nvext/hub/hardware_inspect.py", line 289, in get_hardware_spec
    device_mem_total, device_mem_free, device_mem_used, device_mem_reserved = gpus.device_mem(device_id)
  File "/opt/nim/llm/vllm_nvext/hub/hardware_inspect.py", line 119, in device_mem
    mem_data = pynvml.nvmlDeviceGetMemoryInfo(handle, version=pynvml.nvmlMemory_v2)
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/pynvml/nvml.py", line 2438, in nvmlDeviceGetMemoryInfo
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)

I cannot figure out what is happening here and why this failure occurs. Any help is deeply appreciated.
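
For reference, a minimal diagnostic sketch to print the driver and NVML versions visible inside the container (this assumes pynvml can be imported there; older pynvml releases return bytes, newer ones return str):

# Hypothetical diagnostic: show which driver/NVML build the container sees.
# nvmlDeviceGetMemoryInfo_v2 is only exported by newer libnvidia-ml builds,
# so an old host driver branch surfaces as the undefined-symbol error above.
import pynvml

pynvml.nvmlInit()
try:
    driver = pynvml.nvmlSystemGetDriverVersion()
    nvml = pynvml.nvmlSystemGetNVMLVersion()
    print("driver:", driver.decode() if isinstance(driver, bytes) else driver)
    print("nvml:  ", nvml.decode() if isinstance(nvml, bytes) else nvml)
finally:
    pynvml.nvmlShutdown()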

kshitizgupta21 commented 5 days ago

The error happens because the SageMaker driver for p4d and g5 instances is still on the older 470 branch, which does not export the nvmlDeviceGetMemoryInfo_v2 symbol the NIM container expects. To use the latest driver on SageMaker p4d/p4de/g5 instances, add 'InferenceAmiVersion': 'al2-ami-sagemaker-inference-gpu-2' to the ProductionVariants entry inside the EndpointConfig, as shown in this notebook and sketched below.
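
A minimal sketch of where that field goes; the model name, endpoint config name, variant name, and instance type are placeholders, and the NIM model itself is created as in the linked notebook:

import boto3

sm = boto3.client("sagemaker")

# Placeholder names for illustration only.
sm.create_endpoint_config(
    EndpointConfigName="nim-llama3-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "nim-llama3-model",
            "InstanceType": "ml.p4d.24xlarge",
            "InitialInstanceCount": 1,
            # Selects the newer-driver inference AMI for p4d/p4de/g5 instances,
            # avoiding the 470-branch NVML error above.
            "InferenceAmiVersion": "al2-ami-sagemaker-inference-gpu-2",
        }
    ],
)

sm.create_endpoint(
    EndpointName="nim-llama3-endpoint",
    EndpointConfigName="nim-llama3-endpoint-config",
)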