exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
GNU General Public License v3.0
10.96k stars · 639 forks

NVML Shared Library Not Found on ARM64 platform with Mali GPU #76

Closed artistlu closed 3 months ago

artistlu commented 3 months ago

When trying to run the Exo project on an ARM64 platform with a Mali GPU, I am encountering the following error:

(exo) root@linaro-alip:/home/zhanglu/code/exo# python3.12 main.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.

  _____  _____  
 / _ \ \/ / _ \ 
|  __/>  < (_) |
 \___/_/\_\___/ 

Detected system: Linux
Using inference engine: TinygradDynamicShardInferenceEngine
[58291, 53818, 58854]
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
Traceback (most recent call last):
  File "/root/miniconda3/envs/exo/lib/python3.12/site-packages/pynvml/nvml.py", line 1798, in _LoadNvmlLibrary
    nvmlLib = CDLL("libnvidia-ml.so.1")
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/exo/lib/python3.12/ctypes/__init__.py", line 379, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libnvidia-ml.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhanglu/code/exo/main.py", line 44, in <module>
    node = StandardNode(args.node_id, None, inference_engine, discovery, partitioning_strategy=RingMemoryWeightedPartitioningStrategy(), chatgpt_api_endpoint=f"http://localhost:{args.chatgpt_api_port}/v1/chat/completions", web_chat_url=f"http://localhost:{args.chatgpt_api_port}", disable_tui=args.disable_tui, max_generate_tokens=args.max_generate_tokens)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhanglu/code/exo/exo/orchestration/standard_node.py", line 26, in __init__
    self.device_capabilities = device_capabilities()
                               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhanglu/code/exo/exo/topology/device_capabilities.py", line 76, in device_capabilities
    return linux_device_capabilities()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhanglu/code/exo/exo/topology/device_capabilities.py", line 106, in linux_device_capabilities
    pynvml.nvmlInit()
  File "/root/miniconda3/envs/exo/lib/python3.12/site-packages/pynvml/nvml.py", line 1770, in nvmlInit
    nvmlInitWithFlags(0)
  File "/root/miniconda3/envs/exo/lib/python3.12/site-packages/pynvml/nvml.py", line 1753, in nvmlInitWithFlags
    _LoadNvmlLibrary()
  File "/root/miniconda3/envs/exo/lib/python3.12/site-packages/pynvml/nvml.py", line 1800, in _LoadNvmlLibrary
    _nvmlCheckReturn(NVML_ERROR_LIBRARY_NOT_FOUND)
  File "/root/miniconda3/envs/exo/lib/python3.12/site-packages/pynvml/nvml.py", line 833, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_LibraryNotFound: NVML Shared Library Not Found

The traceback shows that linux_device_capabilities() unconditionally calls pynvml.nvmlInit(), i.e. the Exo project assumes the NVIDIA Management Library (NVML) is present on every Linux machine. On ARM64 platforms with Mali GPUs, libnvidia-ml.so.1 does not exist, so startup fails.
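One general way to avoid this class of failure is to probe for a vendor library before calling into it, rather than assuming it exists. A minimal, dependency-free sketch (the helper name `load_first_available` is hypothetical, not part of Exo):

```python
# Hypothetical sketch: probe for a vendor shared library before
# initializing it, instead of assuming NVML exists on every Linux box.
from ctypes import CDLL
from ctypes.util import find_library

def load_first_available(lib_names):
    """Return (name, handle) for the first loadable library, else (None, None)."""
    for name in lib_names:
        path = find_library(name)  # None when the linker can't locate it
        if path is None:
            continue
        try:
            return name, CDLL(path)
        except OSError:
            continue  # present on disk but not loadable
    return None, None

# On a Mali-only board the NVIDIA library does not resolve,
# so the probe falls through cleanly instead of raising.
name, handle = load_first_available(["definitely-not-a-real-lib"])
print(name, handle)  # None None
```

With a probe like this, the NVML path is only entered when the library actually resolves, and the Mali/generic fallback runs otherwise.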

stephanj commented 3 months ago

Claude Sonnet 3.5 response:

The issue is indeed that the code is trying to use NVIDIA-specific libraries on a non-NVIDIA system. The code needs modification to handle ARM64 platforms with Mali GPUs correctly. Here's how we can address this:

  1. First, we need to modify the linux_device_capabilities() function in exo/topology/device_capabilities.py to detect and handle Mali GPUs. We'll need to add a check for Mali GPUs before trying to use NVIDIA libraries.

  2. We'll use the pyudev library to detect the GPU. If you don't have it installed, you can install it with pip install pyudev.

Here's an updated version of the linux_device_capabilities() function:

import psutil
from tinygrad import Device
import pyudev

def linux_device_capabilities() -> DeviceCapabilities:
    if DEBUG >= 2: print(f"tinygrad {Device.DEFAULT=}")

    context = pyudev.Context()

    # Check for a Mali GPU. Mali kernel drivers typically expose a device
    # node such as /dev/mali0, so scan udev for any device whose sys_name
    # contains "mali" rather than filtering on a specific subsystem.
    mali_gpu = next((device for device in context.list_devices() if 'mali' in device.sys_name.lower()), None)
    if mali_gpu:
        gpu_name = mali_gpu.get('DEVNAME', 'Unknown Mali GPU')
        return DeviceCapabilities(
            model="Linux Box (ARM Mali)",
            chip=gpu_name,
            memory=psutil.virtual_memory().total // 2**20,  # shared system RAM, in MiB
            flops=DeviceFlops(fp32=100, fp16=200, int8=400)  # Placeholder values, adjust as needed
        )

    # Check for NVIDIA GPU
    try:
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        gpu_name = pynvml.nvmlDeviceGetName(handle)
        gpu_memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)

        if DEBUG >= 2: print(f"NVIDIA device {gpu_name=} {gpu_memory_info=}")

        return DeviceCapabilities(
            model=f"Linux Box ({gpu_name})",
            chip=gpu_name,
            memory=gpu_memory_info.total // 2**20,
            flops=CHIP_FLOPS.get(gpu_name, DeviceFlops(fp32=0, fp16=0, int8=0))
        )
    except ImportError:
        pass  # NVIDIA libraries not available
    except pynvml.NVMLError:
        pass  # NVIDIA GPU not found or error accessing it

    # Fallback for other devices
    return DeviceCapabilities(
        model=f"Linux Box (Device: {Device.DEFAULT})",
        chip=f"Unknown Chip (Device: {Device.DEFAULT})",
        memory=psutil.virtual_memory().total // 2**20,
        flops=DeviceFlops(fp32=0, fp16=0, int8=0)
    )

This updated function will:

  1. First check for a Mali GPU using pyudev.
  2. If a Mali GPU is found, it will return capabilities for that GPU.
  3. If no Mali GPU is found, it will try to detect an NVIDIA GPU (keeping the existing NVIDIA detection logic).
  4. If neither is found, it falls back to a generic device capability.
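The detection order above is just a chain of probes with a guaranteed fallback. As a minimal, dependency-free sketch of that pattern (all names here are hypothetical, for illustration only):

```python
# Hypothetical sketch of the probe-chain pattern: try each detector in
# order; return the first hit, with a fixed fallback when none match.
def detect(probes):
    for probe in probes:
        result = probe()
        if result is not None:
            return result
    return "unknown"

def mali_probe():
    return None  # pretend no Mali device node was found

def nvidia_probe():
    return None  # pretend NVML is unavailable

def generic_probe():
    return "Linux Box (generic)"

print(detect([mali_probe, nvidia_probe, generic_probe]))  # Linux Box (generic)
```

Structuring the function this way makes it easy to add further backends (e.g. AMD via ROCm SMI) without touching the existing probes.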

You'll need to add import pyudev at the top of the file.

Also, note that the FLOPS values for the Mali GPU are placeholders. You might want to adjust these based on the specific Mali GPU model you're using, or implement a more sophisticated detection method for Mali GPU capabilities.
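One way to make the per-model adjustment concrete is a small lookup table keyed by chip name, mirroring how the NVIDIA branch uses CHIP_FLOPS.get(...). A sketch, where every FLOPS number is an explicit placeholder rather than a measured figure:

```python
from dataclasses import dataclass

# Local stand-in for Exo's DeviceFlops so the sketch is self-contained.
@dataclass(frozen=True)
class DeviceFlops:
    fp32: float
    fp16: float
    int8: float

# Hypothetical per-model table. The numbers are PLACEHOLDERS, not measured
# values; replace them with figures for your chip (e.g. from the datasheet).
MALI_FLOPS = {
    "Mali-G610": DeviceFlops(fp32=100, fp16=200, int8=400),  # placeholder
    "Mali-G52":  DeviceFlops(fp32=50,  fp16=100, int8=200),  # placeholder
}

def mali_flops(chip_name: str) -> DeviceFlops:
    # Unrecognized models fall back to zeros, like CHIP_FLOPS.get(...) does.
    return MALI_FLOPS.get(chip_name, DeviceFlops(fp32=0, fp16=0, int8=0))

print(mali_flops("Mali-G999"))  # DeviceFlops(fp32=0, fp16=0, int8=0)
```

The Mali branch could then call mali_flops(gpu_name) instead of hard-coding one set of placeholder values.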

This change should allow the Exo project to run on your ARM64 platform with a Mali GPU without throwing the NVIDIA-related error. It will detect the Mali GPU and provide some basic information about it.

Remember to install pyudev:

pip install pyudev

After making these changes, the project should be able to start without the NVIDIA error on your ARM64 platform with a Mali GPU.

artistlu commented 3 months ago

The issue has been resolved. Could you please merge the changes? Thank you. @stephanj