XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0
4.56k stars, 144 forks

Support for AMD ROCm devices #123

Closed: Junyi-99 closed this 1 day ago

Junyi-99 commented 5 months ago

Issue Type

Description

I've implemented ROCm support in nvitop, enabling it to run on AMD GPUs. This feature has been tested on MI50, MI100, and MI210 machines, and nvitop keeps its full functionality on NVIDIA GPUs.

Motivation and Context

Really need nvitop on AMD GPUs.

74

Testing

Tested on:

* MI50
* MI100
* MI210

Images / Videos

MI100 (top: nvitop, bottom-left: rocm-smi, bottom-right: PyTorch code)

XuehaiPan commented 5 months ago

Hi @Junyi-99, thanks for the contribution! Is there any PyPI package that provides the ROCm-SMI bindings like nvidia-ml-py for the NVIDIA NVML library? Maybe we should ship the ROCm support with:

pip3 install nvitop[rocm]

Junyi-99 commented 5 months ago

Oh, I think it's a very good suggestion to ship through nvitop[rocm]. Currently, there is a ROCm binding, but it is not that functional.
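As a rough illustration of what an optional `nvitop[rocm]` extra implies at runtime, here is a minimal sketch that checks which binding modules are importable. It is not nvitop's actual code: the helper name `available_backends` and the backend-to-module mapping are assumptions made for illustration (`pynvml` is the module installed by `nvidia-ml-py`; `amdsmi` is the module name used by the AMD SMI Python package mentioned later in this thread).

```python
import importlib.util

# Hypothetical mapping from backend name to the Python module that provides it.
BACKEND_MODULES = {"nvidia": "pynvml", "rocm": "amdsmi"}


def available_backends(modules=BACKEND_MODULES):
    """Report which backend modules can be found, without importing them."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }


if __name__ == "__main__":
    for name, found in available_backends().items():
        print(f"{name}: {'available' if found else 'not installed'}")
```

A tool could use such a check to enable the ROCm code path only when the extra dependency is present, and fall back to NVIDIA-only behavior otherwise.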

hartmark commented 4 weeks ago

+1, I'd love to have this support. How is the development going?

kswain55 commented 4 weeks ago

+1 It would be great to have this for MI300X

unclemusclez commented 2 weeks ago

Trying this now with HF AutoTrain on an AMD Radeon 7900 XT (Navi 31, gfx1100), installed with `pip install git+https://github.com/XuehaiPan/nvitop.git`.

I still receive the errors:

Your installed package `nvidia-ml-py` is corrupted. Skip patch functions `nvmlDeviceGet{Compute,Graphics,MPSCompute}RunningProcesses`. You may get incorrect or incomplete results. Please consider reinstall package `nvidia-ml-py` via `pip3 install --force-reinstall nvidia-ml-py nvitop`.
Your installed package `nvidia-ml-py` is corrupted. Skip patch functions `nvmlDeviceGetMemoryInfo`. You may get incorrect or incomplete results. Please consider reinstall package `nvidia-ml-py` via `pip3 install --force-reinstall nvidia-ml-py nvitop`.

dmitrii-galantsev commented 2 days ago

@Junyi-99 Would it be possible to use the rocmsmi repo as a submodule instead? Are there any modifications beyond formatting?

Also please note that we're working on migration to AMDSMI and it would be much better long-term to use that :). ROCMSMI will eventually be deprecated.

In fact RDC migrated to amdsmi somewhat recently.

Cheers! -- Dev from SMI team at AMD.

dmitrii-galantsev commented 2 days ago

> Is there any PyPI package that provides the ROCm-SMI bindings like nvidia-ml-py for the NVIDIA NVML library?

@XuehaiPan This is planned for amdsmi :)

dmitrii-galantsev commented 2 days ago

some more info.

GPU: 1
    PROCESS_INFO:
        NAME: rvs
        PID: 468813
        MEMORY_USAGE:
            GTT_MEM: 2.1 MB
            CPU_MEM: 253.1 MB
            VRAM_MEM: 1.1 GB
        MEM_USAGE: 1.4 GB
        USAGE:
            GFX: 0 ns
            ENC: 0 ns
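Since the per-process report is structured by indentation, a monitoring tool could turn it into nested dictionaries. The following is a minimal sketch, assuming the `KEY: value` layout with four-space nesting seen in this thread; the function name `parse_amd_smi_process` is hypothetical, values are kept as strings, and a GPU with several processes would need a list instead of a single `PROCESS_INFO` entry.

```python
SAMPLE = """\
GPU: 0
    PROCESS_INFO:
        NAME: rvs
        PID: 468813
        MEMORY_USAGE:
            GTT_MEM: 2.1 MB
            CPU_MEM: 253.1 MB
            VRAM_MEM: 1.1 GB
        MEM_USAGE: 1.4 GB
        USAGE:
            GFX: 0 ns
            ENC: 0 ns

GPU: 1
    PROCESS_INFO:
        NAME: rvs
        PID: 468813
        MEMORY_USAGE:
            GTT_MEM: 2.1 MB
            CPU_MEM: 253.1 MB
            VRAM_MEM: 1.1 GB
        MEM_USAGE: 1.4 GB
        USAGE:
            GFX: 0 ns
            ENC: 0 ns
"""


def parse_amd_smi_process(text):
    """Parse indentation-nested `KEY: value` lines into one dict per GPU."""
    gpus = []
    stack = []  # (indent, dict) pairs for the sections currently open
    for raw in text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip())
        key, _, value = raw.strip().partition(":")
        value = value.strip()
        if key == "GPU" and indent == 0:
            node = {"GPU": int(value)}
            gpus.append(node)
            stack = [(0, node)]
            continue
        # Close every section that is not a parent of this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        if not stack:
            continue  # ignore lines that sit outside any GPU section
        parent = stack[-1][1]
        if value:
            parent[key] = value
        else:
            # A key with no value opens a nested section.
            child = {}
            parent[key] = child
            stack.append((indent, child))
    return gpus


if __name__ == "__main__":
    for gpu in parse_amd_smi_process(SAMPLE):
        info = gpu["PROCESS_INFO"]
        print(gpu["GPU"], info["NAME"], info["PID"], info["MEM_USAGE"])
```

Parsing CLI text is fragile across tool versions, so a Python binding is usually preferable when one is available.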

unclemusclez commented 2 days ago

Does this work on WSL2?

some more info.

* You can build and install amdsmi python package fairly easily.
# if on ubuntu get dependencies:
# sudo apt install git python3 python3-pip cmake clang build-essential pkg-config libdrm-dev
git clone https://github.com/ROCm/amdsmi &&
cd amdsmi &&
cmake -B build &&
make -C build -j $(nproc) &&
cd build/py-interface/python_package &&
python3 -m pip install .

Now you should be able to use the API: https://github.com/ROCm/amdsmi/tree/amd-staging/py-interface#usage

* `amd-smi process` returns some useful info. Here is me running [rocm-validation-suite](https://github.com/ROCm/ROCmValidationSuite/) in the background on dual NV21s:
$ amd-smi process
GPU: 0
    PROCESS_INFO:
        NAME: rvs
        PID: 468813
        MEMORY_USAGE:
            GTT_MEM: 2.1 MB
            CPU_MEM: 253.1 MB
            VRAM_MEM: 1.1 GB
        MEM_USAGE: 1.4 GB
        USAGE:
            GFX: 0 ns
            ENC: 0 ns

GPU: 1
    PROCESS_INFO:
        NAME: rvs
        PID: 468813
        MEMORY_USAGE:
            GTT_MEM: 2.1 MB
            CPU_MEM: 253.1 MB
            VRAM_MEM: 1.1 GB
        MEM_USAGE: 1.4 GB
        USAGE:
            GFX: 0 ns
            ENC: 0 ns
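Once the `amdsmi` Python package from the build steps above is installed, querying devices might look like the sketch below. This is an illustration under assumptions, not a verified integration: the function names (`amdsmi_init`, `amdsmi_get_processor_handles`, `amdsmi_get_gpu_vram_usage`, `amdsmi_shut_down`) and the `vram_used` / `vram_total` keys follow the amdsmi usage documentation as I understand it, and they may differ between releases. The wrapper `list_amd_gpus` is a hypothetical helper that returns an empty list when the package or driver is unavailable.

```python
def list_amd_gpus():
    """Return one dict per AMD GPU, or [] if amdsmi cannot be used."""
    try:
        import amdsmi  # optional dependency; may be missing or fail to load
    except Exception:
        return []

    gpus = []
    try:
        amdsmi.amdsmi_init()
        try:
            handles = amdsmi.amdsmi_get_processor_handles()
            for index, handle in enumerate(handles):
                # Assumed to return a dict with VRAM figures in MB.
                vram = amdsmi.amdsmi_get_gpu_vram_usage(handle)
                gpus.append(
                    {
                        "index": index,
                        "vram_used": vram["vram_used"],
                        "vram_total": vram["vram_total"],
                    }
                )
        finally:
            amdsmi.amdsmi_shut_down()
    except amdsmi.AmdSmiException:
        # Initialization or a query failed (for example, no amdgpu driver).
        return []
    return gpus


if __name__ == "__main__":
    devices = list_amd_gpus()
    if not devices:
        print("No AMD GPUs found, or amdsmi is not available")
    for device in devices:
        print(device)
```

Keeping the import inside the function lets the same code run unchanged on machines without AMD hardware.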

dmitrii-galantsev commented 1 day ago

@unclemusclez AFAIK, no. SMI needs access to the amdgpu driver. Rule of thumb: if `/sys/class/drm/card*/device/gpu_metrics` exists, SMI will work.
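That rule of thumb is easy to check from a script. Here is a minimal sketch using only the standard library; the helper names `amdgpu_metrics_files` and `smi_likely_works` are made up for illustration, and a matching file only suggests that SMI should work, it does not guarantee it.

```python
import glob

GPU_METRICS_PATTERN = "/sys/class/drm/card*/device/gpu_metrics"


def amdgpu_metrics_files(pattern=GPU_METRICS_PATTERN):
    """Return the sorted list of gpu_metrics files matching the sysfs pattern."""
    return sorted(glob.glob(pattern))


def smi_likely_works(pattern=GPU_METRICS_PATTERN):
    """Apply the rule of thumb: at least one gpu_metrics file must exist."""
    return bool(amdgpu_metrics_files(pattern))


if __name__ == "__main__":
    files = amdgpu_metrics_files()
    if files:
        print("gpu_metrics found, SMI should work:")
        for path in files:
            print("  " + path)
    else:
        print("No gpu_metrics file found; SMI is unlikely to work here.")
```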

Junyi-99 commented 1 day ago

@dmitrii-galantsev I'll try it this weekend.