NVIDIA / go-nvml

Go Bindings for the NVIDIA Management Library (NVML)
Apache License 2.0

GetComputeRunningProcesses does not work for multiple processes #75

Open saifhaq opened 1 year ago

saifhaq commented 1 year ago

Hey NVIDIA team,

When multiple processes are running on one GPU, the output of device.GetComputeRunningProcesses() is wrong. This is on CUDA 12.2, and this bug seems very similar to the closed bug here, which occurred on an earlier CUDA version.

nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA TITAN RTX               Off | 00000000:19:00.0 Off |                  N/A |
| 40%   44C    P2              61W / 280W |   4844MiB / 24576MiB |     13%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN RTX               Off | 00000000:1A:00.0 Off |                  N/A |
| 41%   26C    P8              15W / 280W |      6MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN RTX               Off | 00000000:67:00.0 Off |                  N/A |
| 41%   25C    P8              13W / 280W |      6MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA TITAN RTX               Off | 00000000:68:00.0 Off |                  N/A |
| 41%   24C    P8               3W / 280W |     14MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2395      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A    452612      C   python                                     2814MiB |
|    0   N/A  N/A   3632277      C   python                                     2020MiB |
|    1   N/A  N/A      2395      G   /usr/lib/xorg/Xorg                            4MiB |
|    2   N/A  N/A      2395      G   /usr/lib/xorg/Xorg                            4MiB |
|    3   N/A  N/A      2395      G   /usr/lib/xorg/Xorg                            8MiB |
|    3   N/A  N/A      2442      G   /usr/bin/gnome-shell                          3MiB |
+---------------------------------------------------------------------------------------+
//main.go
package main

import "fmt"
import "github.com/NVIDIA/go-nvml/pkg/nvml"

func main() {
    nvml.Init()
    device, _ := nvml.DeviceGetHandleByIndex(0)
    processInfos, _ := device.GetComputeRunningProcesses()
    for i, processInfo := range processInfos {
        fmt.Printf("\t[%2d] ProcessInfo: %v\n", i, processInfo)
    }
}
go run main.go

    [ 0] ProcessInfo: {3632277 2118123520 4294967295 4294967295}
    [ 1] ProcessInfo: {0 452612 2950692864 0}

The PID of process [1] is reported as 0, and the actual PID (452612) shows up in the usedGpuMemory field instead, so the fields appear shifted.

I also managed to test this on an A100, and the bug does not occur on that card with CUDA 12.2.
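For reference, here is a minimal caller-side sketch (not an official workaround) of how the shifted entries could be flagged: a corrupted record tends to report a Pid of 0 or a value that is not a live process, so cross-checking against /proc catches it. The looksCorrupted helper is hypothetical and only illustrates the check.

package main

import (
    "fmt"
    "os"

    "github.com/NVIDIA/go-nvml/pkg/nvml"
)

// looksCorrupted flags entries whose Pid is zero or does not correspond to a
// live process on this host (Linux-only check via /proc).
func looksCorrupted(pi nvml.ProcessInfo) bool {
    if pi.Pid == 0 {
        return true
    }
    _, err := os.Stat(fmt.Sprintf("/proc/%d", pi.Pid))
    return err != nil
}

func main() {
    if ret := nvml.Init(); ret != nvml.SUCCESS {
        panic(nvml.ErrorString(ret))
    }
    defer nvml.Shutdown()

    device, ret := nvml.DeviceGetHandleByIndex(0)
    if ret != nvml.SUCCESS {
        panic(nvml.ErrorString(ret))
    }
    processInfos, ret := device.GetComputeRunningProcesses()
    if ret != nvml.SUCCESS {
        panic(nvml.ErrorString(ret))
    }
    for i, pi := range processInfos {
        fmt.Printf("\t[%2d] ProcessInfo: %+v suspicious=%v\n", i, pi, looksCorrupted(pi))
    }
}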

ErmakovDmitriy commented 8 months ago

Hi, I can also observe this issue on my V100 with CUDA 12.2, while an A100 with CUDA 12.0 works well.

I tried to see what I actually get back from the nvmlDeviceGetComputeRunningProcesses_v3 call, so I changed the deviceGetComputeRunningProcesses_v3 function (which also needs "fmt" and "unsafe" in the package imports) to print the raw bytes, as below:

func deviceGetComputeRunningProcesses_v3(Device Device) ([]ProcessInfo, Return) {
    var InfoCount uint32 = 1 // Will be reduced upon returning
    for {
        Infos := make([]ProcessInfo, InfoCount)
        ret := nvmlDeviceGetComputeRunningProcesses_v3(Device, &InfoCount, &Infos[0])
        if ret == SUCCESS {
            fmt.Printf("### Start: deviceGetComputeRunningProcesses_v3: Start length %d\n", InfoCount)

            for i := 0; i < int(InfoCount); i++ {
                ptr := unsafe.Pointer(&Infos[i])
                vLen := int(unsafe.Sizeof(Infos[i]))
                v := (unsafe.Slice((*byte)(ptr), vLen))

                fmt.Printf("Info[%d], ptr %x, bytes_ptr %x, bytes_len %d\n", i, ptr, v, vLen)

                fmt.Printf("Info[%d]: %+v\n", i, v[:vLen])
            }

            ptr := unsafe.Pointer(&Infos[0])
            vLen := (int)(unsafe.Sizeof(Infos[0]))*int(InfoCount) + 16 // The A100 run used +8, but the A100 works either way
            v := unsafe.Slice((*byte)(ptr), vLen)

            fmt.Printf("All bytes, as is:")
            for i := 0; i < vLen; i++ {
                fmt.Printf("%#x, ", v[i])
            }
            fmt.Println("")

            fmt.Println("### End: deviceGetComputeRunningProcesses_v3\n")

            return Infos[:InfoCount], ret
        }
        if ret != ERROR_INSUFFICIENT_SIZE {
            return nil, ret
        }
        InfoCount *= 2
    }
}

It seems that there are an extra 8 bytes between the nvmlProcessInfo_t entries in the result returned on the V100 with CUDA 12.2; see below.

Layout (as it is in nvml.h):

typedef struct nvmlProcessInfo_st
{
    unsigned int        pid;                //!< Process ID Offset: 0 bytes, Size: 4 bytes (+4 bytes padding)
    unsigned long long  usedGpuMemory;      //!< Amount of used GPU memory in bytes. Offset: 8 bytes Size: 8 bytes
                                            //! Under WDDM, \ref NVML_VALUE_NOT_AVAILABLE is always reported
                                            //! because Windows KMD manages all the memory and not the NVIDIA driver
    unsigned int        gpuInstanceId;      //!< If MIG is enabled, stores a valid GPU instance ID. gpuInstanceId is set to
                                            //  0xFFFFFFFF otherwise.
                                            // Offset: 16 bytes Size: 4 bytes
    unsigned int        computeInstanceId;  //!< If MIG is enabled, stores a valid compute instance ID. computeInstanceId is set to
                                            //  0xFFFFFFFF otherwise.
                                           // Offset: 20 bytes Size: 4 bytes
} nvmlProcessInfo_t;

So the total size of the structure is supposed to be 24 bytes. Also, if I create a simple file on my x86_64 PC, include nvml.h, and write:

nvmlProcessInfo_t data[2];
int x = sizeof(data);

then the clangd language server shows (I did not try to compile it, though) that x is 48 bytes.
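The same 24-byte size (and 48 bytes for an array of two) can be confirmed from Go with a standalone sanity-check sketch; the processInfo struct below is a local mirror of the documented layout, not the go-nvml type:

package main

import (
    "fmt"
    "unsafe"
)

// processInfo mirrors the documented nvmlProcessInfo_t layout: a 4-byte pid
// (followed by 4 bytes of alignment padding), an 8-byte usedGpuMemory, and
// two 4-byte instance IDs.
type processInfo struct {
    Pid               uint32
    UsedGpuMemory     uint64
    GpuInstanceId     uint32
    ComputeInstanceId uint32
}

func main() {
    fmt.Println(unsafe.Sizeof(processInfo{}))    // 24
    fmt.Println(unsafe.Sizeof([2]processInfo{})) // 48, matching sizeof(data) above
}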

A100 CUDA 12.0 (working well):

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0     2    0   3547845      C   ...el-environment/bin/python      2.0GiB |
|    0     2    0   3548721      C   ...el-environment/bin/python      2.0GiB |

Output of the deviceGetComputeRunningProcesses_v3:

### Start: deviceGetComputeRunningProcesses_v3: Start length 2
Info[0], ptr c000a88720, bytes_ptr c52236000000000000002083000000000200000000000000, bytes_len 24
Info[0]: [197 34 54 0 0 0 0 0 0 0 32 131 0 0 0 0 2 0 0 0 0 0 0 0]
Info[1], ptr c000a88738, bytes_ptr 312636000000000000002083000000000200000000000000, bytes_len 24
Info[1]: [49 38 54 0 0 0 0 0 0 0 32 131 0 0 0 0 2 0 0 0 0 0 0 0]
All bytes, as is: ...
### End: deviceGetComputeRunningProcesses_v3

Here is the "All bytes" dump split up, with my comments:

First process info (24 bytes): 0xc5, 0x22, 0x36, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x20, 0x83, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
Second process info (24 bytes): 0x31, 0x26, 0x36, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x20, 0x83, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0

The values of the second process:
0x31, 0x26, 0x36, 0x0 - PID: (0x36 << (8*2)) + (0x26 << (8*1)) + 0x31 => 3548721
0x0, 0x0, 0x0, 0x0 - structure alignment padding
0x0, 0x0, 0x20, 0x83, 0x0, 0x0, 0x0, 0x0 - usedGpuMemory: (0x83 << (8*3)) + (0x20 << (8*2)) => 2199912448 bytes => 2.048828125 GiB
0x2, 0x0, 0x0, 0x0 - gpuInstanceId: 0x2 => 2 (matches the GI ID column)
0x0, 0x0, 0x0, 0x0 - computeInstanceId: 0x0 => 0 (matches the CI ID column)
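To make the byte arithmetic above easy to reproduce, here is a small standalone Go sketch (not go-nvml code; decodeProcessInfo is a helper introduced only for illustration) that decodes one 24-byte little-endian record according to the documented layout:

package main

import (
    "encoding/binary"
    "fmt"
)

// decodeProcessInfo interprets a 24-byte little-endian record using the
// documented nvmlProcessInfo_t layout: pid (4 bytes) + 4 bytes of padding,
// usedGpuMemory (8 bytes), gpuInstanceId (4 bytes), computeInstanceId (4 bytes).
func decodeProcessInfo(b []byte) (pid uint32, mem uint64, gi, ci uint32) {
    pid = binary.LittleEndian.Uint32(b[0:4])
    mem = binary.LittleEndian.Uint64(b[8:16])
    gi = binary.LittleEndian.Uint32(b[16:20])
    ci = binary.LittleEndian.Uint32(b[20:24])
    return
}

func main() {
    // The second A100 process from the dump above.
    second := []byte{
        0x31, 0x26, 0x36, 0x0, 0x0, 0x0, 0x0, 0x0,
        0x0, 0x0, 0x20, 0x83, 0x0, 0x0, 0x0, 0x0,
        0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    }
    pid, mem, gi, ci := decodeProcessInfo(second)
    fmt.Printf("pid=%d usedGpuMemory=%d gpuInstanceId=%d computeInstanceId=%d\n", pid, mem, gi, ci)
    // Prints: pid=3548721 usedGpuMemory=2199912448 gpuInstanceId=2 computeInstanceId=0
}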

Here is what I get from the V100 with CUDA 12.2:

V100 CUDA 12.2 (affected by the issue):

### Start: deviceGetComputeRunningProcesses_v3: Start length 2
Info[0], ptr c000ccb140, bytes_ptr ac553900000000000000608b00000000ffffffffffffffff, bytes_len 24
Info[0]: [172 85 57 0 0 0 0 0 0 0 96 139 0 0 0 0 255 255 255 255 255 255 255 255]
Info[1], ptr c000ccb158, bytes_ptr 000000000000000073573900000000000000607b00000000, bytes_len 24
Info[1]: [0 0 0 0 0 0 0 0 115 87 57 0 0 0 0 0 0 0 96 123 0 0 0 0]
All bytes, as is: 0xac, 0x55, 0x39, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x60, 0x8b, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x73, 0x57, 0x39, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x60, 0x7b, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 
### End: deviceGetComputeRunningProcesses_v3

All bytes (split into chunks):

First process info (24 bytes): 0xac, 0x55, 0x39, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x60, 0x8b, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
Then what looks like an extra, undocumented 8 bytes of padding between the first and the second process: 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
Second process info (24 bytes): 0x73, 0x57, 0x39, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x60, 0x7b, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
And another empty 8 bytes: 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0

If we take the undocumented 8 bytes of padding into account (i.e. skip it), then for the second process:
0x73, 0x57, 0x39, 0x0 - PID: (0x39 << (8*2)) + (0x57 << (8*1)) + 0x73 => 3757939
0x0, 0x0, 0x0, 0x0 - structure alignment padding
0x0, 0x0, 0x60, 0x7b, 0x0, 0x0, 0x0, 0x0 - usedGpuMemory: (0x7b << (8*3)) + (0x60 << (8*2)) => 2069889024 bytes => 1.927734375 GiB (roughly the expected value)
0xff, 0xff, 0xff, 0xff - gpuInstanceId: nvml.h says the value is 0xFFFFFFFF when MIG is not enabled
0xff, 0xff, 0xff, 0xff - computeInstanceId: likewise 0xFFFFFFFF when MIG is not enabled
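Assuming the V100 / CUDA 12.2 driver really writes 32-byte records (the 24-byte struct plus 8 undocumented trailing bytes), the raw dump above decodes cleanly with a 32-byte stride. The stride is an assumption derived from this observation, not a documented layout; the sketch below is standalone and only illustrates the decoding:

package main

import (
    "encoding/binary"
    "fmt"
)

func main() {
    // The "All bytes" dump from the V100 / CUDA 12.2 run above.
    raw := []byte{
        0xac, 0x55, 0x39, 0x0, 0x0, 0x0, 0x0, 0x0,
        0x0, 0x0, 0x60, 0x8b, 0x0, 0x0, 0x0, 0x0,
        0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
        0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
        0x73, 0x57, 0x39, 0x0, 0x0, 0x0, 0x0, 0x0,
        0x0, 0x0, 0x60, 0x7b, 0x0, 0x0, 0x0, 0x0,
        0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
        0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    }

    // Assumed record size: 24-byte nvmlProcessInfo_t plus 8 undocumented bytes.
    const stride = 32
    for i := 0; i+stride <= len(raw); i += stride {
        rec := raw[i : i+stride]
        fmt.Printf("pid=%d usedGpuMemory=%d gpuInstanceId=%#x computeInstanceId=%#x\n",
            binary.LittleEndian.Uint32(rec[0:4]),
            binary.LittleEndian.Uint64(rec[8:16]),
            binary.LittleEndian.Uint32(rec[16:20]),
            binary.LittleEndian.Uint32(rec[20:24]))
    }
    // Prints pid=3757484 and pid=3757939 with usedGpuMemory 2338324480 and 2069889024.
}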

I don't see any obvious solution for the issue for now.

ErmakovDmitriy commented 8 months ago

An interesting observation:

When I commented out the deviceGetComputeRunningProcesses_v3 symbol lookup ( https://github.com/NVIDIA/go-nvml/blob/v0.12.0-1/pkg/nvml/init.go#L190 ) so that the application falls back to deviceGetComputeRunningProcesses_v2, the result is correct:

### Start: deviceGetComputeRunningProcesses_v2: Start length 2
Info[0], ptr c0000f0a80, bytes_ptr 281d2a00000000000000608b00000000ffffffffffffffff, bytes_len 24
Info[0]: [40 29 42 0 0 0 0 0 0 0 96 139 0 0 0 0 255 255 255 255 255 255 255 255]
Info[1], ptr c0000f0a98, bytes_ptr f8262a00000000000000604b00000000ffffffffffffffff, bytes_len 24
Info[1]: [248 38 42 0 0 0 0 0 0 0 96 75 0 0 0 0 255 255 255 255 255 255 255 255]
All bytes, as is:0x28, 0x1d, 0x2a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x60, 0x8b, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xf8, 0x26, 0x2a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x60, 0x4b, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,  (ignore the last 16 zeroed bytes)
### End: deviceGetComputeRunningProcesses_v2

and the processes are detected as expected:

|=======================================================================================|
|    0   N/A  N/A   2759976      C   .../hdf5-kernel-environment/bin/python      2.2GiB |
|    0   N/A  N/A   2762488      C   .../hdf5-kernel-environment/bin/python      1.2GiB |
+---------------------------------------------------------------------------------------+

So it looks like there is an issue with deviceGetComputeRunningProcesses_v3 on the V100 with CUDA 12.2. The issue might be caused by the CUDA version alone or by its combination with this GPU model; I can't tell, as I don't have an opportunity to test different CUDA versions or different NVIDIA driver versions on the same host.

ErmakovDmitriy commented 7 months ago

We upgraded the NVIDIA drivers on our V100 servers, and it seems that this fixed the issue. The upgrade was from 535.54.03 to 535.154.05, with CUDA 12.2.
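Since the fix apparently came with a newer driver, it can be useful to log the driver and NVML versions alongside the process list when debugging this. A minimal sketch using the existing System* calls:

package main

import (
    "fmt"

    "github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
    if ret := nvml.Init(); ret != nvml.SUCCESS {
        panic(nvml.ErrorString(ret))
    }
    defer nvml.Shutdown()

    // Record which driver/NVML combination the results were obtained with.
    driver, _ := nvml.SystemGetDriverVersion()
    nvmlVersion, _ := nvml.SystemGetNVMLVersion()
    fmt.Printf("driver=%s nvml=%s\n", driver, nvmlVersion)
}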