Open saifhaq opened 1 year ago
Hi, I can also observe this issue on my V100 with CUDA 12.2, while an A100 with CUDA 12.0 works fine.
I tried to "see" what I get from the nvmlDeviceGetComputeRunningProcesses_v3 call. I changed the deviceGetComputeRunningProcesses_v3 function to print the raw bytes, as below:
func deviceGetComputeRunningProcesses_v3(Device Device) ([]ProcessInfo, Return) {
	var InfoCount uint32 = 1 // Will be reduced upon returning
	for {
		Infos := make([]ProcessInfo, InfoCount)
		ret := nvmlDeviceGetComputeRunningProcesses_v3(Device, &InfoCount, &Infos[0])
		if ret == SUCCESS {
			fmt.Printf("### Start: deviceGetComputeRunningProcesses_v3: Start length %d\n", InfoCount)
			for i := 0; i < int(InfoCount); i++ {
				ptr := unsafe.Pointer(&Infos[i])
				vLen := int(unsafe.Sizeof(Infos[i]))
				v := unsafe.Slice((*byte)(ptr), vLen)
				fmt.Printf("Info[%d], ptr %x, bytes_ptr %x, bytes_len %d\n", i, ptr, v, vLen)
				fmt.Printf("Info[%d]: %+v\n", i, v[:vLen])
			}
			ptr := unsafe.Pointer(&Infos[0])
			vLen := int(unsafe.Sizeof(Infos[0]))*int(InfoCount) + 16 // Run for A100 was with +8, but the A100 works anyway
			v := unsafe.Slice((*byte)(ptr), vLen)
			fmt.Printf("All bytes, as is: ")
			for i := 0; i < vLen; i++ {
				fmt.Printf("%#x, ", v[i])
			}
			fmt.Println("")
			fmt.Println("### End: deviceGetComputeRunningProcesses_v3")
			return Infos[:InfoCount], ret
		}
		if ret != ERROR_INSUFFICIENT_SIZE {
			return nil, ret
		}
		InfoCount *= 2
	}
}
It seems that there are an extra 8 bytes between nvmlProcessInfo_t structs in the result returned on the V100 with CUDA 12.2, see below.
Layout (as it is in nvml.h):
typedef struct nvmlProcessInfo_st
{
    unsigned int pid;                 //!< Process ID. Offset: 0 bytes, Size: 4 bytes (+4 bytes padding)
    unsigned long long usedGpuMemory; //!< Amount of used GPU memory in bytes. Offset: 8 bytes, Size: 8 bytes
                                      //!< Under WDDM, \ref NVML_VALUE_NOT_AVAILABLE is always reported
                                      //!< because Windows KMD manages all the memory and not the NVIDIA driver
    unsigned int gpuInstanceId;       //!< If MIG is enabled, stores a valid GPU instance ID; gpuInstanceId is set to
                                      //   0xFFFFFFFF otherwise. Offset: 16 bytes, Size: 4 bytes
    unsigned int computeInstanceId;   //!< If MIG is enabled, stores a valid compute instance ID; computeInstanceId is set to
                                      //   0xFFFFFFFF otherwise. Offset: 20 bytes, Size: 4 bytes
} nvmlProcessInfo_t;
So the total size of the structure is supposed to be 24 bytes. Also, if I create a simple file on my x86_64 PC, include nvml.h, and write:
nvmlProcessInfo_t data[2];
int x = sizeof(data);
then the clangd language server shows (I did not try to compile it, though) that x is 48.
A100, CUDA 12.0 (working well):
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 2 0 3547845 C ...el-environment/bin/python 2.0GiB |
| 0 2 0 3548721 C ...el-environment/bin/python 2.0GiB |
Output of the deviceGetComputeRunningProcesses_v3 call:
### Start: deviceGetComputeRunningProcesses_v3: Start length 2
Info[0], ptr c000a88720, bytes_ptr c52236000000000000002083000000000200000000000000, bytes_len 24
Info[0]: [197 34 54 0 0 0 0 0 0 0 32 131 0 0 0 0 2 0 0 0 0 0 0 0]
Info[1], ptr c000a88738, bytes_ptr 312636000000000000002083000000000200000000000000, bytes_len 24
Info[1]: [49 38 54 0 0 0 0 0 0 0 32 131 0 0 0 0 2 0 0 0 0 0 0 0]
All bytes, as is: ...
### End: deviceGetComputeRunningProcesses_v3
Here is a split of the All bytes with my comments:
First process info: 24 bytes: 0xc5, 0x22, 0x36, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x20, 0x83, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
Second process info: 24 bytes: 0x31, 0x26, 0x36, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x20, 0x83, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
The values of the second process:
0x31, 0x26, 0x36, 0x0,
- PID, (0x36 << (8*2)) + (0x26 << (8*1)) + 0x31
=> 3548721
0x0, 0x0, 0x0, 0x0,
- structure alignment padding
0x0, 0x0, 0x20, 0x83, 0x0, 0x0, 0x0, 0x0,
- usedGpuMemory, (0x83<<8*3) + (0x20 << 8*2)
=> 2199912448 => 2.048828125 GiB
0x2, 0x0, 0x0, 0x0,
- gpuInstanceId, 0x2 => 2 (see GI ID)
0x0, 0x0, 0x0, 0x0,
- computeInstanceId, 0x0 => 0 (CI ID)
Here is what I get from the V100 with CUDA 12.2:
### Start: deviceGetComputeRunningProcesses_v3: Start length 2
Info[0], ptr c000ccb140, bytes_ptr ac553900000000000000608b00000000ffffffffffffffff, bytes_len 24
Info[0]: [172 85 57 0 0 0 0 0 0 0 96 139 0 0 0 0 255 255 255 255 255 255 255 255]
Info[1], ptr c000ccb158, bytes_ptr 000000000000000073573900000000000000607b00000000, bytes_len 24
Info[1]: [0 0 0 0 0 0 0 0 115 87 57 0 0 0 0 0 0 0 96 123 0 0 0 0]
All bytes, as is: 0xac, 0x55, 0x39, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x60, 0x8b, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x73, 0x57, 0x39, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x60, 0x7b, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
### End: deviceGetComputeRunningProcesses_v3
All bytes (split to chunks):
First process info: 24 bytes: 0xac, 0x55, 0x39, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x60, 0x8b, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
Looks like there are an extra 8 bytes of undocumented padding between the first process and the second process: 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
Second process info: 24 bytes: 0x73, 0x57, 0x39, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x60, 0x7b, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
And another empty 8 bytes: 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
If we take into account the "undocumented 8 bytes padding" (skip it), then:
For the second process:
0x73, 0x57, 0x39, 0x0,
- PID, (0x73 << (8*2)) + (0x57 << (8*1)) + 0x39
=> 7558969
0x0, 0x0, 0x0, 0x0,
- structure alignment padding
0x0, 0x0, 0x60, 0x7b, 0x0, 0x0, 0x0, 0x0,
- usedGpuMemory, (0x7b<<8*3) + (0x60 << 8*2)
=> 2069889024 => 1.927734375 GiB (roughly the expected value)
0xff, 0xff, 0xff, 0xff,
- gpuInstanceId; nvml.h says that when MIG is not enabled, the value is 0xFFFFFFFF
0xff, 0xff, 0xff, 0xff,
- computeInstanceId; nvml.h says that when MIG is not enabled, the value is 0xFFFFFFFF
I don't see an obvious solution to the issue for now.
An interesting observation:
When I commented out the deviceGetComputeRunningProcesses_v3 symbol lookup ( https://github.com/NVIDIA/go-nvml/blob/v0.12.0-1/pkg/nvml/init.go#L190 ) so that the application falls back to deviceGetComputeRunningProcesses_v2, the result is correct:
### Start: deviceGetComputeRunningProcesses_v2: Start length 2
Info[0], ptr c0000f0a80, bytes_ptr 281d2a00000000000000608b00000000ffffffffffffffff, bytes_len 24
Info[0]: [40 29 42 0 0 0 0 0 0 0 96 139 0 0 0 0 255 255 255 255 255 255 255 255]
Info[1], ptr c0000f0a98, bytes_ptr f8262a00000000000000604b00000000ffffffffffffffff, bytes_len 24
Info[1]: [248 38 42 0 0 0 0 0 0 0 96 75 0 0 0 0 255 255 255 255 255 255 255 255]
All bytes, as is: 0x28, 0x1d, 0x2a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x60, 0x8b, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xf8, 0x26, 0x2a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x60, 0x4b, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, (ignore the last 16 zeroed bytes)
### End: deviceGetComputeRunningProcesses_v2
and the processes are detected as expected:
|=======================================================================================|
| 0 N/A N/A 2759976 C .../hdf5-kernel-environment/bin/python 2.2GiB |
| 0 N/A N/A 2762488 C .../hdf5-kernel-environment/bin/python 1.2GiB |
+---------------------------------------------------------------------------------------+
So it looks like there is an issue with deviceGetComputeRunningProcesses_v3 on V100 + CUDA 12.2.
It might be that the issue is caused by the CUDA version, or by the combination of it with the GPU model. I can't tell, since I don't have the opportunity to test different CUDA versions on the same host, or different NVIDIA driver versions.
We upgraded the NVIDIA drivers on our servers with V100s, and it seems that fixed the issue. The upgrade was from 535.54.03 to 535.154.05, with CUDA 12.2.
Hey NVIDIA team,
When multiple processes are on one GPU, the output of device.GetComputeRunningProcesses() is wrong. This is on CUDA 12.2, and this bug seems very similar to the closed bug here, which occurred on an earlier CUDA version. The PID of the process [1] is shown as 0, and the actual PID appears in the GPU memory usage field.
I also managed to test this on an A100, and the bug does not happen on that card with CUDA 12.2.