NVIDIA / go-nvml

Go Bindings for the NVIDIA Management Library (NVML)
Apache License 2.0

nvmlDeviceGetMPSComputeRunningProcesses_v2 api is missing #28

Open qisikai opened 2 years ago

qisikai commented 2 years ago

desc

I want to know whether it's possible to call the nvmlDeviceGetMPSComputeRunningProcesses_v2 API using go-nvml, and if not, when it can be supported.

Thanks.

doc

https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g02098e9876e3fb86eeb9cac2222e5b5d

XuehaiPan commented 2 years ago

The nvml.h version in this repository is from CUDA 11.2.2. It does not have the definition of function nvmlDeviceGetMPSComputeRunningProcesses_v2.

klueska commented 2 years ago

Yes. We are a bit behind in keeping go-nvml in sync with the latest NVML release. We hope to have a new version out next week.

XuehaiPan commented 2 years ago

We are a bit behind in keeping go-nvml in sync with the latest NVML release.

The latest devel image of CUDA on Docker Hub is CUDA 11.4.2 (we use the Docker image to update nvml.h), which is also behind the latest CUDA release (11.5).

qisikai commented 2 years ago

That's great, thanks

XuehaiPan commented 2 years ago

This issue is resolved by PR #38.

qisikai commented 2 years ago

When I use the latest master branch, I get the following (mp[1]'s Pid is wrong):

deviceGetMPSComputeRunningProcesses_v1 called
mp[0] = {Pid:4126076 UsedGpuMemory:2928672768}
mp[1] = {Pid:4294967295 UsedGpuMemory:4126048}

It works when I change ProcessInfo_v1's definition from

type ProcessInfo_v1 struct {
  Pid           uint32
  UsedGpuMemory uint64
}

to

type ProcessInfo_v1 struct {
  Pid               uint32
  UsedGpuMemory     uint64
  GpuInstanceId     uint32
  ComputeInstanceId uint32
}

deviceGetMPSComputeRunningProcesses_v1 called
p[0] = {Pid:3436170 UsedGpuMemory:5420089344 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
p[1] = {Pid:3436156 UsedGpuMemory:5420089344 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}

Hi, PR #38 didn't work. Please help, thanks. @klueska @XuehaiPan

Environment: Tesla T4 + 450.80 / 450.142
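The wrong Pid in mp[1] is consistent with the library writing 24-byte records while the caller reads them back as 16-byte records. The following is a minimal simulation of that mismatch, not a call into NVML: it assumes the usual x86-64 C layout (4 bytes of padding after Pid) and little-endian byte order, and the helper names encodeV2 and decodeAsV1 are hypothetical. The second process's Pid is inferred from the output above, and its memory value is made up for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// processInfoV1 mirrors the two-field definition quoted above.
type processInfoV1 struct {
	Pid           uint32
	UsedGpuMemory uint64
}

// processInfoV2 mirrors the four-field definition quoted above.
type processInfoV2 struct {
	Pid               uint32
	UsedGpuMemory     uint64
	GpuInstanceId     uint32
	ComputeInstanceId uint32
}

const (
	sizeV1 = 16 // Pid(4) + padding(4) + UsedGpuMemory(8)
	sizeV2 = 24 // sizeV1 + GpuInstanceId(4) + ComputeInstanceId(4)
)

// encodeV2 writes records the way a library using the four-field struct
// would lay them out in the caller's buffer (assumed x86-64 C layout).
func encodeV2(recs []processInfoV2) []byte {
	buf := make([]byte, len(recs)*sizeV2)
	for i, r := range recs {
		b := buf[i*sizeV2:]
		binary.LittleEndian.PutUint32(b[0:], r.Pid)
		binary.LittleEndian.PutUint64(b[8:], r.UsedGpuMemory)
		binary.LittleEndian.PutUint32(b[16:], r.GpuInstanceId)
		binary.LittleEndian.PutUint32(b[20:], r.ComputeInstanceId)
	}
	return buf
}

// decodeAsV1 reads n records from buf using the two-field stride, which is
// what a caller holding the shorter struct definition effectively does.
func decodeAsV1(buf []byte, n int) []processInfoV1 {
	out := make([]processInfoV1, n)
	for i := range out {
		b := buf[i*sizeV1:]
		out[i].Pid = binary.LittleEndian.Uint32(b[0:])
		out[i].UsedGpuMemory = binary.LittleEndian.Uint64(b[8:])
	}
	return out
}

func main() {
	// Illustrative values only; 0xFFFFFFFF is 4294967295.
	written := []processInfoV2{
		{Pid: 4126076, UsedGpuMemory: 2928672768, GpuInstanceId: 0xFFFFFFFF, ComputeInstanceId: 0xFFFFFFFF},
		{Pid: 4126048, UsedGpuMemory: 2928672768, GpuInstanceId: 0xFFFFFFFF, ComputeInstanceId: 0xFFFFFFFF},
	}
	for i, p := range decodeAsV1(encodeV2(written), len(written)) {
		fmt.Printf("mp[%d] = %+v\n", i, p)
	}
}
```

Under these assumptions, record 0 decodes correctly, while record 1 picks up record 0's GpuInstanceId as its Pid and the real second Pid as its UsedGpuMemory, which matches the shape of the output reported above.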

klueska commented 2 years ago

This is obviously unexpected, and glancing at the code, it's not clear to me why / how this would happen.

Can you show me the output of the following on your machine:

$ objdump -D /usr/lib/x86_64-linux-gnu/libnvidia-ml.so | grep nvmlDeviceGetMPSComputeRunningProcesses
qisikai commented 2 years ago

objdump -D /usr/lib/x86_64-linux-gnu/libnvidia-ml.so | grep nvmlDeviceGetMPSComputeRunningProcesses

objdump -D /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 | grep nvmlDeviceGetMPSComputeRunningProcesses
000000000004cbd0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base>:
4cbf9:       0f 8f d1 01 00 00       jg     4cdd0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x200>
4cc08:       0f 84 9a 00 00 00       je     4cca8 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0xd8>
4cc12:       7f 14                   jg     4cc28 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x58>
4ccbd:       0f 84 7d 00 00 00       je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
4cccb:       74 73                   je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
4ccd4:       75 6a                   jne    4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
4ccdc:       0f 84 6e 01 00 00       je     4ce50 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x280>
4cce5:       0f 84 d5 01 00 00       je     4cec0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2f0>
4ccee:       0f 84 cc 01 00 00       je     4cec0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2f0>
4ccfa:       0f 84 ca 01 00 00       je     4ceca <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2fa>
4cd0c:       74 32                   je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
4cd15:       75 29                   jne    4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
4cd1e:       74 20                   je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
4cd49:       0f 8e c5 fe ff ff       jle    4cc14 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x44>
4ce41:       e9 b9 fd ff ff          jmpq   4cbff <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2f>
4ce59:       0f 8e e1 fe ff ff       jle    4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
4ceb7:       e9 84 fe ff ff          jmpq   4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
4cec5:       e9 76 fe ff ff          jmpq   4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
4cee4:       e9 57 fe ff ff          jmpq   4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>

klueska commented 2 years ago

Interesting. I wouldn't have expected the R450 driver (i.e. NVML 11.0) to have a symbol defined for nvmlDeviceGetMPSComputeRunningProcesses because it doesn't appear in the NVML header file until R470 (i.e. NVML version 11.4).

Can you verify that things work as expected for the other similar functions, i.e. DeviceGetComputeRunningProcesses and DeviceGetGraphicsRunningProcesses, or do these have the same problem? If these are working but nvmlDeviceGetMPSComputeRunningProcesses is not, then my assumption below is likely true.

What I think is going on is that the binary for libnvidia-ml.so with NVML 11.0 actually had the symbol for nvmlDeviceGetMPSComputeRunningProcesses compiled into it even though it wasn't available in the NVML header for this version.

And I'm guessing that it already operated on the v2 version of the process struct even though it didn't explicitly have a v2 in its function name. Since it wasn't officially part of the header yet, people (in theory) shouldn't have known about it or be able to use it.

Then when NVML 11.4 came out, it was officially added to the API, but only as a v2 function (since it clearly operates on the v2 struct). It was OK to not "backport" support for the v1 struct into the original, unversioned function, because it was never officially supported. However, the original unversioned function is still present in older versions of the driver and is now being picked up by our go-nvml library (which is supposed to be usable on all versions of NVML 11.0+).

As such, we need to tell this original, unversioned function to actually operate on the v2 struct, even though that breaks the pattern from the other, similar functions (which were officially supported prior to the introduction of the v2 struct).

qisikai commented 2 years ago

@klueska DeviceGetComputeRunningProcesses and DeviceGetGraphicsRunningProcesses from the latest master branch work as expected (and DeviceGetComputeRunningProcesses is equivalent to deviceGetComputeRunningProcesses_v2).

I also agree with this point of view: "And I'm guessing that it already operated on the v2 version of the process struct even though it didn't explicitly have a v2 in its function name."

XuehaiPan commented 2 years ago

Then when NVML 11.4 came out, it was officially added to the API, but only as a v2 function (since it clearly operates on the v2 struct). It was OK to not "backport" support for the v1 struct into the original, unversioned function, because it was never officially supported. However, the original unversioned function is still present in older versions of the driver and is now being picked up by our go-nvml library (which is supposed to be usable on all versions of NVML 11.0+).

In nvml.h (NVML 11.6 at master branch), the unversioned nvmlDeviceGetMPSComputeRunningProcesses uses struct nvmlProcessInfo_v1_t:

https://github.com/NVIDIA/go-nvml/blob/c3a16a2b07cf2251cbedb76fa68c9292b22bfa06/pkg/nvml/nvml.h#L8420-L8425

klueska commented 2 years ago

Right -- so I'm thinking that old driver versions (where the API wasn't published) were actually buggy by operating on the v2 struct and then when they published the API, they retroactively went back and "fixed" things for it to use the v1 struct. Not sure though -- will need to check internally.

klueska commented 2 years ago

As I suspected I got confirmation from the NVML team that this is what happened. The function nvmlDeviceGetMPSComputeRunningProcesses was never meant to be exposed in the libnvidia-ml.so.1 binary until the R470 driver. It's apparently common for "internal" functions like this to make their way into the binary (for testing and ease of merging code into the code base which is still under development), but normally these "internal" functions have their names mangled so as not to interfere with the real API once it is released.

So what I said before is exactly what happened -- there was a hidden version of nvmlDeviceGetMPSComputeRunningProcesses in R450 that operated on nvmlProcessInfo_v2_t structs, but since it was never meant to be exposed they "fixed" it to operate on nvmlProcessInfo_v1_t structs when it was officially released in R470 alongside a nvmlDeviceGetMPSComputeRunningProcesses_v2 API which operates on v2 structs.

So the short of it is that nvmlDeviceGetMPSComputeRunningProcesses is not intended to be available on your R450 driver, so you should not expect to be using it there. In the coming days I will merge something to master that prevents this function from being visible in drivers prior to R470.
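
One way such a guard could look is a check on the driver branch before the function is exposed. This is only a sketch under assumptions, not the change that was merged into go-nvml: the helpers driverMajor and mpsProcessesSupported are hypothetical, the version strings in main are examples, and in a real program the string would come from NVML's driver version query.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minMPSProcessesDriver is the first driver branch in which, per the
// discussion above, the function is officially part of the API.
const minMPSProcessesDriver = 470

// driverMajor extracts the branch number from a driver version string
// such as "450.80.02".
func driverMajor(version string) (int, error) {
	parts := strings.SplitN(strings.TrimSpace(version), ".", 2)
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, fmt.Errorf("unrecognized driver version %q: %w", version, err)
	}
	return major, nil
}

// mpsProcessesSupported reports whether the MPS process query should be
// made visible for the given driver version.
func mpsProcessesSupported(version string) (bool, error) {
	major, err := driverMajor(version)
	if err != nil {
		return false, err
	}
	return major >= minMPSProcessesDriver, nil
}

func main() {
	// Example version strings, not read from a real driver.
	for _, v := range []string{"450.80.02", "450.142.00", "470.57.02"} {
		ok, err := mpsProcessesSupported(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("driver %s: MPS process query exposed = %v\n", v, ok)
	}
}
```

Gating on the driver branch rather than on symbol presence matters here, because the R450 binary does export the symbol even though it is not part of that release's API.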

qisikai commented 2 years ago

How about this idea: for drivers older than R470 (such as 450.80), use the v2 variant of nvmlDeviceGetMPSComputeRunningProcesses; for drivers >= R470, keep the current logic.

I find that nvidia-smi with the 450.80 driver gets the right result from nvmlDeviceGetMPSComputeRunningProcesses, so go-nvml needs to provide a way to do the same thing.