Open qisikai opened 2 years ago
The nvml.h version in this repository is from CUDA 11.2.2. It does not have the definition of function nvmlDeviceGetMPSComputeRunningProcesses_v2.
Yes. We are a bit behind in keeping go-nvml in sync with the latest NVML release. We hope to have a new version out next week.
We are a bit behind in keeping go-nvml in sync with the latest NVML release.
The latest devel image of CUDA on Docker Hub is CUDA 11.4.2 (we use the Docker image to update nvml.h), which is also behind the latest CUDA release (11.5).
That's great, thanks
This issue is resolved by PR #38.
When I use the latest master branch, I get the following (mp[1]'s Pid is wrong):
deviceGetMPSComputeRunningProcesses_v1 called
mp[0] = {Pid:4126076 UsedGpuMemory:2928672768}
mp[1] = {Pid:4294967295 UsedGpuMemory:4126048}
It works when I change ProcessInfo_v1's definition from
type ProcessInfo_v1 struct {
Pid uint32
UsedGpuMemory uint64
}
to
type ProcessInfo_v1 struct {
Pid uint32
UsedGpuMemory uint64
GpuInstanceId uint32
ComputeInstanceId uint32
}
deviceGetMPSComputeRunningProcesses_v1 called
p[0] = {Pid:3436170 UsedGpuMemory:5420089344 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
p[1] = {Pid:3436156 UsedGpuMemory:5420089344 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Hi, PR #38 didn't work. Please help, thanks @klueska @XuehaiPan. Environment: Tesla T4 + 450.80 / 450.142
This is obviously unexpected, and glancing at the code, it's not clear to me why / how this would happen.
Can you show me the output of the following on your machine:
$ objdump -D /usr/lib/x86_64-linux-gnu/libnvidia-ml.so | grep nvmlDeviceGetMPSComputeRunningProcesses
$ objdump -D /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 | grep nvmlDeviceGetMPSComputeRunningProcesses
000000000004cbd0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base>:
   4cbf9: 0f 8f d1 01 00 00    jg     4cdd0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x200>
   4cc08: 0f 84 9a 00 00 00    je     4cca8 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0xd8>
   4cc12: 7f 14                jg     4cc28 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x58>
   4ccbd: 0f 84 7d 00 00 00    je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cccb: 74 73                je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4ccd4: 75 6a                jne    4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4ccdc: 0f 84 6e 01 00 00    je     4ce50 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x280>
   4cce5: 0f 84 d5 01 00 00    je     4cec0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2f0>
   4ccee: 0f 84 cc 01 00 00    je     4cec0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2f0>
   4ccfa: 0f 84 ca 01 00 00    je     4ceca <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2fa>
   4cd0c: 74 32                je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cd15: 75 29                jne    4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cd1e: 74 20                je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cd49: 0f 8e c5 fe ff ff    jle    4cc14 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x44>
   4ce41: e9 b9 fd ff ff       jmpq   4cbff <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2f>
   4ce59: 0f 8e e1 fe ff ff    jle    4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4ceb7: e9 84 fe ff ff       jmpq   4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cec5: e9 76 fe ff ff       jmpq   4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cee4: e9 57 fe ff ff       jmpq   4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
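For anyone who wants to run the same check without objdump, here is a rough Go sketch that lists the symbols a shared library exports using the standard debug/elf package. It is only an illustration: the helper names exportedSymbols and filterSymbols are made up for this example and are not part of go-nvml, and the library path is the one from the command above, which may differ on other systems.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
	"strings"
)

// filterSymbols returns the names that start with prefix.
// (Hypothetical helper, not part of go-nvml.)
func filterSymbols(names []string, prefix string) []string {
	var out []string
	for _, n := range names {
		if strings.HasPrefix(n, prefix) {
			out = append(out, n)
		}
	}
	return out
}

// exportedSymbols reads the dynamic symbol table of an ELF shared library
// and returns the names of the symbols it defines (imports are skipped).
func exportedSymbols(path string) ([]string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	syms, err := f.DynamicSymbols()
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(syms))
	for _, s := range syms {
		if s.Section == elf.SHN_UNDEF {
			continue // undefined here means imported from another library
		}
		names = append(names, s.Name)
	}
	return names, nil
}

func main() {
	// Path taken from the objdump command above; adjust for your system.
	path := "/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1"
	names, err := exportedSymbols(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read symbols:", err)
		return
	}
	for _, n := range filterSymbols(names, "nvmlDeviceGetMPSComputeRunningProcesses") {
		fmt.Println(n)
	}
}
```

If only the unversioned name is printed and no _v2 variant, the driver predates the official release of this API.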
Interesting. I wouldn't have expected the R450 driver (i.e. NVML 11.0) to have a symbol defined for nvmlDeviceGetMPSComputeRunningProcesses, because it doesn't appear in the NVML header file until R470 (i.e. NVML version 11.4).
Can you verify that things work as expected for the other similar functions, i.e. DeviceGetComputeRunningProcesses and DeviceGetGraphicsRunningProcesses, or do these have the same problem? If these are working but nvmlDeviceGetMPSComputeRunningProcesses is not, then my assumption below is likely true.
What I think is going on is that the binary for libnvidia-ml.so with NVML 11.0 actually had the symbol for nvmlDeviceGetMPSComputeRunningProcesses compiled into it even though it wasn't available in the NVML header for this version. And I'm guessing that it already operated on the v2 version of the process struct even though it didn't explicitly have a v2 in its function name. Since it wasn't officially part of the header yet, people (in theory) shouldn't have known about it or been able to use it.
Then when NVML 11.4 came out, it was officially added to the API, but only as a v2 function (since it clearly operates on the v2 struct). It was OK to not "backport" support for the v1 struct into the original, unversioned function, because it was never officially supported. However, the original unversioned function is still present in older versions of the driver and is now being picked up by our go-nvml library (which is supposed to be usable on all versions of NVML 11.0+).
As such, we need to tell this original, unversioned function to actually operate on the v2 struct, even though that breaks the pattern from the other, similar functions (which were officially supported prior to the introduction of the v2 struct).
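This hypothesis is consistent with the numbers in the report. As a rough illustration (not go-nvml code), the Go sketch below lays out two v2-sized entries in a byte buffer and then reads them back with the v1 stride. It assumes a little-endian x86_64 C layout (16 bytes for the v1 struct, 24 bytes for the v2 struct, 4 bytes of zeroed padding after Pid). The type and function names are invented for the example, and the second entry's UsedGpuMemory is a placeholder because the report does not show it.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Illustrative mirrors of the two struct versions discussed in this thread.
type ProcessInfoV1 struct {
	Pid           uint32
	UsedGpuMemory uint64
}

type ProcessInfoV2 struct {
	Pid               uint32
	UsedGpuMemory     uint64
	GpuInstanceId     uint32
	ComputeInstanceId uint32
}

// Assumed C struct sizes on x86_64 (uint64 fields are 8-byte aligned).
const (
	sizeV1 = 16
	sizeV2 = 24
)

// encodeV2 writes the entries the way a driver operating on v2 structs would.
func encodeV2(entries []ProcessInfoV2) []byte {
	buf := make([]byte, sizeV2*len(entries))
	for i, e := range entries {
		b := buf[i*sizeV2:]
		binary.LittleEndian.PutUint32(b[0:], e.Pid)
		// bytes 4..7 are padding and stay zero
		binary.LittleEndian.PutUint64(b[8:], e.UsedGpuMemory)
		binary.LittleEndian.PutUint32(b[16:], e.GpuInstanceId)
		binary.LittleEndian.PutUint32(b[20:], e.ComputeInstanceId)
	}
	return buf
}

// decodeAsV1 reads n entries using the v1 stride, as a caller that
// believes the buffer holds v1 structs would.
func decodeAsV1(buf []byte, n int) []ProcessInfoV1 {
	out := make([]ProcessInfoV1, n)
	for i := range out {
		b := buf[i*sizeV1:]
		out[i].Pid = binary.LittleEndian.Uint32(b[0:])
		out[i].UsedGpuMemory = binary.LittleEndian.Uint64(b[8:])
	}
	return out
}

func main() {
	const noInstance = 0xFFFFFFFF // 4294967295, as seen in the report
	buf := encodeV2([]ProcessInfoV2{
		{Pid: 4126076, UsedGpuMemory: 2928672768, GpuInstanceId: noInstance, ComputeInstanceId: noInstance},
		{Pid: 4126048, UsedGpuMemory: 1 << 20 /* placeholder */, GpuInstanceId: noInstance, ComputeInstanceId: noInstance},
	})
	for i, p := range decodeAsV1(buf, 2) {
		fmt.Printf("mp[%d] = %+v\n", i, p)
	}
	// mp[0] = {Pid:4126076 UsedGpuMemory:2928672768}
	// mp[1] = {Pid:4294967295 UsedGpuMemory:4126048}
}
```

Under these assumptions the second v1 slot starts at byte 16, which is the first entry's GpuInstanceId, and its UsedGpuMemory field overlaps the second entry's Pid. That reproduces the Pid:4294967295 UsedGpuMemory:4126048 pair from the report.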
@klueska
DeviceGetComputeRunningProcesses and DeviceGetGraphicsRunningProcesses from the latest master branch work as expected (and DeviceGetComputeRunningProcesses behaves the same as deviceGetComputeRunningProcesses_v2).
I also agree with this point of view:
And I'm guessing that it already operated on the v2 version of the process struct even though it didn't explicitly have a v2 in its function name
Then when NVML 11.4 came out, it was officially added to the API, but only as a v2 function (since it clearly operates on the v2 struct). It was OK to not "backport" support for the v1 struct into the original, unversioned function, because it was never officially supported. However, the original unversioned function is still present in older versions of the driver and is now being picked up by our go-nvml library (which is supposed to be usable on all versions of NVML 11.0+).
In nvml.h (NVML 11.6 at master branch), the unversioned nvmlDeviceGetMPSComputeRunningProcesses uses struct nvmlProcessInfo_v1_t.
Right -- so I'm thinking that old driver versions (where the API wasn't published) were actually buggy by operating on the v2 struct, and then when they published the API, they retroactively went back and "fixed" things for it to use the v1 struct. Not sure though -- will need to check internally.
As I suspected, I got confirmation from the NVML team that this is what happened. The function nvmlDeviceGetMPSComputeRunningProcesses was never meant to be exposed in the libnvidia-ml.so.1 binary until the R470 driver. It's apparently common for "internal" functions like this to make their way into the binary (for testing and ease of merging code into the code base which is still under development), but normally these "internal" functions have their names mangled so as not to interfere with the real API once it is released.
So what I said before is exactly what happened -- there was a hidden version of nvmlDeviceGetMPSComputeRunningProcesses in R450 that operated on nvmlProcessInfo_v2_t structs, but since it was never meant to be exposed they "fixed" it to operate on nvmlProcessInfo_v1_t structs when it was officially released in R470 alongside a nvmlDeviceGetMPSComputeRunningProcesses_v2 API which operates on v2 structs.
So the short of it is that nvmlDeviceGetMPSComputeRunningProcesses is not intended to be available on your R450 driver, so you should not expect to be using it there. In the coming days I will merge something to master that prevents this function from being visible in drivers prior to R470.
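Until such a change lands, a caller could guard the call with a driver-version check. The Go sketch below is one possible shape for that guard, not the actual go-nvml change: driverMajor and mpsProcessesSupported are hypothetical helpers, and the version string is assumed to have the usual "major.minor[.patch]" form (for example, as returned by a driver version query such as nvml.SystemGetDriverVersion).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// First driver branch in which the function is officially exposed,
// according to the discussion above.
const minMPSDriverMajor = 470

// driverMajor extracts the major number from a driver version string
// such as "450.80.02". (Hypothetical helper, not part of go-nvml.)
func driverMajor(version string) (int, error) {
	parts := strings.SplitN(strings.TrimSpace(version), ".", 2)
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, fmt.Errorf("unrecognized driver version %q: %w", version, err)
	}
	return major, nil
}

// mpsProcessesSupported reports whether the MPS process query should be
// used with the given driver version.
func mpsProcessesSupported(version string) (bool, error) {
	major, err := driverMajor(version)
	if err != nil {
		return false, err
	}
	return major >= minMPSDriverMajor, nil
}

func main() {
	for _, v := range []string{"450.80.02", "450.142.00", "470.57.02"} {
		ok, err := mpsProcessesSupported(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("driver %s: MPS process query supported = %v\n", v, ok)
	}
}
```

A check like this keeps the decision in one place, so the threshold can be changed if the supported range turns out to be different.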
How about this idea:
For drivers < R470 (like 450.80, etc.), use nvmlDeviceGetMPSComputeRunningProcesses v2.
For drivers >= R470, use the current logic.
I find that nvidia-smi with the 450.80 driver gets the right result from nvmlDeviceGetMPSComputeRunningProcesses, so go-nvml needs to provide a way to do the same thing.
desc
I want to know whether it's possible to call the nvmlDeviceGetMPSComputeRunningProcesses_v2 API using go-nvml, and when it can be supported. Thanks.
api doc
https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g02098e9876e3fb86eeb9cac2222e5b5d