Metric about compute apps

NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM

Apache License 2.0


Metric about compute apps #94

Open onstring opened 2 years ago

onstring commented 2 years ago

Is there a metric for the compute processes allocated to each GPU, or would it be worth adding one? For example, something like the following nvidia-smi output:

> nvidia-smi --query-compute-apps=gpu_uuid,name --format=csv
gpu_uuid, process_name
GPU-d0180485-9584-433c-6782-c335d5df2cb3, vgpu
GPU-777ead31-954e-837f-590f-6c4974d8e571, vgpu
GPU-777ead31-954e-837f-590f-6c4974d8e571, vgpu

nikkon-dev commented 2 years ago

Hi @onstring,

There are no such metrics as of today. DCGM does not have fields with such information, but there is an API to collect information about running PIDs.

In what form would you want to see this information, and what would it be used for? I can imagine a metric with the total number of processes occupying a GPU, but I do not see how individual processes could be represented or used here. Could you elaborate?

onstring commented 2 years ago

The scenario is our cloud platform: besides instances that use GPUs, we also have many instances that use only ordinary compute/CPU resources. So we would like statistics on how many GPUs are occupied.

For example, from the nvidia-smi output above, we would like to know the number of processes (and maybe the process names) for each GPU instance:

GPU-d0180485-9584-433c-6782-c335d5df2cb3, 1
GPU-777ead31-954e-837f-590f-6c4974d8e571, 2
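
Until such a metric exists, the per-GPU counts above could be derived outside the exporter by parsing the nvidia-smi CSV output. The following is only a rough sketch, not something dcgm-exporter provides: the function names and the metric name gpu_compute_process_count are made up for illustration, and the sample text is the nvidia-smi output quoted earlier in this thread.

```python
import csv
import io
from collections import Counter

# Sample text copied from the nvidia-smi output quoted in this thread.
SAMPLE = """gpu_uuid, process_name
GPU-d0180485-9584-433c-6782-c335d5df2cb3, vgpu
GPU-777ead31-954e-837f-590f-6c4974d8e571, vgpu
GPU-777ead31-954e-837f-590f-6c4974d8e571, vgpu
"""


def count_compute_apps(csv_text):
    """Count compute processes per GPU UUID from
    `nvidia-smi --query-compute-apps=gpu_uuid,name --format=csv` output."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [row for row in reader if row]  # drop blank lines
    counts = Counter()
    for row in rows[1:]:  # rows[0] is the CSV header
        counts[row[0].strip()] += 1
    return counts


def to_prometheus_lines(counts, metric="gpu_compute_process_count"):
    """Render the counts as Prometheus text-format samples.
    The metric name is hypothetical, not an existing dcgm-exporter metric."""
    return [f'{metric}{{gpu_uuid="{uuid}"}} {n}' for uuid, n in counts.items()]


if __name__ == "__main__":
    for line in to_prometheus_lines(count_compute_apps(SAMPLE)):
        print(line)
```

One caveat with this approach: a GPU with no compute processes does not appear in the nvidia-smi output at all, so it would be missing from the result rather than reported as 0. Reporting idle GPUs would need a separate list of all GPU UUIDs.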