Open ritazh opened 1 month ago
Ignoring the error you are facing for a moment -- even if you got the DCGM exporter running, it would not show any GPU metrics. dcgm-exporter relies on the PodResources API to gather and report its GPU metrics, and it has not yet been updated to consume information about GPUs allocated via DRA.
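To make that concrete, here is a minimal sketch (not dcgm-exporter's actual code) of how an exporter can discover pod-to-device assignments through the kubelet PodResources API. It assumes the default socket path /var/lib/kubelet/pod-resources/kubelet.sock, which may be mounted elsewhere on your cluster. Device-plugin allocations show up in the per-container device list; DRA allocations are not surfaced the same way, and dcgm-exporter has not been taught to consume them yet:

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// Assumed default kubelet PodResources socket; adjust if mounted differently.
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}

	// Devices handed out by the device plugin appear here with their device IDs,
	// which is what lets an exporter label metrics with pod/namespace/container.
	// GPUs allocated via DRA do not show up in this device list.
	for _, pod := range resp.GetPodResources() {
		for _, c := range pod.GetContainers() {
			for _, d := range c.GetDevices() {
				fmt.Printf("%s/%s %s -> %v\n",
					pod.GetNamespace(), pod.GetName(), d.GetResourceName(), d.GetDeviceIds())
			}
		}
	}
}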
I see. FWIW, after installing the GPU operator in the same cluster where I have the DRA plugin, the dcgm-exporter that comes with the GPU operator was reporting GPU metrics for the distributed inference model running on the MIG devices in the cluster.
Example output:
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active.
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="8",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.029399
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="9",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.033893
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="10",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="11",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.034816
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="12",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="13",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.000000
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active.
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="8",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.002098
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="9",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.002359
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="10",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.002094
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="11",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.001672
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="12",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="13",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.000000
# HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the device memory interface is active sending or receiving data.
# TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="8",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.015358
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="9",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.017687
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="10",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.015245
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="11",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.019403
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="12",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="13",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.000000
Are you saying these metrics may not be accurate?
If they are not accurate, and we want to get GPU metrics from this cluster running the DRA driver, what would you recommend we try?
These metrics are accurate, but you won't get any of the per-pod GPU metrics that you normally get with GPUs allocated via the standard device plugin.
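Concretely, when dcgm-exporter can correlate devices to pods via PodResources, each series also carries pod-attribution labels (typically namespace, pod, and container). An illustrative example, not taken from your cluster:

DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",GPU_I_PROFILE="1g.10gb",GPU_I_ID="8",namespace="default",pod="inference-worker-0",container="inference"} 0.029399

With DRA-allocated MIG devices those labels will be missing or empty, so you can still see per-MIG-instance utilization (as in your output above) but not which pod is driving it.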
Should we avoid running the GPU operator and this DRA plugin together? What is the roadmap for this plugin and the operator?
Nothing has been integrated with the GPU Operator yet. We have plans to do that soon, but we will not make any commitments until it is confirmed when DRA will go to beta upstream.
Was trying to get dcgm-exporter working after installing this, but the helm install errored with
Running ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* on the host shows the files, but running the same command inside the kind worker node shows nothing.
Installing the GPU operator helped, but should we avoid running the GPU operator and this DRA plugin together? Is there a way to get NVML into the nodes without having to install the operator?
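One option that might avoid the operator (untested on our side, just a sketch) is to bind-mount the host's NVML library into the kind worker node via kind's extraMounts when creating the cluster; the driver likely exposes more files than just libnvidia-ml, so this alone may not be sufficient:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: worker
  extraMounts:
  # Assumed host path; adjust to wherever the driver installed libnvidia-ml on your machine.
  - hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
    containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
    readOnly: true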