NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

The pod for a given GPU in k8s mode cannot be captured #314

Open rokkiter opened 5 months ago

rokkiter commented 5 months ago

What happened?

Unable to collect GPU metrics for relevant pods when using passthrough mode. For example, dcgm-exporter does not collect metrics when a VM created with kubevirt mounts a GPU in passthrough mode.

kubevirt VMI YAML for mounting the GPU:

spec:
  domain:
    devices:
      ...
      gpus:
      - deviceName: nvidia.com/GP104GL_TESLA_P4
        name: gpu1

The resource requests of the kubevirt launcher pod that needs to be monitored:

resources:
  ...
  requests:
    ...
    nvidia.com/GP104GL_TESLA_P4: "1"

I have some GPU cards attached in my cluster, and from kubectl describe node I can get the following information:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests           Limits
  --------                       --------           ------
  ...
  nvidia.com/GP104GL_TESLA_P4    2                  2
  nvidia.com/GRID_P4-1Q          0                  0
  nvidia.com/GRID_P4-4Q          0                  0

In this case, the GPU cards are assigned to pods whose GPU metrics the exporter cannot capture.

In the following code, we can see that the filter rule is resourceName == nvidiaResourceName or strings.HasPrefix(resourceName, nvidiaMigResourcePrefix), where nvidiaResourceName is "nvidia.com/gpu". This filters out devices mounted under any other resource name. https://github.com/NVIDIA/dcgm-exporter/blob/main/pkg/dcgmexporter/kubernetes.go#L142

func (p *PodMapper) toDeviceToPod(
    devicePods *podresourcesapi.ListPodResourcesResponse, sysInfo SystemInfo,
) map[string]PodInfo {
    deviceToPodMap := make(map[string]PodInfo)

    for _, pod := range devicePods.GetPodResources() {
        for _, container := range pod.GetContainers() {
            for _, device := range container.GetDevices() {

                resourceName := device.GetResourceName()
                if resourceName != nvidiaResourceName {
                    // Mig resources appear differently than GPU resources
                    if !strings.HasPrefix(resourceName, nvidiaMigResourcePrefix) {
                        continue
                    }
                }
                ...
            }
        }
    }

    return deviceToPodMap
}

This appears to be because dcgm-exporter strictly follows the Kubernetes device plugin convention for identifying GPU resources (refer to the k8s device plugin), but that convention cannot cover all scenarios.

The device plugin documentation describes the ResourceName a plugin wants to advertise: the ResourceName needs to follow the extended resource naming scheme vendor-domain/resourcetype. (For example, an NVIDIA GPU is advertised as nvidia.com/gpu.)
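
To make the mismatch concrete, here is a small, self-contained sketch (not part of dcgm-exporter) that replays the exporter's current check against the resource names from this issue. The constant values are assumed to mirror kubernetes.go (in particular, that the MIG prefix is nvidia.com/mig-).

package main

import (
    "fmt"
    "strings"
)

const (
    nvidiaResourceName      = "nvidia.com/gpu"
    nvidiaMigResourcePrefix = "nvidia.com/mig-"
)

// matchedByExporter reproduces the filter used in toDeviceToPod:
// only nvidia.com/gpu and MIG-prefixed resources are kept.
func matchedByExporter(resourceName string) bool {
    return resourceName == nvidiaResourceName ||
        strings.HasPrefix(resourceName, nvidiaMigResourcePrefix)
}

func main() {
    for _, name := range []string{
        "nvidia.com/gpu",              // matched
        "nvidia.com/mig-1g.5gb",       // matched (MIG prefix)
        "nvidia.com/GP104GL_TESLA_P4", // not matched: passthrough device
        "nvidia.com/GRID_P4-1Q",       // not matched: vGPU profile
    } {
        fmt.Printf("%-28s matched=%v\n", name, matchedByExporter(name))
    }
}

Pods that request the last two resource names are therefore skipped when the device-to-pod map is built, which matches the behaviour observed above.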

What did you expect to happen?

GPU metrics can be collected when a GPU card is attached using kubevirt passthrough mode.

What is the GPU model?

What is the environment?

pod

How did you deploy the dcgm-exporter and what is the configuration?

GPU Operator

How can we reproduce the issue?

Mounting a GPU card using kubevirt passthrough mode.

What is the version?

Latest

Anything else we need to know?

Some discussions in the kubevirt community. https://github.com/kubevirt/kubevirt/issues/11660

nvvfedorov commented 5 months ago

@rokkiter , The dcgm-exporter depends on https://github.com/NVIDIA/k8s-device-plugin and uses the pod-resources API to read the mapping between pods and devices: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
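
For reference, here is a minimal sketch (under assumed defaults, not production code) of reading that pod-to-device mapping directly from the kubelet's pod-resources socket, which is the same API the exporter consumes. The socket path /var/lib/kubelet/pod-resources/kubelet.sock is the upstream default and may differ per deployment.

package main

import (
    "context"
    "fmt"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
    // Default kubelet pod-resources socket (assumed path; adjust per node).
    conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    client := podresourcesapi.NewPodResourcesListerClient(conn)
    resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
    if err != nil {
        panic(err)
    }

    // Print every device allocation; with GPU passthrough the resource name is
    // the device-specific one, e.g. nvidia.com/GP104GL_TESLA_P4, not nvidia.com/gpu.
    for _, pod := range resp.GetPodResources() {
        for _, container := range pod.GetContainers() {
            for _, device := range container.GetDevices() {
                fmt.Printf("%s/%s (%s): %s -> %v\n",
                    pod.GetNamespace(), pod.GetName(), container.GetName(),
                    device.GetResourceName(), device.GetDeviceIds())
            }
        }
    }
}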

KubeVirt is a new environment for us. Can you give us details on setting up an environment to reproduce the issue?

Also, please explain your use case to justify the feature.

rokkiter commented 4 months ago

Installation environment reference: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-kubevirt.html

kubevirt GPU configuration reference: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-kubevirt.html#add-gpu-resources-to-kubevirt-cr

Using Prometheus, I was able to get monitoring information for the pods in my environment that use nvidia.com/gpu resources (not created by kubevirt), but not for the pods created by kubevirt that use nvidia.com/GRID_P4-1Q.

(screenshot: environment)

rokkiter commented 4 months ago

Node configuration for pass-through mode:

  1. Enable IOMMU on the node. Refer to https://www.server-world.info/en/note?os=CentOS_7&p=kvm&f=10
  2. Add the label gpu.workload.config=vm-passthrough to the node.
  3. Update the gpu-operator config:
    gpu-operator.sandboxWorkloads.enabled=true
    gpu-operator.vfioManager.enabled=true
    gpu-operator.sandboxDevicePlugin.enabled=true
    gpu-operator.sandboxDevicePlugin.version=v1.2.4
    gpu-operator.toolkit.version=v1.14.3-ubuntu20.04

nvvfedorov commented 4 months ago

@rokkiter , thank you for the update and provided details.

rokkiter commented 4 months ago

Thanks for focusing on this issue. I recently realized that nodes configured for pass-through mode do not get dcgm-exporter installed; even if I manually apply the nvidia.com/gpu.deploy.dcgm-exporter=true label to the node, the label is automatically removed! Although it does not seem possible to monitor kubevirt VM GPU usage at the moment, it would be nice to have a solution for it!

lx1036 commented 2 months ago

Same question here. The nvidiaResourceName should not be hard-coded to nvidia.com/gpu. If https://github.com/NVIDIA/k8s-device-plugin (which I patched and rebuilt to rename the ResourceName) advertises a ResourceName such as nvidia.com/a100, the exporter cannot build the deviceToPodMap for that ResourceName.

for _, pod := range devicePods.GetPodResources() {
    for _, container := range pod.GetContainers() {
        for _, device := range container.GetDevices() {

            resourceName := device.GetResourceName()
            if resourceName != nvidiaResourceName {
                // Mig resources appear differently than GPU resources
                if !strings.HasPrefix(resourceName, nvidiaMigResourcePrefix) {
                    continue
                }
            }

            podInfo := PodInfo{
                Name:      pod.GetName(),
                Namespace: pod.GetNamespace(),
                Container: container.GetName(),
            }
            ...
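
One possible direction, sketched here only for discussion and not necessarily what the project will adopt, is to treat any resource advertised under the nvidia.com/ vendor domain as a GPU resource instead of matching only nvidia.com/gpu and the MIG prefix:

package main

import (
    "fmt"
    "strings"
)

// isNvidiaResource is a relaxed check: it accepts nvidia.com/gpu, MIG profiles,
// passthrough device names, and vGPU profiles alike.
func isNvidiaResource(resourceName string) bool {
    return strings.HasPrefix(resourceName, "nvidia.com/")
}

func main() {
    fmt.Println(isNvidiaResource("nvidia.com/GP104GL_TESLA_P4")) // true
    fmt.Println(isNvidiaResource("nvidia.com/GRID_P4-1Q"))       // true
    fmt.Println(isNvidiaResource("amd.com/gpu"))                 // false
}

Whether such a broad match is desirable (for example, whether vGPU profiles should also be mapped to pods) is a design choice for the maintainers.
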
nvvfedorov commented 2 months ago

@lx1036 , Thank you for your finding. We are accepting PRs;)

lx1036 commented 2 months ago

@nvvfedorov I have already opened the PR: https://github.com/NVIDIA/dcgm-exporter/pull/359. Thanks.