NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

Per pod metrics not exposed with time-slicing enabled #307

Open ThisIsQasim opened 7 months ago

ThisIsQasim commented 7 months ago

What is the version?

3.3.5-3.4.1

What happened?

Metrics like DCGM_FI_PROF_GR_ENGINE_ACTIVE are only exposed for a single pod, even though multiple pods use the same GPU

What did you expect to happen?

Metrics for all the pods should be exposed

What is the GPU model?

Tesla T4

What is the environment?

GKE

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

Anything else we need to know?

From the debug log:

time="2024-04-05T13:49:04Z" level=debug msg="Device to pod mapping: map[nvidia0:{Name:gpu-pod-c69f6664f-vkkcb Namespace:default Container:extractor} nvidia0/vgpu0:{Name:gpu-pod-c69f6664f-vkkcb Namespace:default Container:extractor} nvidia0/vgpu1:{Name:gpu-pod-c69f6664f-2v922 Namespace:default Container:extractor} nvidia0/vgpu2:{Name:gpu-pod-c69f6664f-wrcxw Namespace:default Container:extractor} nvidia0/vgpu3:{Name:gpu-pod-c69f6664f-ffs8r Namespace:default Container:extractor}]"
ThisIsQasim commented 7 months ago

This appears to have been reported repeatedly: #151, #201, #222

nvvfedorov commented 7 months ago

Performance metrics require exclusive access to the GPU hardware on the Turing architecture. If another pod tries to read performance metrics, dcgm-exporter cannot read them.

ThisIsQasim commented 7 months ago

There is only one pod per node trying to read the metrics, but there are multiple pods using the same GPU. The issue is that dcgm-exporter should report metrics for all the pods using the GPU.

nvvfedorov commented 7 months ago

@ThisIsQasim, can you share how you request GPU resources for your pods?

ThisIsQasim commented 7 months ago

Sure. A single GPU is advertised as multiple GPUs using the NVIDIA device plugin:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4

and then GPUs are requested with regular resource requests:

resources:
  requests:
    cpu: 3600m
  limits:
    memory: 13000Mi
    nvidia.com/gpu: "1"
nvvfedorov commented 7 months ago

@ThisIsQasim, and do you use the GPU Operator?

ThisIsQasim commented 7 months ago

I do not. It’s manually deployed.

nvvfedorov commented 7 months ago

Unfortunately, there is a known dcgm-exporter limitation: DCGM-Exporter does not support associating metrics with containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin.

svetly-todorov commented 5 months ago

Unfortunately, there is a known dcgm-exporter limitation: DCGM-Exporter does not support associating metrics with containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin.

Is there a known root-cause for this issue?


From what I've dug up:

Pods using time-sliced GPUs get a -<idx> appended to the end of their device IDs, like so:

&ContainerDevices{ResourceName:nvidia.com/gpu,DeviceIds:[GPU-51424525-5928-4e4c-2503-8ca3bca0b134-2],}

Thanks @larry-lu-lu (https://github.com/NVIDIA/dcgm-exporter/issues/201#issuecomment-1825284066).

Therefore, when the deviceToPodMap is updated, none of the pods using the GPU end up associated with the base device ID. Execution then reaches the labeling loop and, because none of the pods in deviceToPod are keyed by the base ID, dcgm-exporter skips the pod/namespace labels entirely and moves on.
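
To make that concrete, here is a simplified, hypothetical illustration of the miss (not the actual exporter code; the device ID and pod names are just taken from the examples above): the map is keyed by the replica-suffixed IDs, while the metric only carries the base GPU UUID, so the lookup comes back empty.

package main

import "fmt"

func main() {
	// Device-to-pod entries as reported by the pod-resources API for
	// time-sliced replicas: each key carries a trailing "-<idx>".
	deviceToPod := map[string]string{
		"GPU-51424525-5928-4e4c-2503-8ca3bca0b134-0": "gpu-pod-c69f6664f-vkkcb",
		"GPU-51424525-5928-4e4c-2503-8ca3bca0b134-1": "gpu-pod-c69f6664f-2v922",
	}

	// DCGM reports the metric against the base UUID, without a replica suffix.
	baseID := "GPU-51424525-5928-4e4c-2503-8ca3bca0b134"
	if pod, ok := deviceToPod[baseID]; ok {
		fmt.Println("label metric with pod", pod)
	} else {
		fmt.Println("no pod found for", baseID, "-> pod/namespace labels are skipped")
	}
}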

Unfortunately there doesn't seem to be a quick fix. As far as I understand, the DCGM metrics we collect are associated with exactly one UUID. This is OK for MIGs because they will each have a unique UUID. But metrics on time-sliced GPUs will, if I'm not mistaken, have the UUID of the base device, without an index attached.
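
If the replica suffix were stripped when the map is built, all replicas would collapse onto the base UUID, which also means the map would need to hold several pods per device instead of one. A rough sketch of that idea, assuming the -<idx> suffix format shown above (PodInfo and the sample entries are made up, not the exporter's real types):

package main

import (
	"fmt"
	"regexp"
)

type PodInfo struct {
	Name, Namespace, Container string
}

// Trailing "-<digits>" added by the device plugin for time-sliced replicas.
// Note: a UUID whose last group happens to be all digits would also match,
// so a real implementation would need a stricter check.
var replicaSuffix = regexp.MustCompile(`-\d+$`)

func baseDeviceID(id string) string {
	return replicaSuffix.ReplaceAllString(id, "")
}

func main() {
	// Device IDs as reported for time-sliced replicas.
	reported := map[string]PodInfo{
		"GPU-51424525-5928-4e4c-2503-8ca3bca0b134-0": {"gpu-pod-c69f6664f-vkkcb", "default", "extractor"},
		"GPU-51424525-5928-4e4c-2503-8ca3bca0b134-1": {"gpu-pod-c69f6664f-2v922", "default", "extractor"},
	}

	deviceToPods := map[string][]PodInfo{}
	for id, pod := range reported {
		base := baseDeviceID(id)
		// All replicas collapse to the same base UUID, so each GPU maps
		// to a list of pods rather than a single pod.
		deviceToPods[base] = append(deviceToPods[base], pod)
	}

	// Every pod sharing the GPU is now reachable from the base UUID, so the
	// exporter could emit the same metric once per pod.
	fmt.Printf("%+v\n", deviceToPods)
}

Whether duplicating one GPU's utilization value across several pods is the right semantics is, of course, part of why this isn't a quick fix.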

@nikkon-dev and others, forgive me for pinging, I would really like to know if my understanding is correct here.

ettelr commented 5 months ago

I understand that, for now, the same will apply to the new MPS support in the device plugin: per-pod metrics will not be shown. Is that correct? Another question: if we have pods that are not requesting a GPU through the device plugin but are able to use the GPU through other means (mounts, etc.), can they be reported by DCGM when they use the GPU?