ThisIsQasim opened this issue 7 months ago
This appears to have been reported repeatedly: #151, #201, #222.
The performance metrics require exclusive access to the GPU hardware on the Turing architecture. If another pod tries to read the performance metrics at the same time, the DCGM exporter cannot read them.
There is only one pod per node trying to read the metrics, but there are multiple pods using the same GPU. The issue is that dcgm-exporter should report metrics for all the pods using the GPU.
@ThisIsQasim, can you share how you request GPU resources for pods?
Sure. A single GPU is advertised as multiple GPUs using the NVIDIA device plugin:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4
and then GPUs are requested with the regular resource requests:
resources:
  requests:
    cpu: 3600m
  limits:
    memory: 13000Mi
    nvidia.com/gpu: "1"
@ThisIsQasim, and do you use the GPU Operator?
I do not. It’s manually deployed.
Unfortunately, there is a known dcgm-exporter limitation: DCGM-Exporter does not support associating metrics to containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin.
Is there a known root cause for this issue?
From what I've dug up:
Pods using time-sliced GPUs append a -<idx> to the end of their deviceIDs, like so:
&ContainerDevices{ResourceName:nvidia.com/gpu,DeviceIds:[GPU-51424525-5928-4e4c-2503-8ca3bca0b134-2],}
Thanks @larry-lu-lu (https://github.com/NVIDIA/dcgm-exporter/issues/201#issuecomment-1825284066).
Therefore when the deviceToPodMap is updated here, none of the pods using the GPU are associated with the base deviceID. Execution then reaches this loop and, because none of the pods in deviceToPod are associated with the baseID, dcgm-exporter totally skips the pod/namespace label and moves on.
Unfortunately there doesn't seem to be a quick fix. As far as I understand, the DCGM metrics we collect are associated with exactly one UUID. This is OK for MIGs because they will each have a unique UUID. But metrics on time-sliced GPUs will, if I'm not mistaken, have the UUID of the base device, without an index attached.
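To make the idea concrete, here is a minimal Go sketch (a hypothetical helper, not dcgm-exporter's actual code) of stripping the replica index so a time-sliced device ID can be matched back to the base GPU UUID:

package main

import (
	"fmt"
	"strings"
)

// normalizeDeviceID strips a trailing time-slicing replica index (the
// "-<idx>" suffix added by the device plugin) so the pod's device ID can
// be matched against the base GPU UUID that DCGM reports. Hypothetical
// helper; it assumes the canonical "GPU-" + UUID (five hyphen-separated
// groups) format.
func normalizeDeviceID(deviceID string) string {
	parts := strings.Split(deviceID, "-")
	// "GPU" prefix plus five UUID groups = six parts; anything beyond
	// that is the replica index.
	if strings.HasPrefix(deviceID, "GPU-") && len(parts) > 6 {
		return strings.Join(parts[:6], "-")
	}
	return deviceID
}

func main() {
	id := "GPU-51424525-5928-4e4c-2503-8ca3bca0b134-2"
	fmt.Println(normalizeDeviceID(id)) // GPU-51424525-5928-4e4c-2503-8ca3bca0b134
}

Even with a mapping like this, every pod sharing the GPU would presumably end up with the same device-level values, since DCGM only reports one set of metrics per UUID.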
@nikkon-dev and others, forgive me for pinging, I would really like to know if my understanding is correct here.
I understand that, in the meantime, it will be the same for the new MPS support in the device plugin: per-pod metrics will not be shown. Is that correct? Another question: if we have pods that are not requesting a GPU through the device plugin but are able to use the GPU due to some tricks (mounts, etc.), can they be reported to dcgm-exporter when they use the GPU?
What is the version?
3.3.5-3.4.1
What happened?
Metrics like DCGM_FI_PROF_GR_ENGINE_ACTIVE are only exposed for a single pod even though there are multiple pods that use the same GPU.
What did you expect to happen?
Metrics for all the pods should be exposed.
What is the GPU model?
Tesla T4
What is the environment?
GKE
How did you deploy the dcgm-exporter and what is the configuration?
No response
How to reproduce the issue?
Anything else we need to know?
From the debug log