NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

GPU Failure Detection and Alerting Enhancement #348

Open jz543fm opened 3 months ago

jz543fm commented 3 months ago

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

Please provide a clear description of the problem this feature solves

I've noticed that on some node(s) I am receiving the error NVRM: GPU 0000:a1:00.0: GPU has fallen off the bus. When this happens the GPU no longer works in Kubernetes, and you need to reset the card with nvidia-smi -r and kill everything that is using it (docker/containerd, because of the nvidia device plugin). I want to create an alert that monitors the actual number of working GPU cards, so it would be helpful if dcgm-exporter exposed in its metrics how many GPUs are working and how many are not. A GPU failure should not affect dcgm-exporter itself; it should keep running and report the error state in its metrics.

I tried the metric kube_node_status_allocatable{resource="nvidia_com_gpu"} and it does not work, and delta(sum(kube_node_status_allocatable{resource="nvidia_com_gpu"}) by (cluster_name,node_name)[2m:]) < 0 does not work either; neither reports the error state when the failure happens.

Feature Description

User Story 1:

As a Kubernetes administrator, I want to monitor the status of the GPU cards on my nodes, so that I can maintain optimal performance and availability of GPU resources within my Kubernetes cluster when errors occur on one or more GPUs.

Requirement Specification:
Pre-condition: GPUs are installed and running on Kubernetes nodes, and dcgm-exporter is configured to collect GPU metrics.
System: alerting mechanism (e.g., Prometheus Alertmanager)
Shall: generate alerts based on the difference between the expected number of GPUs and the actual number of functioning GPUs
Object to be processed: GPU count
Condition: an alert should be triggered if there is a discrepancy between the expected and actual GPU count, indicating potential GPU failures.

Describe your ideal solution

To effectively manage GPU resources within a Kubernetes cluster and promptly detect GPU failures, the system should be enhanced to accurately monitor GPU status and generate alerts when discrepancies are detected. This involves ensuring that dcgm-exporter accurately reports the status of each GPU, even if a GPU has, for example, fallen off the bus, and implementing an alerting mechanism that triggers notifications based on discrepancies between expected and actual GPU counts.
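A minimal sketch of such an alerting rule, assuming a per-GPU series like DCGM_FI_DEV_GPU_TEMP is exported for every healthy card and that each node is expected to carry 8 GPUs; the rule name, threshold, and label names (e.g. Hostname) are placeholders that depend on the setup:

groups:
  - name: gpu-count
    rules:
      - alert: GPUCountBelowExpected
        # fires when fewer GPUs than expected report metrics on a node;
        # 8 is a placeholder for the expected per-node GPU count
        expr: count by (Hostname) (DCGM_FI_DEV_GPU_TEMP) < 8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.Hostname }} reports fewer GPUs than expected"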

Additional context

No response

jz543fm commented 3 months ago

I am able to track XID errors with the Prometheus query: sum(DCGM_FI_DEV_XID_ERRORS) by (cluster_name,node_name,device,err_msg) > 0

#308

jz543fm commented 2 months ago

It looks like dcgm-exporter cannot report the error Xid 79: GPU has fallen off the bus. dcgm-exporter should be GPU fault tolerant and should also expose GPU health status as a metric itself.

jz543fm commented 2 months ago

@nvvfedorov @glowkey It seems like this issue is being overlooked

glowkey commented 2 months ago

In the next major version, DCGM is adding support for these XID errors that happen when the GPU falls off the bus. Once that support is added to DCGM, they will be available in DCGM-Exporter.

Irene-123 commented 1 month ago

Hi @glowkey, I went through the issue and am looking to contribute here, though I need some time for more clarification and understanding. I wanted to know if I can take this up as my first issue here, or if you have any suggestions, let me know :) Thanks!

jz543fm commented 1 month ago

I do not understand why this issue should be duplicated.

jz543fm commented 1 month ago

@glowkey The XID 79 error affects the dcgm-exporter pod like this: the pod is still running and looks OK, but some metrics are missing, so a metric that represents health status would also be useful. Because of this problem, when I manually killed the running pod and the scheduler scheduled a new one, the new pod could not be created, so it was not running and therefore could not export metrics.

Irene-123 commented 1 month ago

@jz543fm which is the original issue for this? Is there a Discord channel for discussion? I was hoping for some help to make my first contribution here.

jz543fm commented 1 month ago

This is the original issue. The issue mentioned above (#308) is already closed because of the latest release of dcgm-exporter, which can detect XID errors, as I mentioned in a comment on that issue. But that issue does not cover the problems that are already known here, which is why I do not understand why it should be duplicated when this issue already defines the problem behind it and is currently the only original issue for it.

No, there is no Discord channel to discuss it.

jz543fm commented 1 month ago

@glowkey I have also detected missing metrics during this error: there are no DCP metrics, so at the very least that is a situation you can create an alert on. The set of exported metrics changes, but not every card generates the same number of metrics.

jz543fm commented 1 month ago

It is possible to create an alert on missing metrics, such as

absent(DCGM_FI_PROF_GR_ENGINE_ACTIVE{})

to detect the XID 79 error.
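Note that absent() only fires when the metric is missing from every scraped node. A per-node sketch, assuming the exporter targets are scraped under a job label such as dcgm-exporter (adjust to your scrape config), could be:

# nodes whose dcgm-exporter target is up but which export no
# DCGM_FI_PROF_GR_ENGINE_ACTIVE series at all
count by (instance) (up{job="dcgm-exporter"} == 1)
  unless
count by (instance) (DCGM_FI_PROF_GR_ENGINE_ACTIVE)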

glowkey commented 1 month ago

That's great if it works for your setup. Not all GPUs support that metric but it can work in some cases.

jz543fm commented 3 weeks ago

As far as I can see, release 3.3.6-3.4.2 can catch error 79. When I compared it with 3.3.5-3.4.1, the 3.3.5-3.4.1 release cannot detect the XID errors, and when the error happens some DCP metrics go missing, whereas in 3.3.6-3.4.2 no metrics are missing. But in both releases, when the error happens, dcgm-exporter keeps exporting metrics as if nothing happened; one release has missing metrics and the other does not. I think there should be something like a health status metric, so that when the XID 79 error happens the health status reported by dcgm-exporter on the affected node would be 0 :D and that would indicate something is wrong.
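Until such a metric exists in dcgm-exporter itself, one possible workaround is a Prometheus recording rule that synthesizes a rough per-node health status from the XID metric; the recorded metric name below is made up, and the labels follow the query used earlier in this thread:

groups:
  - name: gpu-health
    rules:
      - record: node:gpu_health:status
        # 1 while the summed XID error value for the node is 0, 0 otherwise.
        # Caveat: if the exporter stops producing DCGM_FI_DEV_XID_ERRORS
        # entirely (as described for 3.3.5-3.4.1), this series disappears
        # instead of going to 0, so an absence check is still needed.
        expr: sum by (cluster_name, node_name) (DCGM_FI_DEV_XID_ERRORS) == bool 0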

ducnt102 commented 2 weeks ago

If a GPU has failed, it cannot expose some metrics. This is my alert for this:

(
  count by (instance) (DCGM_FI_DEV_SM_CLOCK) < 8
  or count by (instance) (DCGM_FI_DEV_MEM_CLOCK) < 8
  or count by (instance) (DCGM_FI_DEV_MEMORY_TEMP) < 8
  or count by (instance) (DCGM_FI_DEV_GPU_TEMP) < 8
  or count by (instance) (DCGM_FI_DEV_POWER_USAGE) < 8
  or count by (instance) (DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION) < 8
  or count by (instance) (DCGM_FI_DEV_PCIE_REPLAY_COUNTER) < 8
  or count by (instance) (DCGM_FI_DEV_GPU_UTIL) < 8
  or count by (instance) (DCGM_FI_DEV_MEM_COPY_UTIL) < 8
  or count by (instance) (DCGM_FI_DEV_ENC_UTIL) < 8
  or count by (instance) (DCGM_FI_DEV_DEC_UTIL) < 8
)
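A possible refinement, if kube-state-metrics is available, is to derive the expected count per node from kube_node_status_allocatable instead of hardcoding 8. This is only a sketch: it assumes the dcgm-exporter series carry a Hostname label and that the kube-state-metrics node label values match those hostnames, which may not hold in every setup:

# fires for nodes where fewer GPUs report metrics than Kubernetes
# says are allocatable on that node
count by (Hostname) (DCGM_FI_DEV_SM_CLOCK)
  < on (Hostname)
label_replace(kube_node_status_allocatable{resource="nvidia_com_gpu"}, "Hostname", "$1", "node", "(.*)")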