NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

GPU Failure Detection and Alerting Enhancement #348

Open jz543fm opened 3 months ago

jz543fm commented 3 months ago

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

Please provide a clear description of the problem this feature solves

I've noticed that on some node(s) I am receiving the error NVRM: GPU 0000:a1:00.0: GPU has fallen off the bus. When this happens the GPU no longer works in Kubernetes, and you need to reset the card with nvidia-smi -r and kill everything that is using it (docker/containerd, because of the nvidia device plugin). I want to create an alert that monitors the actual number of working GPU cards, so it would be helpful if dcgm-exporter exposed in its metrics how many GPUs are working and how many are not. A GPU failure should not affect dcgm-exporter itself; it should keep running and report the error state in its metrics.

I tried the metric kube_node_status_allocatable{resource="nvidia_com_gpu"} and it does not work, and delta(sum(kube_node_status_allocatable{resource="nvidia_com_gpu"}) by (cluster_name,node_name)[2m:]) < 0 does not work either; neither reports the error state when the failure happens.

Feature Description

User Story 1:

As a Kubernetes administrator, I want to monitor the status of the GPU cards on my nodes, so that I can maintain optimal performance and availability of GPU resources within my Kubernetes cluster when errors occur on one or more GPUs.

Requirement Specification:
Pre-condition: GPUs are installed and running on Kubernetes nodes, and dcgm-exporter is configured to collect GPU metrics.
System: alerting mechanism (e.g., Prometheus Alertmanager)
Shall: generate alerts based on the difference between the expected number of GPUs and the actual number of functioning GPUs
Object to be processed: GPU count
Condition: an alert should be triggered if there is a discrepancy between the expected and actual GPU count, indicating potential GPU failures.

Describe your ideal solution

To effectively manage GPU resources within a Kubernetes cluster and promptly detect GPU failures, the system should be enhanced to accurately monitor GPU status and generate alerts when discrepancies are detected. This involves ensuring that dcgm-exporter accurately reports the status of each GPU, even if a GPU has, for example, fallen off the bus, and implementing an alerting mechanism that triggers notifications based on discrepancies between expected and actual GPU counts.
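A minimal sketch of such an alerting rule, assuming a per-GPU series like DCGM_FI_DEV_GPU_TEMP is exported for every healthy card and that each node is expected to carry 8 GPUs; the rule name, threshold, and label names (e.g. Hostname) are placeholders that depend on the setup:

groups:
  - name: gpu-count
    rules:
      - alert: GPUCountBelowExpected
        # fires when fewer GPUs than expected report metrics on a node;
        # 8 is a placeholder for the expected per-node GPU count
        expr: count by (Hostname) (DCGM_FI_DEV_GPU_TEMP) < 8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.Hostname }} reports fewer GPUs than expected"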

Additional context

No response

jz543fm commented 3 months ago

I am able to track XID errors with the Prometheus query: sum(DCGM_FI_DEV_XID_ERRORS) by (cluster_name,node_name,device,err_msg) > 0

#308

jz543fm commented 2 months ago

It looks like dcgm-exporter cannot report the error Xid 79: GPU has fallen off the bus. dcgm-exporter should be GPU fault tolerant and should also expose GPU health status as a metric itself.

jz543fm commented 2 months ago

@nvvfedorov @glowkey It seems like this issue is being overlooked

glowkey commented 2 months ago

In the next major version, DCGM is adding support for these XID errors that happen when the GPU falls off the bus. Once that support is added to DCGM, they will be available in DCGM-Exporter.

Irene-123 commented 1 month ago

Hi @glowkey, I went through the issue and am looking to contribute here, though I need some time for more clarification and understanding. I wanted to know if I can take this up as my first issue here, or if you have any suggestions, let me know :) Thanks!

jz543fm commented 1 month ago

I do not understand why this issue should be duplicated.

jz543fm commented 1 month ago

@glowkey The XID 79 error affects the dcgm-exporter pod like this: the pod is still running and looks OK, but some metrics are missing, so a metric that represents health status would also be useful. Because of this problem, when I manually killed the running pod and the scheduler scheduled a new one, the new pod could not be created, so it was not running and therefore could not export metrics.

Irene-123 commented 1 month ago

@jz543fm which is the original issue for this? Is there a Discord channel for discussion? I was hoping for some help to make my first contribution here.

jz543fm commented 1 month ago

This is the original issue. The issue mentioned above (#308) is already closed because of the latest release of dcgm-exporter, which can detect XID errors, as I mentioned in a comment on that issue. But that issue does not cover the problems that are already known here, which is why I do not understand why it should be duplicated when this issue already defines the problem behind it and is currently the only original issue for it.

No, there is no Discord channel to discuss it.

jz543fm commented 1 month ago

@glowkey I have also detected missing metrics during this error: there are no DCP metrics, so at the very least that is a situation you can create an alert on. The set of exported metrics changes, but not every card generates the same number of metrics.

jz543fm commented 1 month ago

It is possible to create an alert on missing metrics, such as

absent(DCGM_FI_PROF_GR_ENGINE_ACTIVE{})

to detect the XID 79 error.
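Note that absent() only fires when the metric is missing from every scraped node. A per-node sketch, assuming the exporter targets are scraped under a job label such as dcgm-exporter (adjust to your scrape config), could be:

# nodes whose dcgm-exporter target is up but which export no
# DCGM_FI_PROF_GR_ENGINE_ACTIVE series at all
count by (instance) (up{job="dcgm-exporter"} == 1)
  unless
count by (instance) (DCGM_FI_PROF_GR_ENGINE_ACTIVE)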

glowkey commented 1 month ago

That's great if it works for your setup. Not all GPUs support that metric but it can work in some cases.

jz543fm commented 3 weeks ago

As far as I can see, release 3.3.6-3.4.2 can catch error 79. When I compared it with 3.3.5-3.4.1, the 3.3.5-3.4.1 release cannot detect the XID errors, and when the error happens some DCP metrics go missing, whereas in 3.3.6-3.4.2 no metrics are missing. But in both releases, when the error happens, dcgm-exporter keeps exporting metrics as if nothing happened; one release has missing metrics and the other does not. I think there should be something like a health status metric, so that when the XID 79 error happens the health status reported by dcgm-exporter on the affected node would be 0 :D and that would indicate something is wrong.
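Until such a metric exists in dcgm-exporter itself, one possible workaround is a Prometheus recording rule that synthesizes a rough per-node health status from the XID metric; the recorded metric name below is made up, and the labels follow the query used earlier in this thread:

groups:
  - name: gpu-health
    rules:
      - record: node:gpu_health:status
        # 1 while the summed XID error value for the node is 0, 0 otherwise.
        # Caveat: if the exporter stops producing DCGM_FI_DEV_XID_ERRORS
        # entirely (as described for 3.3.5-3.4.1), this series disappears
        # instead of going to 0, so an absence check is still needed.
        expr: sum by (cluster_name, node_name) (DCGM_FI_DEV_XID_ERRORS) == bool 0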

ducnt102 commented 2 weeks ago

If a GPU has failed, it cannot expose some metrics. This is my alert for this:

(
  count by (instance) (DCGM_FI_DEV_SM_CLOCK) < 8
  or count by (instance) (DCGM_FI_DEV_MEM_CLOCK) < 8
  or count by (instance) (DCGM_FI_DEV_MEMORY_TEMP) < 8
  or count by (instance) (DCGM_FI_DEV_GPU_TEMP) < 8
  or count by (instance) (DCGM_FI_DEV_POWER_USAGE) < 8
  or count by (instance) (DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION) < 8
  or count by (instance) (DCGM_FI_DEV_PCIE_REPLAY_COUNTER) < 8
  or count by (instance) (DCGM_FI_DEV_GPU_UTIL) < 8
  or count by (instance) (DCGM_FI_DEV_MEM_COPY_UTIL) < 8
  or count by (instance) (DCGM_FI_DEV_ENC_UTIL) < 8
  or count by (instance) (DCGM_FI_DEV_DEC_UTIL) < 8
)
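A possible refinement, if kube-state-metrics is available, is to derive the expected count per node from kube_node_status_allocatable instead of hardcoding 8. This is only a sketch: it assumes the dcgm-exporter series carry a Hostname label and that the kube-state-metrics node label values match those hostnames, which may not hold in every setup:

# fires for nodes where fewer GPUs report metrics than Kubernetes
# says are allocatable on that node
count by (Hostname) (DCGM_FI_DEV_SM_CLOCK)
  < on (Hostname)
label_replace(kube_node_status_allocatable{resource="nvidia_com_gpu"}, "Hostname", "$1", "node", "(.*)")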