NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

How to query the validation result using api? #787

Open chenditc opened 4 months ago

chenditc commented 4 months ago

1. Quick Debug Information

2. Issue or feature description

By checking the code and the pod status, I can see that nvidia-operator-validator finishes its init containers and completes validation. But the result does not seem to be reported anywhere, for example as a node label.

I have an operator that deploys machine learning models. I want to check the GPU validation result before deploying a model, so that I can ensure there is sufficient hardware instead of hanging in the "FailedToSchedule" state or failing at runtime.

Is there a way to query the validation result? I am thinking of using a node label or some API like /metrics.

cdesiniotis commented 4 months ago

@chenditc one option is to enable the nvidia-node-status-exporter pod by setting nodeStatusExporter.enabled=true in the ClusterPolicy. This component exposes the following metrics: https://github.com/NVIDIA/gpu-operator/blob/main/validator/metrics.go#L50-L68
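
For illustration, here is a rough sketch of how a custom operator might consume those metrics before scheduling a model. It assumes the exporter serves the standard Prometheus text format over HTTP. The endpoint URL, the `gpu_operator_` prefix, and the metric names in `SAMPLE` are placeholders for illustration, not confirmed names; check the linked metrics.go for the real ones.

```python
import re
import urllib.request
from typing import Dict, Iterable

# Hypothetical sample payload. The metric names are placeholders;
# the real names are defined in validator/metrics.go.
SAMPLE = """\
# HELP gpu_operator_node_driver_ready Placeholder help text
# TYPE gpu_operator_node_driver_ready gauge
gpu_operator_node_driver_ready 1
gpu_operator_node_cuda_ready 0
other_metric{label="x"} 3.5
"""

# Matches: name, optional {labels}, then the sample value.
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Fetch the raw metrics text. The URL is an assumption about your setup."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str, prefix: str = "gpu_operator_") -> Dict[str, float]:
    """Return {metric_name: value} for samples whose name starts with prefix.

    Labels are ignored, so a later sample with the same name overwrites
    an earlier one.
    """
    result: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _LINE.match(line)
        if not match:
            continue
        name, _labels, value = match.groups()
        if not name.startswith(prefix):
            continue
        try:
            result[name] = float(value)
        except ValueError:
            continue  # skip values that are not numeric
    return result


def node_is_validated(metrics: Dict[str, float], required: Iterable[str]) -> bool:
    """True only if every required gauge is present and equal to 1."""
    return all(metrics.get(name) == 1.0 for name in required)
```

In a real cluster you would call something like `parse_metrics(fetch_metrics(url))` against the exporter pod on the target node, then gate the deployment on `node_is_validated(...)` with whichever gauges matter for your workload.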