Open chenditc opened 4 months ago
@chenditc one option is to enable the nvidia-node-status-exporter pod by setting nodeStatusExporter.enabled=true in the ClusterPolicy. This component exposes the following metrics: https://github.com/NVIDIA/gpu-operator/blob/main/validator/metrics.go#L50-L68
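As a rough sketch of how another operator could consume those metrics before deploying a model: the snippet below parses Prometheus text-format output and checks that a set of 0/1 gauges are all 1. The metric names in the sample are placeholders, not the exporter's real names (take those from the metrics.go link above). The function names and any URL passed to `fetch_metrics` are likewise assumptions for illustration.

```python
import urllib.request


def fetch_metrics(url, timeout=5.0):
    """Fetch raw Prometheus text from a /metrics endpoint.

    The URL (service name, port) is an assumption; use whatever address
    the node-status-exporter is reachable at in your cluster.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_gauges(text):
    """Parse Prometheus text exposition format into {metric_name: value}.

    Labels are dropped and the last sample per metric name wins, which is
    enough for simple per-node 0/1 gauges.
    """
    gauges = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{labels} value [timestamp]
            name, _, rest = line.partition("{")
            rest = rest.rpartition("}")[2]
        else:
            # name value [timestamp]
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            gauges[name.strip()] = float(fields[0])
        except ValueError:
            continue  # ignore malformed sample lines
    return gauges


def node_is_validated(text, required):
    """Return True only if every required gauge is present and equals 1."""
    gauges = parse_gauges(text)
    return all(gauges.get(name) == 1.0 for name in required)


if __name__ == "__main__":
    # Placeholder metric names, NOT the exporter's real ones.
    sample = (
        "# TYPE example_driver_ready gauge\n"
        "example_driver_ready 1\n"
        'example_plugin_ready{node="gpu-node-1"} 0\n'
    )
    print(node_is_validated(sample, ["example_driver_ready"]))
    print(node_is_validated(sample, ["example_driver_ready", "example_plugin_ready"]))
```

A missing metric is treated as "not validated", so the check fails closed if the exporter is not running on a node.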
1. Quick Debug Information
2. Issue or feature description
From the code and the pod status, I can see that the nvidia-operator-validator finishes its init container and completes validation. But the result does not seem to be reported anywhere, such as in a node label.
I have an operator that deploys machine learning models. I want to check the GPU validation result before I deploy a model, so that I can ensure there is sufficient hardware instead of hanging in the "FailedToSchedule" state or failing at runtime.
Is there a way to query the validation result? I am thinking of using a node label or some API like /metrics.