Open Dentrax opened 9 months ago
@cdesiniotis @Dentrax I'm interested in this feature. If you don't mind, can I write a PR?
@changhyuni We also have to consider cases where users disable dcgm and dcgm-exporter in their gpu-operator deployments, so we wouldn't always be able to rely on this validation.
However, we have also begun looking into integrating GPU health monitoring components into gpu-operator. We are still in the exploratory phase, but we can keep you posted as we come out with more updates.
Feature Description
AFAIUC, the `validator` app currently only validates the driver installation. It would be great to have additional validation steps for the DCGM installation. These could be enabled with an optional flag. Having a healthy DCGM is important for exporting metrics via `dcgm-exporter`.
The idea is similar to this cookbook: https://github.com/aws/aws-parallelcluster-cookbook/blob/v3.8.0/cookbooks/aws-parallelcluster-slurm/files/default/config_slurm/scripts/health_checks/gpu_health_check.sh
cc @shivamerla for awareness
It should be run for each GPU:
nvidia-smi -L
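
As a rough illustration of the per-GPU idea, here is a minimal sketch that walks the GPUs reported by `nvidia-smi -L` and runs a DCGM diagnostic on each one. The helper names `parse_gpu_indices` and `check_all_gpus` are hypothetical and not part of gpu-operator. The sketch assumes `dcgmi` is installed, the DCGM host engine is reachable, and `dcgmi diag` exits non-zero when a diagnostic fails. The run level `-r 2` is only an example choice.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical, not part of gpu-operator.

# Read `nvidia-smi -L` output on stdin and print one GPU index per line.
# Expected line shape: "GPU 0: <name> (UUID: GPU-...)"
parse_gpu_indices() {
  sed -n 's/^GPU \([0-9][0-9]*\):.*/\1/p'
}

# Run a DCGM diagnostic on every GPU; return non-zero if any GPU fails.
check_all_gpus() {
  failed=0
  for idx in $(nvidia-smi -L | parse_gpu_indices); do
    # Assumption: dcgmi exits non-zero when the diagnostic fails.
    if ! dcgmi diag -i "$idx" -r 2; then
      echo "DCGM diagnostic failed for GPU $idx" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `check_all_gpus` from a validation step would then fail the step if any single GPU fails its diagnostic. Whether the validator should fail hard or only report in that case is a design choice left open here.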