NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Allow annotations to be added to just the nvidia-dcgm-node-exporter daemonset for Datadog monitoring via helm install #681

Open flowinh2o opened 7 months ago

flowinh2o commented 7 months ago

Would it be possible to add dcgmExporter.annotations to the Helm chart? We are using Datadog to monitor our clusters, and it seems the autodiscovery agent (v7.51.0) has a problem with all of the daemonsets carrying the same annotations, as seen below:

Thank you!

  Configuration Errors
  ====================
    gpu-operator/gpu-feature-discovery-nrsl7 (d57f6d2c-e8f2-48e7-9989-4f795acf9b10)
    -------------------------------------------------------------------------------
        annotation ad.datadoghq.com/nvidia-dcgm-exporter.checks is invalid: nvidia-dcgm-exporter doesn't match a container identifier [gpu-feature-discovery toolkit-validation]
    gpu-operator/nvidia-container-toolkit-daemonset-q8f25 (53c136bf-e3ed-4dca-9cce-87f0830312fb)
    --------------------------------------------------------------------------------------------
        annotation ad.datadoghq.com/nvidia-dcgm-exporter.checks is invalid: nvidia-dcgm-exporter doesn't match a container identifier [driver-validation nvidia-container-toolkit-ctr]
    gpu-operator/nvidia-device-plugin-daemonset-9jp9f (3b692b0d-55e9-4ca4-a125-949f019c3618)
    ----------------------------------------------------------------------------------------
        annotation ad.datadoghq.com/nvidia-dcgm-exporter.checks is invalid: nvidia-dcgm-exporter doesn't match a container identifier [nvidia-device-plugin toolkit-validation]
    gpu-operator/nvidia-driver-daemonset-rltzp (91ba4b61-6a1f-4135-a63e-44995fb7acfd)
    ---------------------------------------------------------------------------------
        annotation ad.datadoghq.com/nvidia-dcgm-exporter.checks is invalid: nvidia-dcgm-exporter doesn't match a container identifier [k8s-driver-manager mofed-validation nvidia-driver-ctr nvidia-peermem-ctr]
    gpu-operator/nvidia-mig-manager-5tvkk (ef6705e1-2a9d-4ea8-a23c-e9726e644fb0)
    ----------------------------------------------------------------------------
        annotation ad.datadoghq.com/nvidia-dcgm-exporter.checks is invalid: nvidia-dcgm-exporter doesn't match a container identifier [nvidia-mig-manager toolkit-validation]
    gpu-operator/nvidia-operator-validator-642wm (b72e3d8b-b633-4617-bb62-5ab05585935b)
    -----------------------------------------------------------------------------------
        annotation ad.datadoghq.com/nvidia-dcgm-exporter.checks is invalid: nvidia-dcgm-exporter doesn't match a container identifier [cuda-validation driver-validation nvidia-operator-validator plugin-validation toolkit-validation]
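
The errors above follow one pattern: the annotation key names `nvidia-dcgm-exporter`, but each listed pod has no container by that name. A minimal Python sketch of that check, assuming the key shape `ad.datadoghq.com/<container identifier>.checks` seen in the log. This is not the Datadog agent's actual code; `validate_ad_annotations` is a hypothetical helper, and the container list for the exporter pod in the second call is an assumption.

```python
import re

# Assumed key shape, taken from the error text above:
#   ad.datadoghq.com/<container identifier>.checks
AD_KEY = re.compile(r"^ad\.datadoghq\.com/(?P<ident>[^/]+)\.checks$")


def validate_ad_annotations(annotations, container_names):
    """Return error strings for autodiscovery keys that name no container in the pod."""
    errors = []
    for key in annotations:
        match = AD_KEY.match(key)
        if match is None:
            continue  # not an autodiscovery "checks" annotation
        ident = match.group("ident")
        if ident not in container_names:
            errors.append(
                f"annotation {key} is invalid: {ident} doesn't match "
                f"a container identifier [{' '.join(sorted(container_names))}]"
            )
    return errors


# The same annotation applied to two different pods.
annotations = {"ad.datadoghq.com/nvidia-dcgm-exporter.checks": "{}"}

# Pod without a matching container: one error, as in the log above.
print(validate_ad_annotations(annotations, ["gpu-feature-discovery", "toolkit-validation"]))

# Pod assumed to contain a container named nvidia-dcgm-exporter: no error.
print(validate_ad_annotations(annotations, ["nvidia-dcgm-exporter", "toolkit-validation"]))
```

Under this reading, the annotation is only valid on the pod that actually runs the exporter container, which is why a daemonset-specific annotation setting would silence the errors on the other pods.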

flowinh2o commented 7 months ago

Actually, it looks like the integration works, so this is not strictly needed; it would only eliminate the errors reported by the agent above.

shashiranjan84 commented 7 months ago

@flowinh2o I am getting the same error and I don't see any metrics in DD. How did you manage to see the metrics?

shashiranjan84 commented 7 months ago

Got the metrics working once I fixed the annotation. But I agree, @flowinh2o: we need a dcgmExporter-specific annotation.

changhyuni commented 6 days ago

@shashiranjan84 Is it possible to separate annotations by container? I need to use Datadog's OpenMetrics...
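
For reference, the autodiscovery key already embeds a container name, so each annotation is scoped to one container. A rough Python sketch of building such an annotation for an OpenMetrics check follows. The helper `openmetrics_annotation` is hypothetical, and the container name, port 9400, metric namespace, and metric name are all assumptions to be checked against the actual cluster. How the annotation gets attached to only the exporter pods is outside this sketch.

```python
import json


def openmetrics_annotation(container_name, endpoint, namespace, metrics):
    """Build one autodiscovery annotation scoped to a single container name."""
    key = f"ad.datadoghq.com/{container_name}.checks"
    value = {
        "openmetrics": {
            "instances": [
                {
                    "openmetrics_endpoint": endpoint,
                    "namespace": namespace,
                    "metrics": metrics,
                }
            ]
        }
    }
    # Annotation values are strings, so the check config is serialized as JSON.
    return {key: json.dumps(value)}


annotation = openmetrics_annotation(
    container_name="nvidia-dcgm-exporter",    # assumed container name
    endpoint="http://%%host%%:9400/metrics",  # assumed exporter port and path
    namespace="dcgm",                         # hypothetical metric namespace
    metrics=["DCGM_FI_DEV_GPU_UTIL"],         # example metric name
)
print(json.dumps(annotation, indent=2))
```

Calling the helper once per container name yields one key per container, which is the separation being asked about.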