DataDog / integrations-core

Core integrations of the Datadog Agent
BSD 3-Clause "New" or "Revised" License

Karpenter integration incompatible with Karpenter >= 1.0.0 #18367

Open JacobHenner opened 4 weeks ago

JacobHenner commented 4 weeks ago

Karpenter's 1.0.0 release renames several metrics. After upgrading to 1.0.0 or later, new data points for the previously reported metrics no longer appear in Datadog.

Steps to reproduce the issue:

  1. Upgrade Karpenter from 0.x.y to >= 1.0.0
  2. View Karpenter metrics in Datadog

Describe the results you received:

Several metrics are no longer reported.

Describe the results you expected:

Metrics continue to report (or resume reporting after a datadog-agent upgrade).

Additional information you deem important (e.g. issue happens only occasionally):

I can submit a PR to modify the integration, but I am not sure whether the existing convention is to rename both the input and output metric names, or only the input (to maintain continuity with pre-existing monitors, dashboards, etc.). I'll gladly submit a PR once guidance is provided.

JacobHenner commented 1 week ago

For spectators: I'm told that #18448 is expected to be included in datadog-agent 7.58. In the meantime, you can continue to ingest metrics from Karpenter >= 1.0.0 using the following configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: karpenter
  namespace: kube-system
spec:
  template:
    metadata:
      annotations:
        ad.datadoghq.com/controller.checks: |
          {
            "karpenter": {
              "init_config": {},
              "instances": [
                {
                  "openmetrics_endpoint": "http://%%host%%:8080/metrics",
                  "extra_metrics": [
                    {
                      "karpenter_nodes_termination_duration_seconds": "nodes.termination.time_seconds"
                    },
                    {
                      "karpenter_pods_startup_duration_seconds": "pods.startup.time_seconds"
                    },
                    {
                      "karpenter_voluntary_disruption_queue_failures": "disruption.replacement.nodeclaim.failures"
                    },
                    {
                      "karpenter_voluntary_disruption_decision_evaluation_duration_seconds": "disruption.evaluation.duration_seconds"
                    },
                    {
                      "karpenter_voluntary_disruption_eligible_nodes": "disruption.eligible_nodes"
                    },
                    {
                      "karpenter_voluntary_disruption_consolidation_timeouts": "disruption.consolidation_timeouts"
                    },
                    {
                      "karpenter_nodepools_allowed_disruptions": "disruption.budgets.allowed_disruptions"
                    },
                    {
                      "karpenter_voluntary_disruption_decisions": "disruption.actions_performed"
                    },
                    {
                      "karpenter_scheduler_scheduling_duration_seconds": "provisioner.scheduling.simulation.duration_seconds"
                    },
                    {
                      "karpenter_scheduler_queue_depth": "provisioner.scheduling.queue_depth"
                    },
                    {
                      "karpenter_interruption_message_queue_duration_seconds": "interruption.message.latency.time_seconds"
                    },
                    {
                      "karpenter_nodepools_usage": "nodepool_usage"
                    },
                    {
                      "karpenter_nodepools_limit": "nodepool_limit"
                    }
                  ]
                }
              ]
            }
          }
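
Hand-editing JSON inside a YAML annotation is easy to get wrong, so the annotation value above could also be generated. The following is a minimal sketch, not part of the integration: the helper name `build_checks_annotation` and the `RENAMED_METRICS` dict are hypothetical, and only two of the pairs from the configuration above are repeated for brevity. It assumes the same endpoint as the workaround and the same one-object-per-metric shape for `extra_metrics`.

```python
import json

# Karpenter >= 1.0.0 metric name -> Datadog metric name, as in the workaround above.
# Only two pairs are repeated here (assumption: extend with the remaining pairs as needed).
RENAMED_METRICS = {
    "karpenter_nodes_termination_duration_seconds": "nodes.termination.time_seconds",
    "karpenter_pods_startup_duration_seconds": "pods.startup.time_seconds",
}


def build_checks_annotation(renamed, endpoint="http://%%host%%:8080/metrics"):
    """Return a JSON string usable as the ad.datadoghq.com/controller.checks value.

    Hypothetical helper: it only mirrors the structure of the workaround above.
    """
    instance = {
        "openmetrics_endpoint": endpoint,
        # One single-key object per metric, matching the shape used above.
        "extra_metrics": [{raw: mapped} for raw, mapped in renamed.items()],
    }
    checks = {"karpenter": {"init_config": {}, "instances": [instance]}}
    return json.dumps(checks, indent=2)


if __name__ == "__main__":
    # Print the JSON so it can be pasted under the annotation's "|" block scalar.
    print(build_checks_annotation(RENAMED_METRICS))
```

Because the output comes from `json.dumps`, it is always valid JSON; the indentation under the annotation key still has to be added when pasting it into the manifest.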

aelliottatsonatype commented 5 days ago

If I'm using the helm chart, where does this code go? Is it under the agents section of the chart?

So far I have not been able to get this working.

visokoo commented 3 days ago

> If I'm using the helm chart, where does this code go? Is it under the agents section of the chart?
>
> So far I have not been able to get this working.

It goes under podAnnotations, like:

podAnnotations:
  ad.datadoghq.com/controller.checks: |
    {
      "karpenter": {
        "init_config": {},
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:%%port_1%%/metrics",
            "extra_metrics": [
              {
                "karpenter_nodes_termination_duration_seconds": "nodes.termination.time_seconds"
              },
              {
                "karpenter_pods_startup_duration_seconds": "pods.startup.time_seconds"
              },
...