DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

can't get custom metrics for hpa from datadog #8941

Closed: yishaihl closed this issue 3 years ago

yishaihl commented 3 years ago

Hey guys, I'm trying to set up Datadog as a custom metrics source for my Kubernetes HPA. Following the documentation (https://docs.datadoghq.com/agent/cluster_agent/external_metrics/?tab=daemonset#set-up-the-cluster-agent-external-metric-server), I'm first of all trying to understand whether the 'Kubernetes aggregation layer' is a prerequisite I have to enable myself when running on EKS 1.18, or whether it's enabled by default.

This is what I get from the HPA after applying my configuration:

horizontal-pod-autoscaler  unable to get external metric canary/nginx.net.request_per_s/&LabelSelector{MatchLabels:map[string]string{kube_app_name: nginx,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: **the server is currently unable to handle the request** (get nginx.net.request_per_s.external.metrics.k8s.io)
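
For context, the HPA reads external metrics through an APIService registered for the `external.metrics.k8s.io` group, and a "currently unable to handle the request" response often means the registration exists but its backing service is unreachable or not serving. Below is a minimal sketch of such a registration. It assumes the Cluster Agent's metrics port (8443, as set in the deployment further down) is exposed by a Service named `datadog-cluster-agent-metrics-api` in the `datadog` namespace. The Service name is an assumption and may differ in a given install.

```yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  group: external.metrics.k8s.io
  version: v1beta1
  groupPriorityMinimum: 100
  versionPriority: 100
  # Skips TLS verification of the backing service; acceptable for a sketch only.
  insecureSkipTLSVerify: true
  service:
    # Assumed Service name; check the Services in the datadog namespace.
    name: datadog-cluster-agent-metrics-api
    namespace: datadog
    port: 8443
```

If the `Available` condition on this APIService is `False`, its message usually says why the API server cannot reach the metrics backend.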

These are the errors I'm getting inside the cluster-agent:

datadog-cluster-agent-585897dc8d-x8l82 cluster-agent 2021-08-20 06:46:14 UTC | CLUSTER | ERROR | (pkg/clusteragent/externalmetrics/metrics_retriever.go:77 in retrieveMetricsValues) | Unable to fetch external metrics: [Error while executing metric query avg:nginx.net.request_per_s{kubea_app_name:ingress-nginx}.rollup(30): API error 403 Forbidden: {"status":********@datadoghq.com"}, strconv.Atoi: parsing "": invalid syntax]
# datadog-cluster-agent status
Getting the status from the agent.
2021-08-19 15:28:21 UTC | CLUSTER | WARN | (pkg/util/log/log.go:541 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
===============================
Datadog Cluster Agent (v1.10.0)
===============================

  Status date: 2021-08-19 15:28:21.519850 UTC
  Agent start: 2021-08-19 12:11:44.266244 UTC
  Pid: 1
  Go Version: go1.14.12
  Build arch: amd64
  Agent flavor: cluster_agent
  Check Runners: 4
  Log Level: INFO

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System UTC time: 2021-08-19 15:28:21.519850 UTC

  Hostnames
  =========
    ec2-hostname: ip-10-30-162-8.eu-west-1.compute.internal
    hostname: i-00d0458844a597dec
    instance-id: i-00d0458844a597dec
    socket-fqdn: datadog-cluster-agent-585897dc8d-x8l82
    socket-hostname: datadog-cluster-agent-585897dc8d-x8l82
    hostname provider: aws
    unused hostname providers:
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

  Metadata
  ========

Leader Election
===============
  Leader Election Status:  Running
  Leader Name is: datadog-cluster-agent-585897dc8d-x8l82
  Last Acquisition of the lease: Thu, 19 Aug 2021 12:13:14 UTC
  Renewed leadership: Thu, 19 Aug 2021 15:28:07 UTC
  Number of leader transitions: 17 transitions

Custom Metrics Server
=====================
  External metrics provider uses DatadogMetric - Check status directly from Kubernetes with: `kubectl get datadogmetric`

Admission Controller
====================
  Disabled: The admission controller is not enabled on the Cluster Agent

=========
Collector
=========

  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
      Total Runs: 787
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 660
      Service Checks: Last Run: 3, Total: 2,343
      Average Execution Time : 1.898s
      Last Execution Date : 2021-08-19 15:28:17.000000 UTC
      Last Successful Execution Date : 2021-08-19 15:28:17.000000 UTC

=========
Forwarder
=========

  Transactions
  ============
    Deployments: 350
    Dropped: 0
    DroppedOnInput: 0
    Nodes: 497
    Pods: 3
    ReplicaSets: 576
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Services: 263

  Transaction Successes
  =====================
    Total number: 3442
    Successes By Endpoint:
      check_run_v1: 786
      intake: 181
      orchestrator: 1,689
      series_v1: 786

==========
Endpoints
==========
  https://app.datadoghq.eu - API Key ending with:
      - f295b

=====================
Orchestrator Explorer
=====================
  ClusterID: f7b4f97a-3cf2-11ea-aaa8-0a158f39909c
  ClusterName: production
  ContainerScrubbing: Enabled
  ======================
  Orchestrator Endpoints
  ======================

  ===============
  Forwarder Stats
  ===============
    Pods: 3
    Deployments: 350
    ReplicaSets: 576
    Services: 263
    Nodes: 497

  ===========
  Cache Stats
  ===========
    Elements in the cache: 393
    Pods:
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 7 Miss: 5)
    Deployments:
      Last Run: (Hits: 36 Miss: 1) | Total: (Hits: 40846 Miss: 2444)
    ReplicaSets:
      Last Run: (Hits: 297 Miss: 1) | Total: (Hits: 328997 Miss: 19441)
    Services:
      Last Run: (Hits: 44 Miss: 0) | Total: (Hits: 49520 Miss: 2919)
    Nodes:
      Last Run: (Hits: 9 Miss: 0) | Total: (Hits: 10171 Miss: 755)

And this is what I get from the DatadogMetric:

Name:         dcaautogen-2f116f4425658dca91a33dd22a3d943bae5b74
Namespace:    datadog
Labels:       <none>
Annotations:  <none>
API Version:  datadoghq.com/v1alpha1
Kind:         DatadogMetric
Metadata:
  Creation Timestamp:  2021-08-19T15:14:14Z
  Generation:          1
  Managed Fields:
    API Version:  datadoghq.com/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
      f:status:
        .:
        f:autoscalerReferences:
        f:conditions:
          .:
          k:{"type":"Active"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:status:
            f:type:
          k:{"type":"Error"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:message:
            f:reason:
            f:status:
            f:type:
          k:{"type":"Updated"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:status:
            f:type:
          k:{"type":"Valid"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:status:
            f:type:
        f:currentValue:
    Manager:         datadog-cluster-agent
    Operation:       Update
    Time:            2021-08-19T15:14:44Z
  Resource Version:  164942235
  Self Link:         /apis/datadoghq.com/v1alpha1/namespaces/datadog/datadogmetrics/dcaautogen-2f116f4425658dca91a33dd22a3d943bae5b74
  UID:               6e9919eb-19ca-4131-b079-4a8a9ac577bb
Spec:
  External Metric Name:  nginx.net.request_per_s
  Query:                 avg:nginx.net.request_per_s{kube_app_name:nginx}.rollup(30)
Status:
  Autoscaler References:  canary/hibob-hpa
  Conditions:
    Last Transition Time:  2021-08-19T15:14:14Z
    Last Update Time:      2021-08-19T15:53:14Z
    Status:                True
    Type:                  Active
    Last Transition Time:  2021-08-19T15:14:14Z
    Last Update Time:      2021-08-19T15:53:14Z
    Status:                False
    Type:                  Valid
    Last Transition Time:  2021-08-19T15:14:14Z
    Last Update Time:      2021-08-19T15:53:14Z
    Status:                True
    Type:                  Updated
    Last Transition Time:  2021-08-19T15:14:44Z
    Last Update Time:      2021-08-19T15:53:14Z
    Message:               Global error (all queries) from backend
    Reason:                Unable to fetch data from Datadog
    Status:                True
    Type:                  Error
  Current Value:           0
Events:                    <none>
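
The object above was generated automatically by the Cluster Agent from the HPA (hence the `dcaautogen-` prefix). A DatadogMetric can also be declared explicitly. Below is a minimal sketch that reuses the query from the Spec above. The name `nginx-requests` is illustrative only.

```yaml
apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: nginx-requests   # illustrative name
  namespace: datadog
spec:
  # Same query as in the auto-generated object above.
  query: avg:nginx.net.request_per_s{kube_app_name:nginx}.rollup(30)
```

An HPA would then refer to it with the external metric name `datadogmetric@datadog:nginx-requests`.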

This is my Cluster Agent deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "18"
    meta.helm.sh/release-name: datadog
    meta.helm.sh/release-namespace: datadog
  creationTimestamp: "2021-02-05T07:36:39Z"
  generation: 18
  labels:
    app.kubernetes.io/instance: datadog
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: datadog
    app.kubernetes.io/version: "7"
    helm.sh/chart: datadog-2.7.0
  name: datadog-cluster-agent
  namespace: datadog
  resourceVersion: "164881216"
  selfLink: /apis/apps/v1/namespaces/datadog/deployments/datadog-cluster-agent
  uid: ec52bb4b-62af-4007-9bab-d5d16c48e02c
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: datadog-cluster-agent
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        ad.datadoghq.com/cluster-agent.check_names: '["prometheus"]'
        ad.datadoghq.com/cluster-agent.init_configs: '[{}]'
        ad.datadoghq.com/cluster-agent.instances: |
          [{
            "prometheus_url": "http://%%host%%:5000/metrics",
            "namespace": "datadog.cluster_agent",
            "metrics": [
              "go_goroutines", "go_memstats_*", "process_*",
              "api_requests",
              "datadog_requests", "external_metrics", "rate_limit_queries_*",
              "cluster_checks_*"
            ]
          }]
        checksum/api_key: something
        checksum/application_key: something
        checksum/clusteragent_token: something
        checksum/install_info: something
      creationTimestamp: null
      labels:
        app: datadog-cluster-agent
      name: datadog-cluster-agent
    spec:
      containers:
      - env:
        - name: DD_HEALTH_PORT
          value: "5555"
        - name: DD_API_KEY
          valueFrom:
            secretKeyRef:
              key: api-key
              name: datadog
              optional: true
        - name: DD_APP_KEY
          valueFrom:
            secretKeyRef:
              key: app-key
              name: datadog-appkey
        - name: DD_EXTERNAL_METRICS_PROVIDER_ENABLED
          value: "true"
        - name: DD_EXTERNAL_METRICS_PROVIDER_PORT
          value: "8443"
        - name: DD_EXTERNAL_METRICS_PROVIDER_WPA_CONTROLLER
          value: "false"
        - name: DD_EXTERNAL_METRICS_PROVIDER_USE_DATADOGMETRIC_CRD
          value: "true"
        - name: DD_EXTERNAL_METRICS_AGGREGATOR
          value: avg
        - name: DD_CLUSTER_NAME
          value: production
        - name: DD_SITE
          value: datadoghq.eu
        - name: DD_LOG_LEVEL
          value: INFO
        - name: DD_LEADER_ELECTION
          value: "true"
        - name: DD_COLLECT_KUBERNETES_EVENTS
          value: "true"
        - name: DD_CLUSTER_AGENT_KUBERNETES_SERVICE_NAME
          value: datadog-cluster-agent
        - name: DD_CLUSTER_AGENT_AUTH_TOKEN
          valueFrom:
            secretKeyRef:
              key: token
              name: datadog-cluster-agent
        - name: DD_KUBE_RESOURCES_NAMESPACE
          value: datadog
        - name: DD_ORCHESTRATOR_EXPLORER_ENABLED
          value: "true"
        - name: DD_ORCHESTRATOR_EXPLORER_CONTAINER_SCRUBBING_ENABLED
          value: "true"
        - name: DD_COMPLIANCE_CONFIG_ENABLED
          value: "false"
        image: gcr.io/datadoghq/cluster-agent:1.10.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /live
            port: 5555
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        name: cluster-agent
        ports:
        - containerPort: 5005
          name: agentport
          protocol: TCP
        - containerPort: 8443
          name: metricsapi
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /ready
            port: 5555
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/datadog-agent/install_info
          name: installinfo
          readOnly: true
          subPath: install_info
      dnsConfig:
        options:
        - name: ndots
          value: "3"
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: datadog-cluster-agent
      serviceAccountName: datadog-cluster-agent
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: datadog-installinfo
        name: installinfo
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2021-05-13T15:46:33Z"
    lastUpdateTime: "2021-05-13T15:46:33Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2021-02-05T07:36:39Z"
    lastUpdateTime: "2021-08-19T12:12:06Z"
    message: ReplicaSet "datadog-cluster-agent-585897dc8d" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 18
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

Describe what happened: I'm trying to set up an HPA with a Datadog custom metric by following the official guide (https://docs.datadoghq.com/agent/cluster_agent/external_metrics/?tab=helm), and for some reason the HPA can't fetch the metrics.
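
The HPA manifest itself is not included in this report. Below is a minimal sketch of an HPA consistent with the event and the DatadogMetric shown earlier (name `hibob-hpa`, namespace `canary`, metric `nginx.net.request_per_s`, label `kube_app_name: nginx`). The scale target, replica bounds, and threshold are hypothetical placeholders.

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: hibob-hpa
  namespace: canary
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx              # hypothetical target
  minReplicas: 1             # hypothetical
  maxReplicas: 5             # hypothetical
  metrics:
  - type: External
    external:
      metric:
        name: nginx.net.request_per_s
        selector:
          matchLabels:
            kube_app_name: nginx
      target:
        type: AverageValue
        averageValue: "10"   # hypothetical threshold
```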

Additional environment details (Operating System, Cloud provider, etc): EKS 1.18, Datadog Cluster Agent v1.10.0

yishaihl commented 3 years ago

@Simwar hey, how are you? I saw you're kind of the expert here. Any chance you can help me out or point me in the right direction, please?

abhishek17feb commented 1 week ago

Hey, I am facing a similar issue while fetching the custom metric for the HPA. I'm getting this error:

Error from server (Forbidden): error when retrieving current configuration of: Resource: "datadoghq.com/v1alpha1, Resource=datadogmetrics", GroupVersionKind: "datadoghq.com/v1alpha1, Kind=DatadogMetric" Name: "nginx-requests", Namespace: "iescapitalcloud-saas-abhishek" from server for: ".\datadog_metric_config.yml": datadogmetrics.datadoghq.com "nginx-requests" is forbidden: User "u-hbo6ghc7dw" cannot get resource "datadogmetrics" in API group "datadoghq.com" in the namespace "iescapitalcloud-saas-abhishek"