DataDog / datadog-operator

Kubernetes Operator for Datadog Resources
Apache License 2.0

Operator fails to reconcile when enabling external metrics provider #382

Closed: andysnowden closed this issue 3 years ago

andysnowden commented 3 years ago

Describe what happened: The operator failed to reconcile a change to the DatadogAgent config that enables the external metrics provider.

{"level":"INFO","ts":"2021-09-23T22:21:25Z","logger":"controllers.DatadogAgent","msg":"Reconciling DatadogAgent","datadogagent":"datadog/datadog"}
{"level":"ERROR","ts":"2021-09-23T22:21:25Z","logger":"controller-runtime.manager.controller.datadogagent","msg":"Reconciler error","reconciler group":"datadoghq.com","reconciler kind":"DatadogAgent","name":"datadog","namespace":"datadog","error":"clusterroles.rbac.authorization.k8s.io \"datadog-cluster-agent\" is forbidden: user \"system:serviceaccount:datadog:operator-datadog-operator\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:datadog\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"datadoghq.com\"], Resources:[\"extendeddaemonsetreplicasets\"], Verbs:[\"get\"]}"}

DatadogAgent config:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"datadoghq.com/v1alpha1","kind":"DatadogAgent","metadata":{"annotations":{},"name":"datadog","namespace":"datadog"},"spec":{"agent":{"image":{"name":"gcr.io/datadoghq/agent:latest"}},"clusterAgent":{"image":{"name":"gcr.io/datadoghq/cluster-agent:latest"}},"credentials":{"apiSecret":{"keyName":"api-key","secretName":"datadog-secret"},"appSecret":{"keyName":"app-key","secretName":"datadog-secret"}}}}
  creationTimestamp: "2021-09-23T22:13:12Z"
  finalizers:
  - finalizer.agent.datadoghq.com
  generation: 6
  name: datadog
  namespace: datadog
  resourceVersion: "342519002"
  uid: 81388ecf-6f68-4762-a8d4-68b199afb398
spec:
  agent:
    image:
      name: gcr.io/datadoghq/agent:latest
  clusterAgent:
    config:
      admissionController:
        enabled: true
      externalMetrics:
        enabled: true
    image:
      name: gcr.io/datadoghq/cluster-agent:latest
  clusterChecksRunner: {}
  credentials:
    apiSecret:
      keyName: api-key
      secretName: datadog-secret
    appSecret:
      keyName: app-key
      secretName: datadog-secret
  features: {}
status:
  agent:
    available: 9
    current: 9
    currentHash: 42ce84f28b6184f205126ff7db90f26f
    daemonsetName: datadog-agent
    desired: 9
    lastUpdate: "2021-09-23T22:25:10Z"
    ready: 9
    state: Running
    status: Running (9/9/9)
    upToDate: 9
  clusterAgent:
    availableReplicas: 1
    currentHash: 8ddc4a9732973e98c9158a26d9031924
    deploymentName: datadog-cluster-agent
    generatedToken: QKUWgsARtiJnvteDfXCfnmeiZaKkaQng
    lastUpdate: "2021-09-23T22:13:13Z"
    readyReplicas: 1
    replicas: 1
    state: Running
    status: Running (1/1/1)
    updatedReplicas: 1
  conditions:
  - lastTransitionTime: "2021-09-23T22:25:21Z"
    lastUpdateTime: "2021-09-23T22:25:29Z"
    message: DatadogAgent reconcile error
    status: "False"
    type: Active
  - lastTransitionTime: "2021-09-23T22:13:12Z"
    lastUpdateTime: "2021-09-23T22:25:27Z"
    message: Datadog metrics forwarding ok
    status: "True"
    type: ActiveDatadogMetrics
  - lastTransitionTime: "2021-09-23T22:25:21Z"
    lastUpdateTime: "2021-09-23T22:25:29Z"
    message: |-
      clusterroles.rbac.authorization.k8s.io "datadog-cluster-agent" is forbidden: user "system:serviceaccount:datadog:operator-datadog-operator" (groups=["system:serviceaccounts" "system:serviceaccounts:datadog" "system:authenticated"]) is attempting to grant RBAC permissions not currently held:
      {APIGroups:["datadoghq.com"], Resources:["extendeddaemonsetreplicasets"], Verbs:["get"]}
    status: "True"
    type: ReconcileError
  defaultOverride:
    agent:
      apm:
        enabled: false
      config:
        collectEvents: false
        dogstatsd:
          dogstatsdOriginDetection: false
          unixDomainSocket:
            enabled: false
            hostFilepath: /var/run/datadog/statsd.sock
        healthPort: 5555
        leaderElection: false
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /live
            port: 5555
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        logLevel: INFO
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /ready
            port: 5555
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
      deploymentStrategy:
        canary:
          autoFail:
            enabled: true
            maxRestarts: 5
          autoPause:
            enabled: true
            maxRestarts: 2
          duration: 5m0s
          nodeSelector: {}
          replicas: 1
        reconcileFrequency: 10s
        rollingUpdate:
          maxParallelPodCreation: 250
          maxPodSchedulerFailure: 10%
          maxUnavailable: 10%
          slowStartAdditiveIncrease: "5"
          slowStartIntervalDuration: 1m0s
        updateStrategyType: RollingUpdate
      enabled: true
      image:
        pullPolicy: IfNotPresent
      networkPolicy:
        create: false
      process:
        enabled: false
      rbac:
        create: true
      security:
        compliance:
          enabled: false
        runtime:
          enabled: false
          syscallMonitor:
            enabled: false
      systemProbe:
        enabled: false
      useExtendedDaemonset: false
    clusterAgent:
      config:
        admissionController:
          mutateUnlabelled: false
          serviceName: datadog-admission-controller
        clusterChecksEnabled: false
        collectEvents: false
        externalMetrics:
          port: 8443
        healthPort: 5555
        logLevel: INFO
      enabled: true
      image:
        pullPolicy: IfNotPresent
      networkPolicy:
        create: false
      rbac:
        create: true
    clusterChecksRunner:
      enabled: false
    credentials:
      useSecretBackend: false
    features:
      kubeStateMetricsCore:
        clusterCheck: false
        enabled: false
      logCollection:
        enabled: false
      networkMonitoring:
        enabled: false
      orchestratorExplorer:
        clusterCheck: false
        enabled: true
        scrubbing:
          containers: true
      prometheusScrape:
        enabled: false

It looks like the operator's RBAC is missing a permission: get on extendeddaemonsetreplicasets in the datadoghq.com API group.
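Kubernetes refuses to let a service account create or update a ClusterRole that grants permissions the account does not itself hold, which is what the "attempting to grant RBAC permissions not currently held" message reports. As a rough sketch only, the rule below is what the operator's own ClusterRole would need to include, copied from the error message. The ClusterRole name is an assumption for illustration; the real name depends on how the chart was installed.

```yaml
# Sketch only: metadata.name is hypothetical, not taken from the chart.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: operator-datadog-operator   # assumed name
rules:
  # The rule reported as "not currently held" in the reconcile error.
  - apiGroups:
      - datadoghq.com
    resources:
      - extendeddaemonsetreplicasets
    verbs:
      - get
```

A ClusterRole applied this way replaces the existing rules rather than merging with them, so in practice this rule would be added to the operator's existing rule list instead of being applied on its own.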

Describe what you expected: The operator should accept the changes, create the new APIService, and enable external metrics.

Steps to reproduce the issue:

  1. Install the Datadog operator: helm install operator datadog/datadog-operator -n datadog
  2. Deploy the example config from here
  3. Edit the config and add the YAML snippet for external metrics (Example)
  4. The operator reports the reconcile error shown above :(

Additional environment details (Operating System, Cloud provider, etc):

  K8S: 1.21.4
  Operator: {"level":"INFO","ts":"2021-09-23T22:32:33Z","logger":"setup","msg":"Version: v0.7.0"}

vboulineau commented 3 years ago

Hello @andysnowden,

Yes, this is an issue we've already fixed in https://github.com/DataDog/datadog-operator/pull/379; the fix will be released in 0.7.1. The issue comes from the admissionController feature.

clamoriniere commented 3 years ago

Hi @andysnowden,

We have released operator chart 0.7.1, which contains the missing RBAC for the operator. Could you please upgrade to the new chart version and try deploying the DatadogAgent resource again? 🙇