DataDog / datadog-operator

Kubernetes Operator for Datadog Resources
Apache License 2.0

Creation and Deletion loop of cluster-agent ClusterRole and ClusterRoleBindings #434

Closed lukedoesinfra closed 2 years ago

lukedoesinfra commented 2 years ago

Describe what happened: With the configuration below on the DatadogAgent, two things happen.

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog-agent
  namespace: platform
spec:
  site: datadoghq.eu
  credentials:
    apiKeyExistingSecret: <redacted>
    appKeyExistingSecret: <redacted>
  features:
    logCollection:
      enabled: true
      logsConfigContainerCollectAll: true
  agent:
    image:
      name: datadog/agent:7
    apm:
      enabled: true
    process:
      enabled: true
      processCollectionEnabled: true
    log:
      enabled: true
    systemProbe:
      bpfDebugEnabled: true
    security:
      compliance:
        enabled: true
      runtime:
        enabled: false
  clusterAgent:
    image:
      name: datadog/cluster-agent:1.17.0
    config:
      externalMetrics:
        enabled: true
      admissionController:
        enabled: true

The cluster-agent itself reports a RBAC issue:

subjectaccessreviews.authorization.k8s.io is forbidden: User "system:serviceaccount:platform:datadog-agent-cluster-agent" cannot create resource "subjectaccessreviews" in API group "authorization.k8s.io" at the cluster scope
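
For context, creating subjectaccessreviews is one of the permissions granted by the built-in Kubernetes ClusterRole system:auth-delegator. A binding of roughly the following shape would grant it to the cluster-agent service account. This is only a sketch: the binding name is taken from the events in this report, the service account and namespace from the error above, and it is an assumption that the operator manages the binding this way.

```yaml
# Sketch only (assumed shape, not taken from the operator's source):
# binds the built-in system:auth-delegator ClusterRole, which allows
# creating tokenreviews and subjectaccessreviews, to the service account
# named in the error message.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-agent-cluster-agent-auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
  - kind: ServiceAccount
    name: datadog-agent-cluster-agent
    namespace: platform
```

If a binding like this is repeatedly deleted or rewritten, the service account would lose the permission intermittently, which would be consistent with the error above.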

The operator reports (only at debug level) that it is constantly deleting ClusterRoles and ClusterRoleBindings.

Example:

 {"level":"DEBUG","ts":"2022-01-27T15:07:16Z","logger":"controllers.DatadogAgent","msg":"deleteClusterRole","datadogagent":"platform/datadog-agent","clusterRole.name":"datadog-agent-cluster-agent-metrics-reader","clusterRole.Namespace":""}

Looking at the events for the DatadogAgent, this happened thousands of times in 34 minutes (2973 updates of one ClusterRoleBinding and 4471 deletes of another):

  Normal  Update ClusterRoleBinding   24m (x2973 over 34m)  DatadogAgent  /datadog-agent-cluster-agent-auth-delegator
  Normal  Delete ClusterRoleBinding   19m (x4471 over 34m)  DatadogAgent  /datadog-agent-cluster-agent-metrics-reader

Describe what you expected: No RBAC errors, and the ClusterRoles and ClusterRoleBindings are not deleted and recreated over and over.

Steps to reproduce the issue: Setting both externalMetrics and admissionController to enabled: false makes the issue go away, so the cause seems to be related to those features.

  clusterAgent:
    image:
      name: datadog/cluster-agent:1.17.0
    config:
      externalMetrics:
        enabled: false
      admissionController:
        enabled: false

Additional environment details (Operating System, Cloud provider, etc):

AWS EKS, using Bottlerocket.

lukedoesinfra commented 2 years ago

Just to confirm versions.

I'm using the Operator Helm chart (0.7.8), which deploys image version 0.7.2.

lukedoesinfra commented 2 years ago

Closing this, as I think this is expected behaviour when the custom metrics server doesn't exist.

davidor commented 2 years ago

This PR should fix the issue: https://github.com/DataDog/datadog-operator/pull/441