DataDog / helm-charts

Helm charts for Datadog products
Apache License 2.0

[datadog-operator]Missing permissions "system:serviceaccount:datadog:datadog-cluster-agent" cannot list resource "customresourcedefinitions" in API group "apiextensions.k8s.io" at the cluster scope #1422

Open eldarmus opened 3 weeks ago

eldarmus commented 3 weeks ago

Describe what happened: We observed these logs in the datadog-cluster-agent pod:

2024-06-14 11:18:32 UTC | CLUSTER | ERROR | (apimachinery@v0.28.6/pkg/util/runtime/runtime.go:115 in logError) | pkg/mod/k8s.io/client-go@v0.28.6/tools/cache/reflector.go:229: Failed to watch *v1.CustomResourceDefinition: failed to list *v1.CustomResourceDefinition: customresourcedefinitions.apiextensions.k8s.io is forbidden: User "system:serviceaccount:datadog:datadog-cluster-agent" cannot list resource "customresourcedefinitions" in API group "apiextensions.k8s.io" at the cluster scope
2024-06-14 11:18:36 UTC | CLUSTER | WARN | (pkg/collector/corechecks/cluster/orchestrator/collector_bundle.go:332 in Run) | check:orchestrator | Collector apiextensions.k8s.io/v1/customresourcedefinitions is skipped: couldn't sync informer apiextensions.k8s.io/v1/customresourcedefinitions in 1m5.000980622s
2024-06-14 11:18:46 UTC | CLUSTER | WARN | (pkg/collector/corechecks/cluster/orchestrator/collector_bundle.go:332 in Run) | check:orchestrator | Collector apiextensions.k8s.io/v1/customresourcedefinitions is skipped: couldn't sync informer apiextensions.k8s.io/v1/customresourcedefinitions in 1m5.000980622s
2024-06-14 11:18:56 UTC | CLUSTER | WARN | (pkg/collector/corechecks/cluster/orchestrator/collector_bundle.go:332 in Run) | check:orchestrator | Collector apiextensions.k8s.io/v1/customresourcedefinitions is skipped: couldn't sync informer apiextensions.k8s.io/v1/customresourcedefinitions in 1m5.000980622s

Here is the definition of the datadog-cluster-agent ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: "2024-06-14T10:51:38Z"
  labels:
    app.kubernetes.io/instance: datadog
    app.kubernetes.io/managed-by: datadog-operator
    app.kubernetes.io/name: datadog-agent-deployment
    app.kubernetes.io/part-of: datadog-datadog
    app.kubernetes.io/version: ""
    operator.datadoghq.com/managed-by-store: "true"
  name: datadog-cluster-agent
  resourceVersion: "44591376"
  uid: 9b545e67-5f08-4c9e-9983-c910c1b5fbcb
rules:
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  verbs:
  - get
  - list
  - watch
  - create
  - update
- apiGroups:
  - datadoghq.com
  resources:
  - extendeddaemonsetreplicasets
  verbs:
  - get
- apiGroups:
  - apps
  resources:
  - deployments
  - replicasets
  - statefulsets
  - daemonsets
  verbs:
  - get
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - list
  - watch
  - get
- apiGroups:
  - batch
  resources:
  - cronjobs
  verbs:
  - list
  - watch
  - get
- apiGroups:
  - ""
  resources:
  - services
  - events
  - endpoints
  - pods
  - nodes
  - componentstatuses
  - configmaps
  - namespaces
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - quota.openshift.io
  resources:
  - clusterresourcequotas
  verbs:
  - get
  - list
- nonResourceURLs:
  - /version
  - /healthz
  verbs:
  - get
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resourceNames:
  - kube-system
  resources:
  - namespaces
  verbs:
  - get

Describe what you expected: No error logs

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc): Datadog-operator chart version: datadog-operator-1.7.1

fanny-jiang commented 2 weeks ago

Hi @eldarmus, I'm unable to reproduce the error logs that you're seeing and my datadog-cluster-agent ClusterRole has the same permissions. I tested on a kind cluster with kubernetes version 1.27. Can you please share the kubernetes version you're using?

eldarmus commented 2 weeks ago

@fanny-jiang thank you for looking into it. My Kubernetes version is v1.28.2 and my Datadog chart version is datadog-operator-1.8.1.

eldarmus commented 2 weeks ago

Here is my DatadogAgent manifest

kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  global:
    site: us5.datadoghq.com
    credentials:
      apiSecret:
        secretName: datadog-operator-apikey
        keyName: api-key
      appSecret:
        secretName: datadog-operator-appkey
        keyName: app-key
    kubelet:
      tlsVerify: false
    clusterName: production-cluster
    tags:
      - team:production-team
      - env:production
  override:
    clusterAgent:
      image:
        name: gcr.io/datadoghq/cluster-agent:latest
    nodeAgent:
      image:
        name: gcr.io/datadoghq/agent:latest
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
        operator: Exists
  features:
    logCollection:
      enabled: true
      containerCollectAll: true
    prometheusScrape:
      enabled: true
      enableServiceEndpoints: true
    eventCollection:
      collectKubernetesEvents: true

khewonc commented 6 days ago

@eldarmus I also wasn't able to reproduce the error on my kind cluster (k8s v1.29, datadog-operator-1.8.1). Are your operator pod and the DatadogAgent object in the same namespace? Could you check for a clusterrole with the name <namespace>-<dda-name>-orch-exp-dca? I think in your case, it would be datadog-datadog-orch-exp-dca. It should have this permission:

$ kubectl get -oyaml clusterrole <namespace>-<dda-name>-orch-exp-dca
[...]
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - list
  - watch

If that's there, then could you also check the clusterrolebinding <namespace>-<dda-name>-orch-exp-dca? It should reference both the clusterrole from above and the service account for the dca <namespace>/datadog-cluster-agent. In my example, my operator and dda are in the default namespace:

$ kubectl describe clusterrolebinding <namespace>-<dda-name>-orch-exp-dca
[...]
Role:
  Kind:  ClusterRole
  Name:  <namespace>-<dda-name>-orch-exp-dca
Subjects:
  Kind            Name                   Namespace
  ----            ----                   ---------
  ServiceAccount  datadog-cluster-agent  default

With the serviceaccount from the clusterrolebinding listed in your dca pod:

$ kubectl get pod <dca-pod> -oyaml
[...]
  serviceAccount: datadog-cluster-agent
  serviceAccountName: datadog-cluster-agent

That should give the dca pod permission to list CRDs in the apiextensions.k8s.io group named in your error message. Maybe also check that automountServiceAccountToken: false is not set on the serviceaccount or the dca pod. We don't set that in the operator, but it could be set automatically by some clusters or policies.
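
As a quicker cross-check of the steps above, the API server can be asked directly whether the service account is allowed to list CRDs. This is only a minimal sketch: it assumes the datadog namespace and the datadog-cluster-agent service account from the error message, and that the user running kubectl is itself allowed to impersonate service accounts (typically a cluster admin). The variable names are illustrative.

```shell
#!/bin/sh
# Namespace and service account name are taken from the error message in this issue.
NAMESPACE="datadog"
SERVICE_ACCOUNT="datadog-cluster-agent"
SA_USER="system:serviceaccount:${NAMESPACE}:${SERVICE_ACCOUNT}"

if command -v kubectl >/dev/null 2>&1; then
  # Ask the API server whether the service account may list CRDs.
  # Prints "yes" or "no". --as impersonates the service account.
  kubectl auth can-i list customresourcedefinitions.apiextensions.k8s.io --as="${SA_USER}"

  # Print automountServiceAccountToken from the service account.
  # An empty line means the field is not set.
  kubectl get serviceaccount "${SERVICE_ACCOUNT}" -n "${NAMESPACE}" \
    -o jsonpath='{.automountServiceAccountToken}{"\n"}'
else
  echo "kubectl not found; skipping checks" >&2
fi
```

A "no" from the first command would match the forbidden error in the cluster agent logs and point at the ClusterRole or ClusterRoleBinding rather than at the pod.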

eldarmus commented 8 hours ago

@khewonc
1) Yes, they are both in the same namespace.
2) customresourcedefinitions is not listed in the datadog-datadog-orch-exp-dca ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/instance: datadog
    app.kubernetes.io/managed-by: datadog-operator
    app.kubernetes.io/name: datadog-agent-deployment
    app.kubernetes.io/part-of: datadog-datadog
    app.kubernetes.io/version: ""
    operator.datadoghq.com/managed-by-store: "true"
  name: datadog-datadog-orch-exp-dca
rules:
- apiGroups:
  - ""
  resourceNames:
  - kube-system
  resources:
  - namespaces
  verbs:
  - get
- apiGroups:
  - ""
  resourceNames:
  - datadog-cluster-id
  resources:
  - configmaps
  verbs:
  - get
  - create
  - update
- apiGroups:
  - ""
  resources:
  - pods
  - services
  - nodes
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - deployments
  - replicasets
  - daemonsets
  - statefulsets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - jobs
  - cronjobs
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - persistentvolumes
  - persistentvolumeclaims
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - serviceaccounts
  verbs:
  - list
  - watch
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - roles
  - rolebindings
  - clusterroles
  - clusterrolebindings
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling.k8s.io
  resources:
  - verticalpodautoscalers
  verbs:
  - list
  - watch

3) The datadog-datadog-orch-exp-dca ClusterRoleBinding:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/instance: datadog
    app.kubernetes.io/managed-by: datadog-operator
    app.kubernetes.io/name: datadog-agent-deployment
    app.kubernetes.io/part-of: datadog-datadog
    app.kubernetes.io/version: ""
    operator.datadoghq.com/managed-by-store: "true"
  name: datadog-datadog-orch-exp-dca
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-datadog-orch-exp-dca
subjects:
- kind: ServiceAccount
  name: datadog-cluster-agent
  namespace: datadog
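
Since the binding above looks correct and only the customresourcedefinitions rule is missing from the ClusterRole, one possible stopgap is to grant that permission through a separate ClusterRole, because RBAC rules are additive. This is a sketch under assumptions, not a fix confirmed in this thread: the name datadog-cluster-agent-crd-read is hypothetical, chosen so that it does not collide with operator-managed objects, and it does not explain why the operator omitted the rule. The commands only print the manifests for review.

```shell
#!/bin/sh
NAMESPACE="datadog"
SERVICE_ACCOUNT="datadog-cluster-agent"
# Hypothetical name for the supplementary role and binding (an assumption).
EXTRA_ROLE="datadog-cluster-agent-crd-read"

if command -v kubectl >/dev/null 2>&1; then
  # --dry-run=client -o yaml prints the manifest instead of creating the object.
  kubectl create clusterrole "${EXTRA_ROLE}" \
    --verb=list,watch \
    --resource=customresourcedefinitions.apiextensions.k8s.io \
    --dry-run=client -o yaml
  echo "---"
  # Bind the supplementary role to the cluster agent's service account.
  kubectl create clusterrolebinding "${EXTRA_ROLE}" \
    --clusterrole="${EXTRA_ROLE}" \
    --serviceaccount="${NAMESPACE}:${SERVICE_ACCOUNT}" \
    --dry-run=client -o yaml
else
  echo "kubectl not found; skipping" >&2
fi
```

Dropping the --dry-run=client -o yaml flags would create the objects. They can be deleted again once the operator-managed ClusterRole contains the rule.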