dask / dask-kubernetes

Native Kubernetes integration for Dask
https://kubernetes.dask.org
BSD 3-Clause "New" or "Revised" License

dask-kubernetes-operator RBAC error running in single namespace deployment mode #916

Closed: willyyang closed this 3 weeks ago

willyyang commented 4 weeks ago

Describe the issue: When running dask-kubernetes-operator in single-namespace mode (without a cluster role), the operator fails with RBAC permission errors because it tries to list resources at the cluster scope, despite being configured for namespace-scoped operation. I am trying to set up Dask Kubernetes using only Roles/RoleBindings, as the default namespace-bound Helm deployment provides.

Minimal Complete Verifiable Example:

# 1. Install the operator
helm install -n tst --generate-name dask/dask-kubernetes-operator \
  --set rbac.cluster=false \
  --set kopOfArgs="{--namespace=tst}"
# 2. Output of: helm get manifest dask-kubernetes-operator-xxx -n tst
---
# Source: dask-kubernetes-operator/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dask-kubernetes-operator-1730138415
  labels:
    helm.sh/chart: dask-kubernetes-operator-2024.9.0
    app.kubernetes.io/name: dask-kubernetes-operator
    app.kubernetes.io/instance: dask-kubernetes-operator-1730138415
    app.kubernetes.io/version: "2022.4.1"
    app.kubernetes.io/managed-by: Helm
---
# Source: dask-kubernetes-operator/templates/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dask-kubernetes-operator-1730138415-role
rules:
  # Framework: knowing which other operators are running (i.e. peering).
  - apiGroups: [kopf.dev]
    resources: [clusterkopfpeerings]
    verbs: [list, watch, patch, get]

  # Framework: runtime observation of namespaces & CRDs (addition/deletion).
  - apiGroups: [apiextensions.k8s.io]
    resources: [customresourcedefinitions]
    verbs: [list, watch]
  - apiGroups: [""]
    resources: [namespaces]
    verbs: [list, watch]

  # Framework: admission webhook configuration management.
  - apiGroups:
      [admissionregistration.k8s.io/v1, admissionregistration.k8s.io/v1beta1]
    resources: [validatingwebhookconfigurations, mutatingwebhookconfigurations]
    verbs: [create, patch]

  # Application: watching & handling for the custom resource we declare.
  - apiGroups: [kubernetes.dask.org]
    resources: [daskclusters, daskworkergroups, daskjobs, daskjobs/status, daskautoscalers, daskworkergroups/scale]
    verbs: [get, list, watch, patch, create, delete]

  # Application: other resources it produces and manipulates.
  # Here, we create/delete Pods.
  - apiGroups: [""]
    resources: [pods, pods/status]
    verbs: ["*"]

  - apiGroups: [""]
    resources: [services, services/status]
    verbs: ["*"]

  - apiGroups: ["apps"]
    resources: [deployments, deployments/status]
    verbs: ["*"]

  - apiGroups: ["", events.k8s.io]
    resources: [events]
    verbs: ["*"]
---
# Source: dask-kubernetes-operator/templates/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dask-kubernetes-operator-1730138415-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: dask-kubernetes-operator-1730138415-role
subjects:
  - kind: ServiceAccount
    name: dask-kubernetes-operator-1730138415
    namespace: tst
---
# Source: dask-kubernetes-operator/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dask-kubernetes-operator-1730138415
  labels:
    helm.sh/chart: dask-kubernetes-operator-2024.9.0
    app.kubernetes.io/name: dask-kubernetes-operator
    app.kubernetes.io/instance: dask-kubernetes-operator-1730138415
    app.kubernetes.io/version: "2022.4.1"
    app.kubernetes.io/managed-by: Helm
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app.kubernetes.io/name: dask-kubernetes-operator
      app.kubernetes.io/instance: dask-kubernetes-operator-1730138415
  template:
    metadata:
      labels:
        app.kubernetes.io/name: dask-kubernetes-operator
        app.kubernetes.io/instance: dask-kubernetes-operator-1730138415
    spec:
      serviceAccountName: dask-kubernetes-operator-1730138415
      securityContext:
        {}
      containers:
        - name: dask-kubernetes-operator
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
              - ALL
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 1000
          image: "ghcr.io/dask/dask-kubernetes-operator:2024.9.0"
          imagePullPolicy: IfNotPresent
          env:
          args:
            - --liveness=http://0.0.0.0:8080/healthz
            - --all-namespaces
          resources:
            {}
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
          volumeMounts:
            []
      volumes:
        []
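
The operator logs can be pulled with, for example (using the chart's standard labels from the manifest above):

kubectl logs -n tst -l app.kubernetes.io/name=dask-kubernetes-operator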

When reviewing the logs from dask-kubernetes-operator-xx, I see the following RBAC errors. The common theme is that the ServiceAccount is trying to list resources at the cluster scope but does not have permission to do so:

pods is forbidden: User "system:serviceaccount:tst:dask-kubernetes-operator-1730139688" cannot list resource "pods" in API group "" at the cluster scope

daskautoscalers.kubernetes.dask.org is forbidden: User "system:serviceaccount:tst:dask-kubernetes-operator-1730139688" cannot list resource "daskautoscalers" in API group "kubernetes.dask.org" at the cluster scope

daskclusters.kubernetes.dask.org is forbidden: User "system:serviceaccount:tst:dask-kubernetes-operator-1730139688" cannot list resource "daskclusters" in API group "kubernetes.dask.org" at the cluster scope

daskworkergroups.kubernetes.dask.org is forbidden: User "system:serviceaccount:tst:dask-kubernetes-operator-1730139688" cannot list resource "daskworkergroups" in API group "kubernetes.dask.org" at the cluster scope

daskjobs.kubernetes.dask.org is forbidden: User "system:serviceaccount:tst:dask-kubernetes-operator-1730139688" cannot list resource "daskjobs" in API group "kubernetes.dask.org" at the cluster scope

services is forbidden: User "system:serviceaccount:tst:dask-kubernetes-operator-1730139688" cannot list resource "services" in API group "" at the cluster scope
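
These errors are consistent with the Role only granting namespace-scoped access. For example, a quick check with kubectl auth can-i (using the ServiceAccount name from the logs) shows the permissions stop at the namespace boundary:

kubectl auth can-i list pods -n tst \
  --as=system:serviceaccount:tst:dask-kubernetes-operator-1730139688
# yes: the Role covers pods in the tst namespace
kubectl auth can-i list pods --all-namespaces \
  --as=system:serviceaccount:tst:dask-kubernetes-operator-1730139688
# no: nothing grants list at the cluster scope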

I have tried creating RoleBindings that reference ClusterRoles, which resulted in the same errors:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dask-operator-namespaced-role
rules:
- apiGroups:
  - ""  # Core API group
  resources:
  - pods
  - services
  - events
  - configmaps
  - secrets
  verbs:
  - "*"
- apiGroups:
  - apps
  resources:
  - deployments
  verbs:
  - "*"
- apiGroups:
  - kubernetes.dask.org
  resources:
  - daskclusters
  - daskworkergroups
  - daskjobs
  - daskautoscalers
  - daskclusters/status
  - daskworkergroups/status
  - daskjobs/status
  - daskautoscalers/status
  verbs:
  - "*"
- apiGroups:
  - kopf.dev
  resources:
  - kopfpeerings
  verbs:
  - "*"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dask-operator-binding
  namespace: tst
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dask-operator-namespaced-role
subjects:
- kind: ServiceAccount
  name: dask-kubernetes-operator-1730139688
  namespace: tst

If I instead bind the ClusterRole with a ClusterRoleBinding, then dask-kubernetes-operator initializes without any errors. Is it not possible to run dask-kubernetes-operator without a ClusterRoleBinding?

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dask-operator-cluster-role
rules:
- apiGroups:
  - ""  # Core API group
  resources:
  - pods
  - services
  - events
  - configmaps
  - secrets
  verbs:
  - "*"
- apiGroups:
  - apps
  resources:
  - deployments
  verbs:
  - "*"
- apiGroups:
  - kubernetes.dask.org
  resources:
  - daskclusters
  - daskworkergroups
  - daskjobs
  - daskautoscalers
  - daskclusters/status
  - daskworkergroups/status
  - daskjobs/status
  - daskautoscalers/status
  verbs:
  - "*"
- apiGroups:
  - kopf.dev
  resources:
  - kopfpeerings
  verbs:
  - "*"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dask-operator-cluster-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dask-operator-cluster-role
subjects:
- kind: ServiceAccount
  name: dask-kubernetes-operator-1730139688
  namespace: tst


jacobtomlinson commented 3 weeks ago

I think we may need to tell kopf to only watch at the namespace scope. You can do this with the --namespace argument when starting up kopf.

Could you try adding this to the kopfArgs in your values.yaml when installing the helm chart and see if that resolves the problem?
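
For example, in your values.yaml (a minimal sketch, using the kopfArgs chart value):

kopfArgs:
  - --namespace=tst

This should replace the chart's default --all-namespaces in the controller's container args.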

jacobtomlinson commented 3 weeks ago

The error messages suggest that this isn't having the desired effect. Could you describe the controller Pod and verify the exact command and arguments that are being passed?
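
For example:

kubectl describe pod -n tst -l app.kubernetes.io/name=dask-kubernetes-operator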

jacobtomlinson commented 3 weeks ago

I'm seeing the default --all-namespaces in the args instead of --namespace=tst. So I don't think you're setting the config correctly.
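
Once the value is applied correctly, the container args should look something like this instead (a sketch based on your manifest above):

args:
  - --liveness=http://0.0.0.0:8080/healthz
  - --namespace=tst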

jacobtomlinson commented 3 weeks ago

Oh yeah, it looks like you have a typo, kopOfArgs, in your command.

jacobtomlinson commented 3 weeks ago

No, I'm referring to the --set kopOfArgs="{--namespace=tst}". You have an extra O in there. It should be --set kopfArgs="{--namespace=tst}".
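
For reference, the full corrected install command from your example would be:

helm install -n tst --generate-name dask/dask-kubernetes-operator \
  --set rbac.cluster=false \
  --set kopfArgs="{--namespace=tst}"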

willyyang commented 3 weeks ago

Thanks!

jacobtomlinson commented 3 weeks ago

Glad you got things working @willyyang! In future, don't feel like you need to delete comments; it might be useful for future readers to see us working through the debugging steps 😃.