dask / dask-kubernetes

Native Kubernetes integration for Dask
https://kubernetes.dask.org
BSD 3-Clause "New" or "Revised" License

dask-kubernetes-operator-role-cluster clusterrole does not have the needed ACL against pods/portforward resource #909

Open oe-hbk opened 1 week ago

oe-hbk commented 1 week ago

Describe the issue: The dask-kubernetes-operator pod logs a 403 Forbidden error when it calls the Kubernetes API. Its cluster role does not appear to grant the permissions it needs.

[2024-10-08 21:48:24,704] httpx                [INFO    ] HTTP Request: GET https://10.233.0.1/api/v1/namespaces/MYNAMESPACE/pods/MYPOD/portforward?name=MYPOD&namespace=MYNAMESPACE&ports=80&_preload_content=false " HTTP/1.1 403 Forbidden"

Exec'ing into the operator pod and making the same call against the API returns the same error:

kubectl exec -it -n dask-system dask-kubernetes-operator-78d4b784cf-4r455 -- sh

$ SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
$ NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
$ TOKEN=$(cat ${SERVICEACCOUNT}/token)
$ CACERT=${SERVICEACCOUNT}/ca.crt
$ curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET 'https://10.233.0.1/api/v1/namespaces/MYNAMESPACE/pods/MYPOD/portforward?name=MYPOD&namespace=MYNAMESPACE&ports=80&_preload_content=false'
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "pods \"MYPOD\" is forbidden: User \"system:serviceaccount:dask-system:dask-kubernetes-operator\" cannot get resource \"pods/portforward\" in API group \"\" in the namespace \"MYNAMESPACE\"",
  "reason": "Forbidden",
  "details": {
    "name": "MYPOD",
    "kind": "pods"
  },
  "code": 403
}

Editing the clusterrole,

$ kubectl edit clusterrole -n dask-system dask-kubernetes-operator-role-cluster

And adding pods/portforward

Around https://github.com/dask/dask-kubernetes/blob/ab1be696d03a8963f0db120e0de993f3eda12930/dask_kubernetes/operator/deployment/helm/dask-kubernetes-operator/templates/clusterrole.yaml#L34

and restarting the application pod corrected the problem.
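For anyone wondering why an existing rule on pods does not already cover this: Kubernetes RBAC treats a subresource such as pods/portforward as its own resource name, so it has to be listed explicitly. Below is a minimal Python sketch of that matching logic. It is a simplified model for illustration only, not the real Kubernetes authorizer, and the rule sets (`rules_before`, `rules_after`) are hypothetical rather than copied from the chart. The `get` verb is taken from the error message above.

```python
# Simplified model of RBAC rule matching (illustration only, not the real
# Kubernetes authorizer). Only exact names and the "*" wildcard are handled.

def rule_allows(rule, verb, resource, api_group=""):
    """Return True if a single rule grants `verb` on `resource`."""
    def matches(allowed, wanted):
        return "*" in allowed or wanted in allowed

    return (
        matches(rule.get("apiGroups", []), api_group)
        and matches(rule.get("resources", []), resource)
        and matches(rule.get("verbs", []), verb)
    )


def role_allows(rules, verb, resource, api_group=""):
    """Return True if any rule in the role grants the request."""
    return any(rule_allows(r, verb, resource, api_group) for r in rules)


# Hypothetical rule set: pods are covered, the portforward subresource is not.
rules_before = [
    {"apiGroups": [""], "resources": ["pods", "pods/status"],
     "verbs": ["get", "list", "watch"]},
]

# Same role with an explicit entry for the subresource named in the 403.
rules_after = rules_before + [
    {"apiGroups": [""], "resources": ["pods/portforward"], "verbs": ["get"]},
]

print(role_allows(rules_before, "get", "pods/portforward"))  # False
print(role_allows(rules_after, "get", "pods/portforward"))   # True
```

The first check fails even though `pods` is listed, which mirrors the 403 in the report. The second passes once the subresource is named.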

Environment:

jacobtomlinson commented 1 week ago

Thanks for raising this. I wouldn't necessarily expect the controller Pod to be opening port forwards to the scheduler Pods, so there may be a deeper issue here. Generally the controller attempts to connect directly to the scheduler Pod. That may be failing for some reason, in which case it falls back to a port forward.

Could you check your logs for other failing connection messages?
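To illustrate the direct-then-fallback behaviour described above, here is a rough Python sketch of the general pattern. It is not the operator's actual code: `connect_direct`, `connect_via_port_forward`, `connect_with_fallback` and the `scheduler:8786` address are hypothetical stand-ins, and the direct attempt is hard-coded to fail so the fallback path is exercised.

```python
import asyncio


async def connect_direct(address):
    """Hypothetical stand-in for a direct connection to the scheduler Pod."""
    raise ConnectionError(f"direct connection to {address} failed")


async def connect_via_port_forward(address):
    """Hypothetical stand-in for a connection through a port forward."""
    return f"port-forward:{address}"


async def connect_with_fallback(address, strategies):
    """Try each strategy in order and return the first successful result."""
    errors = []
    for strategy in strategies:
        try:
            return await strategy(address)
        except (ConnectionError, PermissionError) as exc:
            # Record the failure so it is still visible if a later step works.
            errors.append(f"{strategy.__name__}: {exc}")
    raise ConnectionError("all strategies failed: " + "; ".join(errors))


result = asyncio.run(
    connect_with_fallback(
        "scheduler:8786", [connect_direct, connect_via_port_forward]
    )
)
print(result)  # port-forward:scheduler:8786
```

With this shape, a permissions error on the last step only shows up after an earlier step has already failed, so the earlier failure is worth finding in the logs.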

oe-hbk commented 1 week ago

Thanks @jacobtomlinson.

The following was also seen in the operator pod log:

[2024-10-08 21:46:04,848] kopf.objects         [ERROR   ] [MYNAMESPACE/MYPOD_SHORTNAME_autoscaler] Timer 'daskautoscaler_adapt' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 812, in daskautoscaler_adapt
    desired_workers = await get_desired_workers(
  File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 520, in get_desired_workers
    async with session.get(url) as resp:
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client.py", line 1197, in __aenter__
    self._resp = await self._coro
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client.py", line 608, in _request
    await resp.start(conn)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 976, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/usr/local/lib/python3.10/site-packages/aiohttp/streams.py", line 640, in read
    await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected

jacobtomlinson commented 1 week ago

Yeah I'm not surprised by that one. We have three levels of fallback when communicating with the scheduler:

Your initial message is failing on that last step. But I'm curious why the middle step is failing at all.