Open oe-hbk opened 1 week ago
Thanks for raising this. I wouldn't necessarily expect the controller Pod to be opening port forwards to the scheduler Pods, so there may be a deeper issue going on. Generally the controller attempts to connect directly to the scheduler Pod; that may be failing for some reason, so it is falling back to a port forward.
Could you check your logs for other failing connection messages?
Thanks @jacobtomlinson.
The following was also seen in the operator pod log:
[2024-10-08 21:46:04,848] kopf.objects [ERROR ] [MYNAMESPACE/MYPOD_SHORTNAME_autoscaler] Timer 'daskautoscaler_adapt' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs) # type: ignore
  File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 812, in daskautoscaler_adapt
    desired_workers = await get_desired_workers(
  File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 520, in get_desired_workers
    async with session.get(url) as resp:
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client.py", line 1197, in __aenter__
    self._resp = await self._coro
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client.py", line 608, in _request
    await resp.start(conn)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 976, in start
    message, payload = await protocol.read() # type: ignore[union-attr]
  File "/usr/local/lib/python3.10/site-packages/aiohttp/streams.py", line 640, in read
    await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
Yeah I'm not surprised by that one. We have three levels of fallback when communicating with the scheduler:
aiohttp
error above)
Your initial message is failing on that last step. But I'm curious why the middle step is failing at all.
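For anyone following along, the general shape of a fallback chain like this can be sketched as below. This is only an illustration of the "try each approach in order" pattern, not the controller's actual code: `first_successful`, `direct_http`, `second_attempt` and `port_forward` are hypothetical names, and the stand-in functions simply simulate two failures followed by a success.

```python
import asyncio


async def first_successful(strategies):
    """Try each (name, coroutine function) pair in order; return the first result."""
    errors = []
    for name, strategy in strategies:
        try:
            return name, await strategy()
        except Exception as exc:  # broad on purpose: any failure triggers the next fallback
            errors.append((name, exc))
    raise RuntimeError(f"all strategies failed: {errors!r}")


# Hypothetical stand-ins for the real connection attempts.
async def direct_http():
    raise ConnectionError("Server disconnected")


async def second_attempt():
    raise TimeoutError("no response")


async def port_forward():
    return 3  # pretend the scheduler reported 3 desired workers


result = asyncio.run(
    first_successful(
        [
            ("http", direct_http),
            ("second", second_attempt),
            ("port-forward", port_forward),
        ]
    )
)
print(result)
```

With a chain like this, an error in the last step only surfaces after every earlier step has already failed, which is why the earlier failures are worth looking for in the logs.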
Describe the issue: The dask-kubernetes-operator pod shows a 403 Forbidden error when trying to access the Kubernetes API. It does not seem to have the right ClusterRole permissions.
Execcing into the pod and trying the same call against the API.
Editing the ClusterRole to add pods/portforward around https://github.com/dask/dask-kubernetes/blob/ab1be696d03a8963f0db120e0de993f3eda12930/dask_kubernetes/operator/deployment/helm/dask-kubernetes-operator/templates/clusterrole.yaml#L34 and restarting the application pod corrected the problem.
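To illustrate why the extra entry matters: in Kubernetes RBAC a rule on pods does not cover the pods/portforward subresource, which has to be listed explicitly. Below is a rough sketch that checks a list of ClusterRole-style rules for a given resource and verb. The helper `rule_allows`, the example rules and the `create` verb are assumptions made for illustration (the exact rules in the chart and the verb the operator needs are not stated here), and the matching is a simplification of what the API server actually does.

```python
def rule_allows(rules, resource, verb, api_group=""):
    """Return True if any rule grants `verb` on `resource` in `api_group`.

    Simplified matching: only exact names and the "*" wildcard are handled;
    resourceNames and forms such as "pods/*" are ignored.
    """
    for rule in rules:
        groups = rule.get("apiGroups", [])
        resources = rule.get("resources", [])
        verbs = rule.get("verbs", [])
        if (
            (api_group in groups or "*" in groups)
            and (resource in resources or "*" in resources)
            and (verb in verbs or "*" in verbs)
        ):
            return True
    return False


# Hypothetical rules, not copied from the Helm chart.
before = [
    {"apiGroups": [""], "resources": ["pods", "pods/status"], "verbs": ["get", "list", "watch"]},
]
# Assumed fix: an extra rule for the subresource; the verb is an assumption.
after = before + [
    {"apiGroups": [""], "resources": ["pods/portforward"], "verbs": ["create"]},
]

print(rule_allows(before, "pods/portforward", "create"))
print(rule_allows(after, "pods/portforward", "create"))
```

The first check fails even though pods is listed, and the second passes once the subresource has its own entry.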
Environment: