To provide some extra information: it seems the operator tries three times to get the name of the worker/deployment to remove.
From the logs we could see that the first two attempts failed, which was a bit unexpected given that the operator can scale the workers up without issue. We added some parameters to the operator to enable debug logging:
helm install --repo https://helm.dask.org --create-namespace -n dask-operator dask-kubernetes-operator dask-kubernetes-operator --set kopfArgs="{--all-namespaces,--verbose,--debug}"
and with that we could see some 404 responses (it would be useful if the log showed which request returned them). After digging through the issues here, https://github.com/dask/dask-kubernetes/issues/807 shed some light on the problem: distributed.http.scheduler.api needs to be added to the distributed.scheduler.http.routes Dask config. So we added it to the config map as:
# config map settings applied to the dask-cluster
distributed:
  scheduler:
    http:
      routes:
        - distributed.http.scheduler.prometheus
        - distributed.http.scheduler.info
        - distributed.http.scheduler.json
        - distributed.http.health
        - distributed.http.proxy
        - distributed.http.statics
        - distributed.http.scheduler.api
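To double-check that the extra route is actually being served after the scheduler restart, the scheduler's HTTP API can be hit directly (from inside the scheduler pod or through a port-forward). A minimal sketch, assuming the default dashboard port 8787 and the /api/v1/adaptive_target endpoint that distributed exposes when distributed.http.scheduler.api is in the routes list:

# Minimal sanity check; both the port (8787) and the /api/v1/adaptive_target route
# are assumptions, adjust them to your deployment.
import httpx

resp = httpx.get("http://localhost:8787/api/v1/adaptive_target")
# A 200 with a small JSON body means the extra route was picked up;
# a 404 here means the scheduler is still running with the old routes list.
print(resp.status_code, resp.text)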
We then recreated the scheduler, and this time the first HTTP call to get the workers to retire returned the right name (which appears to be the value of the DASK_WORKER_NAME env var: the workers show up under that name in the dashboard, i.e. it matches the Deployment name). The workers are then removed once all their tasks have been computed:
[2024-10-15 15:40:04,912] kopf.objects [INFO ] [my-namespace/dask-autoscaler] Autoscaler updated dask-cluster worker count from 2 to 1
[2024-10-15 15:40:04,912] kopf.objects [INFO ] [my-namespace/dask-autoscaler] Timer 'daskautoscaler_adapt' succeeded.
[2024-10-15 15:40:04,997] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?fieldSelector=metadata.name%3Ddask-cluster "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,022] httpx [INFO ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments?labelSelector=dask.org%2Fworkergroup-name%3Ddask-cluster-default "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,034] kopf.objects [INFO ] [my-namespace/dask-cluster-default] Scaled worker group dask-cluster-default up to 1 workers.
[2024-10-15 15:40:05,041] httpx [INFO ] HTTP Request: GET https://10.0.0.1/api/v1/namespaces/my-namespace/services?fieldSelector=metadata.name%3Ddask-cluster-scheduler "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,057] kopf.objects [INFO ] [my-namespace/dask-cluster-default] Retired workers {'tcp://172.18.69.197:34793': {'type': 'Worker', 'id': 'dask-cluster-default-worker-9e4e522e22', 'host': '172.18.69.197', 'resources': {}, 'local_directory': '/tmp/dask-scratch-space/worker-20y99qa3', 'name': 'dask-cluster-default-worker-9e4e522e22', 'nthreads': 1, 'memory_limit': 12000000000, 'last_seen': 1729006804.7547565, 'services': {'dashboard': 44215}, 'metrics': {'task_counts': {}, 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}, 'digests_total_since_heartbeat': {'tick-duration': 0.5005748271942139, 'latency': 0.0019073486328125}, 'managed_bytes': 0, 'spilled_bytes': {'memory': 0, 'disk': 0}, 'transfer': {'incoming_bytes': 0, 'incoming_count': 0, 'incoming_count_total': 12, 'outgoing_bytes': 0, 'outgoing_count': 0, 'outgoing_count_total': 36}, 'event_loop_interval': 0.020009407997131346, 'cpu': 4.0, 'memory': 187879424, 'time': 1729006804.256867, 'host_net_io': {'read_bps': 285.6785263997705, 'write_bps': 1480.334182253356}, 'host_disk_io': {'read_bps': 8182.791917017202, 'write_bps': 270032.1332615676}, 'num_fds': 22}, 'status': 'closed', 'nanny': 'tcp://172.18.69.197:41727'}}
[2024-10-15 15:40:05,058] kopf.objects [INFO ] [my-namespace/dask-cluster-default] Workers to close: ['dask-cluster-default-worker-9e4e522e22']
[2024-10-15 15:40:05,067] httpx [INFO ] HTTP Request: DELETE https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments/dask-cluster-default-worker-9e4e522e22 "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,067] kopf.objects [INFO ] [my-namespace/dask-cluster-default] Scaled worker group dask-cluster-default down to 1 workers.
[2024-10-15 15:40:05,068] kopf.objects [INFO ] [my-namespace/dask-cluster-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' succeeded.
[2024-10-15 15:40:05,068] kopf.objects [INFO ] [my-namespace/dask-cluster-default] Updating is processed: 1 succeeded; 0 failed.
[2024-10-15 15:40:07,830] kopf.objects [INFO ] [my-namespace/dask-cluster] Timer 'daskcluster_autoshutdown' succeeded.
Is adding distributed.http.scheduler.api the correct setting to get the downscaling part of the autoscaler working? It wasn't required for scaling up (workers are created correctly).
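For context on why that route matters for scaling down: the flow is essentially to ask the scheduler which workers to retire (the names it returns come from DASK_WORKER_NAME and match the worker Deployment names) and then delete the Deployments with those names. A rough sketch of that flow, not the operator's actual code, assuming the scheduler service resolves as dask-cluster-scheduler on port 8787 and that the API route above is enabled:

# Illustration only, not the operator's code. Assumes distributed.http.scheduler.api is
# enabled and the scheduler service is reachable as http://dask-cluster-scheduler:8787.
import httpx
from kr8s.objects import Deployment

api = "http://dask-cluster-scheduler:8787/api/v1"

# Ask the scheduler to retire one worker; the returned entries carry the worker
# names (DASK_WORKER_NAME), which are the same as the worker Deployment names.
resp = httpx.post(f"{api}/retire_workers", json={"n": 1})
resp.raise_for_status()

for info in resp.json().values():
    # e.g. info["name"] == "dask-cluster-default-worker-9e4e522e22"; deleting a
    # Deployment by the Pod name instead is what produces the 404 in the issue below.
    Deployment.get(info["name"], namespace="my-namespace").delete()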
There is an issue with the default settings from the docs: the operator tries to delete a Kubernetes Deployment using the wrong name and therefore cannot find it. Specifically, it tries to delete a Deployment named after the worker Pod (the Deployment name plus the ReplicaSet/Pod hash suffixes), and no Deployment with that name exists.
Reproducing steps:
helm install --repo https://helm.dask.org --create-namespace -n dask-operator --generate-name dask-kubernetes-operator
i.e. this quick start step from the docs. When the operator tries to remove a worker, it fails with:

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 336, in delete
    async with self.api.call_api(
  File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 186, in call_api
    raise ServerError(
kr8s._exceptions.ServerError: deployments.apps "simple-default-worker-057ae426b6-79bcbdb84b-vlcn7" not found
kubectl get pods -l dask.org/cluster-name=simple
NAME                                                READY   STATUS    RESTARTS   AGE
simple-default-worker-057ae426b6-79bcbdb84b-vlcn7   1/1     Running   0          9m36s
simple-default-worker-54afdedac5-6bdb8f746b-7lzsg   1/1     Running   0          9m36s
simple-scheduler-78db7fbfd8-zmwgr                   1/1     Running   0          9m36s
kubectl get deployments -l dask.org/cluster-name=simple
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
simple-default-worker-057ae426b6   1/1     1            1           15m
simple-default-worker-54afdedac5   1/1     1            1           15m
simple-scheduler                   1/1     1            1           15m
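The listings above show the mismatch: each worker Pod name is its owning Deployment name with the ReplicaSet and Pod hashes appended by Kubernetes (simple-default-worker-057ae426b6 vs simple-default-worker-057ae426b6-79bcbdb84b-vlcn7), so a Deployment delete by the Pod name can only return 404. The same comparison can be made with kr8s, the client the operator uses; a minimal sketch, assuming kubeconfig access to the namespace where the cluster runs:

# Minimal sketch, assuming kr8s can reach the cluster via kubeconfig and the
# DaskCluster lives in the currently selected namespace.
import kr8s

selector = {"dask.org/cluster-name": "simple"}

for deploy in kr8s.get("deployments", label_selector=selector):
    print("deployment:", deploy.name)  # e.g. simple-default-worker-057ae426b6

for pod in kr8s.get("pods", label_selector=selector):
    # e.g. simple-default-worker-057ae426b6-79bcbdb84b-vlcn7 -- passing this full Pod
    # name to a Deployment delete is what raises the ServerError in the traceback above.
    print("pod:", pod.name)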