dask / dask-kubernetes

Native Kubernetes integration for Dask
https://kubernetes.dask.org
BSD 3-Clause "New" or "Revised" License

Operator unable to delete Kubernetes Deployment #910

Open thaisarcanjo-ow opened 1 month ago

thaisarcanjo-ow commented 1 month ago

With the default settings from the docs, the Operator tries to delete a Kubernetes Deployment using the wrong name and therefore cannot find it. It tries to delete a Deployment named after the worker Pod, and no such Deployment exists.

Reproducing steps:

  1. Install the Operator with `helm install --repo https://helm.dask.org --create-namespace -n dask-operator --generate-name dask-kubernetes-operator`, i.e. the quick start step.
  2. Create the cluster using the default YAML from the guide, unchanged. At this stage, two workers are available from two Deployments.
  3. Create an autoscaler with the minimum number of workers set to 0:

    # autoscaler.yaml
    apiVersion: kubernetes.dask.org/v1
    kind: DaskAutoscaler
    metadata:
      name: simple
    spec:
      cluster: "simple"
      minimum: 0  # we recommend always having a minimum of 1 worker so that an idle cluster can start working on tasks immediately
      maximum: 10 # you can place a hard limit on the number of workers regardless of what the scheduler requests

  4. Apply the autoscaler settings:

    kubectl apply -f autoscaler.yaml
    daskautoscaler.kubernetes.dask.org/simple created

  5. At this stage, the operator already tries to remove a worker, but it attempts to delete a Deployment resource named after the Pod, which doesn't exist:
    
    [2024-10-14 09:22:42,559] kopf.objects         [INFO    ] [default/simple] Autoscaler updated simple worker count from 2 to 1
    [2024-10-14 09:22:42,559] kopf.objects         [INFO    ] [default/simple] Timer 'daskautoscaler_adapt' succeeded.
    [2024-10-14 09:22:42,662] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/apis/kubernetes.dask.org/v1/namespaces/default/daskclusters?fieldSelector=metadata.name%3Dsimple "HTTP/1.1 200 OK"
    [2024-10-14 09:22:42,668] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/apis/apps/v1/namespaces/default/deployments?labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
    [2024-10-14 09:22:42,673] kopf.objects         [INFO    ] [default/simple-default] Scaled worker group simple-default up to 1 workers.
    [2024-10-14 09:22:42,677] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/services?fieldSelector=metadata.name%3Dsimple-scheduler "HTTP/1.1 200 OK"
    [2024-10-14 09:22:42,687] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/services?fieldSelector=metadata.name%3Dsimple-scheduler "HTTP/1.1 200 OK"
    [2024-10-14 09:22:42,693] kopf.objects         [WARNING ] [default/simple-default] Scaling simple-default failed via the HTTP API and the Dask RPC, falling back to LIFO scaling. This can result in lost data, see https://kubernetes.dask.org/en/latest/operator_troubleshooting.html.
    [2024-10-14 09:22:42,697] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/pods?labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
    [2024-10-14 09:22:42,701] kopf.objects         [INFO    ] [default/simple-default] Workers to close: ['simple-default-worker-057ae426b6-79bcbdb84b-vlcn7']
    [2024-10-14 09:22:42,705] httpx                [INFO    ] HTTP Request: DELETE https://10.96.0.1/apis/apps/v1/namespaces/default/deployments/simple-default-worker-057ae426b6-79bcbdb84b-vlcn7 "HTTP/1.1 404 Not Found"
    [2024-10-14 09:22:42,705] kopf.objects         [ERROR   ] [default/simple-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' failed with an exception. Will retry.
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 168, in call_api
        response.raise_for_status()
      File "/usr/local/lib/python3.10/site-packages/httpx/_models.py", line 763, in raise_for_status
        raise HTTPStatusError(message, request=request, response=self)
    httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://10.96.0.1/apis/apps/v1/namespaces/default/deployments/simple-default-worker-057ae426b6-79bcbdb84b-vlcn7'
    For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 336, in delete
        async with self.api.call_api(
      File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
        return await anext(self.gen)
      File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 186, in call_api
        raise ServerError(
    kr8s._exceptions.ServerError: deployments.apps "simple-default-worker-057ae426b6-79bcbdb84b-vlcn7" not found


If I check the pods, the name `simple-default-worker-057ae426b6-79bcbdb84b-vlcn7` of the deployment it tried to delete indeed exists, but as a worker pod:

    kubectl get pods -l dask.org/cluster-name=simple
    NAME                                                READY   STATUS    RESTARTS   AGE
    simple-default-worker-057ae426b6-79bcbdb84b-vlcn7   1/1     Running   0          9m36s
    simple-default-worker-54afdedac5-6bdb8f746b-7lzsg   1/1     Running   0          9m36s
    simple-scheduler-78db7fbfd8-zmwgr                   1/1     Running   0          9m36s


However, the deployment name that controls this pod has a different name:

    kubectl get deployments -l dask.org/cluster-name=simple
    NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
    simple-default-worker-057ae426b6   1/1     1            1           15m
    simple-default-worker-54afdedac5   1/1     1            1           15m
    simple-scheduler                   1/1     1            1           15m



As you can see, the deployment that controls that worker pod is actually named `simple-default-worker-057ae426b6`, not `simple-default-worker-057ae426b6-79bcbdb84b-vlcn7`. As a result, the operator is unable to delete the deployments and the workers are never removed from the namespace. The cause could be [this line](https://github.com/dask/dask-kubernetes/blob/main/dask_kubernetes/operator/controller/controller.py#L709), where the deletion uses the worker name as the expected Deployment name.
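
To make the mismatch concrete, here is a minimal sketch (not the operator's code) that recovers the Deployment name from a Pod name. It assumes the usual Kubernetes convention that a Deployment-managed Pod is named `<deployment>-<replicaset-hash>-<pod-suffix>`. The helper name `deployment_name_from_pod` is hypothetical, and reading the Pod's owner references or labels would likely be more robust than parsing names.

```python
def deployment_name_from_pod(pod_name: str) -> str:
    """Strip the ReplicaSet hash and Pod suffix from a Deployment-managed Pod name.

    Assumes the conventional `<deployment>-<replicaset-hash>-<pod-suffix>` layout.
    This is an illustration only, not how dask-kubernetes resolves names.
    """
    parts = pod_name.rsplit("-", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected pod name: {pod_name!r}")
    return parts[0]


# Pod name taken from the operator log in this report.
pod = "simple-default-worker-057ae426b6-79bcbdb84b-vlcn7"
print(deployment_name_from_pod(pod))  # simple-default-worker-057ae426b6
```

For the Pod in the log, this yields the name that `kubectl get deployments` actually reports.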

**Anything else we need to know?**:
This may be related to #855

**Environment**:
- Dask version: 2024.9.1
- Python version: 3.11
- Operating System: Mac/Linux
- Install method (conda, pip, source): pip

thaisarcanjo-ow commented 1 month ago

To provide some extra information: it seems the operator tries three methods, in order, to work out which worker/deployment to remove:

  1. Dashboard http here
  2. Dask RPC here
  3. Kubernetes API here (I think the fallback option should not be Pods but Deployments here)
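
The order of attempts can be sketched roughly as below. This is only an illustration of a try-in-order fallback, not the operator's implementation: `pick_workers_to_close` and the three stand-in functions are hypothetical names, and the Deployment names in the last one are placeholders copied from the `kubectl` output above.

```python
from typing import Callable, Iterable, List

Strategy = Callable[[int], List[str]]


def pick_workers_to_close(strategies: Iterable[Strategy], n: int) -> List[str]:
    """Return the result of the first strategy that does not raise."""
    last_error = None
    for strategy in strategies:
        try:
            return strategy(n)
        except Exception as exc:  # real code would catch narrower errors
            last_error = exc
    raise RuntimeError("all strategies failed") from last_error


# Hypothetical stand-ins for the three lookups listed above.
def via_http_api(n: int) -> List[str]:
    raise ConnectionError("404 from the scheduler HTTP API")


def via_rpc(n: int) -> List[str]:
    raise ConnectionError("Dask RPC unavailable")


def via_lifo(n: int) -> List[str]:
    # Placeholder list of Deployment names (not Pod names).
    deployments = [
        "simple-default-worker-54afdedac5",
        "simple-default-worker-057ae426b6",
    ]
    return deployments[:n]


print(pick_workers_to_close([via_http_api, via_rpc, via_lifo], 1))
```

In this sketch the last fallback returns Deployment names, which is what the later DELETE call needs.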

From the logs, we see the first two failed, which was a bit unexpected given the operator can scale up the workers. We added to the operator some params to get the debug logs with

    helm install --repo https://helm.dask.org --create-namespace -n dask-operator dask-kubernetes-operator dask-kubernetes-operator --set kopfArgs="{--all-namespaces,--verbose,--debug}"

and could see that there were some 404s in the response body (it would be useful to see which request they came from). After digging through the issues here, https://github.com/dask/dask-kubernetes/issues/807 shed some light: it suggests adding distributed.http.scheduler.api to the distributed.scheduler.http.routes Dask config, so we added that to the config map as:

    # config map settings applied to the dask-cluster
    distributed:
      scheduler:
        http:
          routes:
          - distributed.http.scheduler.prometheus
          - distributed.http.scheduler.info
          - distributed.http.scheduler.json
          - distributed.http.health
          - distributed.http.proxy
          - distributed.http.statics
          - distributed.http.scheduler.api 

We then recreated the scheduler. After that, the first HTTP call for the workers to retire appears to return the right name. That name seems to be the value of the env var DASK_WORKER_NAME, since the dashboard shows the workers under that name, i.e. matching the Deployment name. The workers are then removed once all tasks have been computed:

[2024-10-15 15:40:04,912] kopf.objects         [INFO    ] [my-namespace/dask-autoscaler] Autoscaler updated dask-cluster worker count from 2 to 1
[2024-10-15 15:40:04,912] kopf.objects         [INFO    ] [my-namespace/dask-autoscaler] Timer 'daskautoscaler_adapt' succeeded.
[2024-10-15 15:40:04,997] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?fieldSelector=metadata.name%3Ddask-cluster "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,022] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments?labelSelector=dask.org%2Fworkergroup-name%3Ddask-cluster-default "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,034] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Scaled worker group dask-cluster-default up to 1 workers.
[2024-10-15 15:40:05,041] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/api/v1/namespaces/my-namespace/services?fieldSelector=metadata.name%3Ddask-cluster-scheduler "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,057] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Retired workers {'tcp://172.18.69.197:34793': {'type': 'Worker', 'id': 'dask-cluster-default-worker-9e4e522e22', 'host': '172.18.69.197', 'resources': {}, 'local_directory': '/tmp/dask-scratch-space/worker-20y99qa3', 'name': 'dask-cluster-default-worker-9e4e522e22', 'nthreads': 1, 'memory_limit': 12000000000, 'last_seen': 1729006804.7547565, 'services': {'dashboard': 44215}, 'metrics': {'task_counts': {}, 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}, 'digests_total_since_heartbeat': {'tick-duration': 0.5005748271942139, 'latency': 0.0019073486328125}, 'managed_bytes': 0, 'spilled_bytes': {'memory': 0, 'disk': 0}, 'transfer': {'incoming_bytes': 0, 'incoming_count': 0, 'incoming_count_total': 12, 'outgoing_bytes': 0, 'outgoing_count': 0, 'outgoing_count_total': 36}, 'event_loop_interval': 0.020009407997131346, 'cpu': 4.0, 'memory': 187879424, 'time': 1729006804.256867, 'host_net_io': {'read_bps': 285.6785263997705, 'write_bps': 1480.334182253356}, 'host_disk_io': {'read_bps': 8182.791917017202, 'write_bps': 270032.1332615676}, 'num_fds': 22}, 'status': 'closed', 'nanny': 'tcp://172.18.69.197:41727'}}
[2024-10-15 15:40:05,058] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Workers to close: ['dask-cluster-default-worker-9e4e522e22']
[2024-10-15 15:40:05,067] httpx                [INFO    ] HTTP Request: DELETE https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments/dask-cluster-default-worker-9e4e522e22 "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,067] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Scaled worker group dask-cluster-default down to 1 workers.
[2024-10-15 15:40:05,068] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' succeeded.
[2024-10-15 15:40:05,068] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Updating is processed: 1 succeeded; 0 failed.
[2024-10-15 15:40:07,830] kopf.objects         [INFO    ] [my-namespace/dask-cluster] Timer 'daskcluster_autoshutdown' succeeded.

Is adding distributed.http.scheduler.api the correct way to get the downscale part of the autoscaler working? It wasn't required for the scale-up part (workers are created correctly).
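
One way to check whether the scheduler is actually serving the API route is to probe it over HTTP. The sketch below rests on assumptions: that the route, when enabled, exposes a path `/api/v1/get_workers` on the scheduler's HTTP server, and that a 404 there means the route is not loaded. The function name `scheduler_api_enabled` is hypothetical, and any authentication the scheduler may require is ignored.

```python
import urllib.error
import urllib.request


def scheduler_api_enabled(base_url: str, timeout: float = 5.0) -> bool:
    """Return False if <base_url>/api/v1/get_workers answers 404, True on success.

    The path is an assumption about the scheduler HTTP API; other HTTP errors
    and connection failures are re-raised rather than interpreted.
    """
    url = base_url.rstrip("/") + "/api/v1/get_workers"
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```

With the scheduler's HTTP port forwarded locally (for example to `http://localhost:8787`, assuming the default dashboard port), calling `scheduler_api_enabled("http://localhost:8787")` before and after the config change would show whether the route made the difference.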