dask / dask-gateway

A multi-tenant server for securely deploying and managing Dask clusters.
https://gateway.dask.org/
BSD 3-Clause "New" or "Revised" License
136 stars 88 forks

GatewayCluster throws 'Method Not Allowed' #622

Open tiborkiss opened 2 years ago

tiborkiss commented 2 years ago

I have an EKS-based Dask setup which was working fine two weeks ago. Yesterday, when I returned to continue my work, calling GatewayCluster() threw ClientResponseError: 405, message='Method Not Allowed', url=URL('http://proxy-public/services/dask-gateway/api/v1/clusters/')

Minimal Complete Verifiable Example:

import dask
import dask.array as da

from dask.distributed import performance_report, progress
from dask_gateway import GatewayCluster

cluster = GatewayCluster(worker_cores=0.8, worker_memory=3.3)

which throws this traceback:

---------------------------------------------------------------------------
ClientResponseError                       Traceback (most recent call last)
Input In [7], in <cell line: 3>()
      1 # Specifying a bit less than whole number values for cpu cores and memory allows Dask worker pods to be packed more tightly onto 
      2 # the underlying EC2 instances.
----> 3 cluster = GatewayCluster(worker_cores=0.8, worker_memory=3.3)
      4 cluster.scale(8)
      5 client = cluster.get_client()

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask_gateway/client.py:815, in GatewayCluster.__init__(self, address, proxy_address, public_address, auth, cluster_options, shutdown_on_close, asynchronous, loop, **kwargs)
    803 def __init__(
    804     self,
    805     address=None,
   (...)
    813     **kwargs,
    814 ):
--> 815     self._init_internal(
    816         address=address,
    817         proxy_address=proxy_address,
    818         public_address=public_address,
    819         auth=auth,
    820         cluster_options=cluster_options,
    821         cluster_kwargs=kwargs,
    822         shutdown_on_close=shutdown_on_close,
    823         asynchronous=asynchronous,
    824         loop=loop,
    825     )

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask_gateway/client.py:920, in GatewayCluster._init_internal(self, address, proxy_address, public_address, auth, cluster_options, cluster_kwargs, shutdown_on_close, asynchronous, loop, name)
    918     self.status = "starting"
    919 if not self.asynchronous:
--> 920     self.gateway.sync(self._start_internal)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask_gateway/client.py:344, in Gateway.sync(self, func, *args, **kwargs)
    340 future = asyncio.run_coroutine_threadsafe(
    341     func(*args, **kwargs), self.loop.asyncio_loop
    342 )
    343 try:
--> 344     return future.result()
    345 except BaseException:
    346     future.cancel()

File /srv/conda/envs/notebook/lib/python3.9/concurrent/futures/_base.py:446, in Future.result(self, timeout)
    444     raise CancelledError()
    445 elif self._state == FINISHED:
--> 446     return self.__get_result()
    447 else:
    448     raise TimeoutError()

File /srv/conda/envs/notebook/lib/python3.9/concurrent/futures/_base.py:391, in Future.__get_result(self)
    389 if self._exception:
    390     try:
--> 391         raise self._exception
    392     finally:
    393         # Break a reference cycle with the exception in self._exception
    394         self = None

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask_gateway/client.py:934, in GatewayCluster._start_internal(self)
    932     self._start_task = asyncio.ensure_future(self._start_async())
    933 try:
--> 934     await self._start_task
    935 except BaseException:
    936     # On exception, cleanup
    937     await self._stop_internal()

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask_gateway/client.py:947, in GatewayCluster._start_async(self)
    945 if self.status == "created":
    946     self.status = "starting"
--> 947     self.name = await self.gateway._submit(
    948         cluster_options=self._cluster_options, **self._cluster_kwargs
    949     )
    950 # Connect to cluster
    951 try:

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask_gateway/client.py:536, in Gateway._submit(self, cluster_options, **kwargs)
    534     options = self._config_cluster_options()
    535     options.update(kwargs)
--> 536 resp = await self._request("POST", url, json={"cluster_options": options})
    537 data = await resp.json()
    538 return data["name"]

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask_gateway/client.py:420, in Gateway._request(self, method, url, json)
    418         raise GatewayServerError(msg)
    419     else:
--> 420         resp.raise_for_status()
    421 else:
    422     return resp

File /srv/conda/envs/notebook/lib/python3.9/site-packages/aiohttp/client_reqrep.py:1004, in ClientResponse.raise_for_status(self)
   1002 assert self.reason is not None
   1003 self.release()
-> 1004 raise ClientResponseError(
   1005     self.request_info,
   1006     self.history,
   1007     status=self.status,
   1008     message=self.reason,
   1009     headers=self.headers,
   1010 )

ClientResponseError: 405, message='Method Not Allowed', url=URL('http://proxy-public/services/dask-gateway/api/v1/clusters/')

Two weeks ago this was working fine. The Kubernetes cluster starts fine, and I can also log in to JupyterLab. I did notice one significant difference compared to the run two weeks ago: when I open a terminal in JupyterLab, the aws client is no longer there. Two weeks ago it was.

Environment: Here is my daskhub.yaml; of course I removed the secrets.

jupyterhub:
  singleuser:
    extraAnnotations:
      iam.amazonaws.com/role: arn:aws:iam::<.....>:role/jupyter-notebook
    image:
      name: pangeo/pangeo-notebook
      tag: "2021.05.04"
    cpu:
      limit: 2
      guarantee: 1
    memory:
      limit: 4G
      guarantee: 2G
    cloudMetadata:
      blockWithIptables: false
    extraEnv:
      DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE: '{JUPYTER_IMAGE_SPEC}'
  proxy:
    secretToken: "<secret1.......>"
    https:
      enabled: false
      type: offload
    service:
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
        service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "3600"
  hub:
    config:
      Authenticator:
        admin_users:
          - admin
      DummyAuthenticator:
        password: <secret3.>
      JupyterHub:
        authenticator_class: dummy
    services:
      dask-gateway:
        apiToken: "<secret2.......>"

dask-gateway:
  gateway:
    backend:
      worker:
        extraPodConfig:
          nodeSelector:
            eks.amazonaws.com/capacityType: ON_DEMAND
    extraConfig:
      optionHandler: |
        from dask_gateway_server.options import Options, Integer, Float, String
        def option_handler(options):
            if ":" not in options.image:
                raise ValueError("When specifying an image you must also provide a tag")
            return {
                "worker_cores": options.worker_cores,
                "worker_memory": int(options.worker_memory * 2 ** 30),
                "image": options.image,
            }
        c.Backend.cluster_options = Options(
            Float("worker_cores", default=0.8, min=0.8, max=4.0, label="Worker Cores"),
            Float("worker_memory", default=3.3, min=1, max=8, label="Worker Memory (GiB)"),
            String("image", default="pangeo/base-notebook:2021.05.04", label="Image"),
            handler=option_handler,
        )
    auth:
      jupyterhub:
        apiToken: "<secret2.......>"

I tried with pangeo docker image version 2022.09.21, which I picked from https://github.com/pangeo-data/pangeo-docker-images/tags. Exactly the same result.
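As an aside, the `option_handler` in the config above converts `worker_memory` from GiB (a float) to whole bytes with `int(options.worker_memory * 2 ** 30)`. A minimal standalone sketch of that conversion (no dask-gateway required), just to make the units explicit:

```python
def gib_to_bytes(gib: float) -> int:
    """Convert a (possibly fractional) GiB value to whole bytes,
    mirroring the int(worker_memory * 2 ** 30) expression in option_handler."""
    return int(gib * 2 ** 30)

# The default worker_memory of 3.3 GiB becomes roughly 3.54 GB in bytes:
print(gib_to_bytes(3.3))  # 3543348019
```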

consideRatio commented 2 years ago

What version of the helm chart is installed? Is it the latest release, published yesterday?

Check by inspecting the labels on the dask-gateway pods, for example.
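For example, something like the following (assuming the chart is installed in the current namespace; add `-n <namespace>` otherwise):

```shell
# Show the labels on the dask-gateway pods; the helm.sh/chart and
# app.kubernetes.io/version labels reveal the installed chart version.
kubectl get pods -l app.kubernetes.io/name=dask-gateway --show-labels
```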

Oh, daskhub. Okay, hmm, then you should still be using the old version, which means I didn't break anything yesterday.

Hmm, I'm unsure what has gone wrong here, but at least a regression is ruled out.

tiborkiss commented 2 years ago

It is dask-gateway-2022.6.1:

app.kubernetes.io/component=traefik
app.kubernetes.io/instance=daskhub
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=dask-gateway
app.kubernetes.io/version=2022.6.1
gateway.dask.org/instance=daskhub-dask-gateway
helm.sh/chart=dask-gateway-2022.6.1
pod-template-hash=d7cc865bc

tiborkiss commented 2 years ago

In the traefik-daskhub-dask-gateway pod I see this in the log:

time="2022-10-14T09:19:04Z" level=info msg="Configuration loaded from flags."
time="2022-10-14T09:19:04Z" level=warning msg="Cross-namespace reference between IngressRoutes and resources is enabled, please ensure that this is expected (see AllowCrossNamespace option)" providerName=kubernetescrd
time="2022-10-14T09:19:04Z" level=error msg="subset not found for default/api-daskhub-dask-gateway" providerName=kubernetescrd namespace=default ingress=api-daskhub-dask-gateway
time="2022-10-14T09:19:06Z" level=error msg="subset not found for default/api-daskhub-dask-gateway" ingress=api-daskhub-dask-gateway namespace=default providerName=kubernetescrd
time="2022-10-14T09:34:03Z" level=error msg="subset not found for default/api-daskhub-dask-gateway" namespace=default providerName=kubernetescrd ingress=api-daskhub-dask-gateway

In the jupyter-admin pod:

[I 2022-10-14 09:22:16.339 SingleUserLabApp mixins:648] Starting jupyterhub-singleuser server version 2.3.1
[W 2022-10-14 09:22:16.344 SingleUserLabApp _version:68] jupyterhub version 1.5.0 != jupyterhub-singleuser version 2.3.1. This could cause failure to authenticate and result in redirect loops!
[I 2022-10-14 09:22:16.344 SingleUserLabApp serverapp:2726] Serving notebooks from local directory: /home/jovyan
[I 2022-10-14 09:22:16.344 SingleUserLabApp serverapp:2726] Jupyter Server 1.18.1 is running at:

Everything else looks normal.

consideRatio commented 2 years ago

Did you update the helm chart around the time you started observing this change?

You may have upgraded daskhub without adjusting to the breaking changes in dask-gateway, which was upgraded within daskhub at some point. See https://gateway.dask.org/changelog.html#id12.
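A quick way to see which chart versions are actually deployed, and which values the release was installed with (assuming the release is named `daskhub`; adjust to yours):

```shell
# List installed releases with their chart and app versions
helm list

# Show the user-supplied values of the daskhub release, to compare
# against the breaking changes listed in the dask-gateway changelog
helm get values daskhub
```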

tiborkiss commented 2 years ago

Since this is just a PoC, I recreated everything, including helm repo remove dask. I checked in the AWS console that everything was removed, including the JupyterHub image, then recreated from scratch. The helm version in my console is 3.9.0, etc. I don't think those "breaking changes" are the case here.

Anyway, thank you for the tips. I have to admit I am not a k8s or helm charts specialist, so any hints are helpful. I am just looking for a clean, repeatable solution, to capture where the risk points are for unintentionally breaking the system during development. So right now I am just zapping and recreating; later, the backend team will take over with continuous operations etc. Until now I had recreated everything from scratch probably 3 times with no issue; then after a one-week holiday I came back and now it does this.

Anyway, thank you for the tips. I have to admit that I am not a k8s and nor helm charts specialist, therefore any hints are helpful. I just search the clean, repeatable solution, to capture where are the risk points to break the system during development without intent. Therefore right now I am just zapping and recreating.. then later, the backend team will take-over with contiunous operations etc. Until now, I have recreated from scratch, probably 3 times and I had no issue, then after one week holiday I came back and now it has this.