dask / dask-gateway

A multi-tenant server for securely deploying and managing Dask clusters.
https://gateway.dask.org/
BSD 3-Clause "New" or "Revised" License

AttributeError: 'GatewayCluster' object has no attribute 'wait_for_workers' #782

Closed abprime closed 8 months ago

abprime commented 9 months ago

Describe the issue: With the new version 2023.9.0, wait_for_workers raises an AttributeError. It worked in the earlier version 2023.1.1. The distributed client's wait_for_workers method calls the cluster's method of the same name internally, and GatewayCluster does not have one.

Is there any alternative way to wait for the workers?

AttributeError                            Traceback (most recent call last)
Cell In[41], line 1
----> 1 client.wait_for_workers(8)

File ~/dde2/services/dde-analytics-core/.venv/lib/python3.10/site-packages/distributed/client.py:1469, in Client.wait_for_workers(self, n_workers, timeout)
   1466 if self.cluster is None:
   1467     return self.sync(self._wait_for_workers, n_workers, timeout=timeout)
-> 1469 return self.cluster.wait_for_workers(n_workers, timeout)

AttributeError: 'GatewayCluster' object has no attribute 'wait_for_workers'
2023-12-12 17:56:15,196 - distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
Exception in callback None()
handle: <Handle cancelled>
Traceback (most recent call last):
  File "/home/abprime/dde2/services/dde-analytics-core/.venv/lib/python3.10/site-packages/tornado/iostream.py", line 1367, in _do_ssl_handshake
    self.socket.do_handshake()
  File "/home/abprime/anaconda3/envs/py310/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/abprime/anaconda3/envs/py310/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/abprime/dde2/services/dde-analytics-core/.venv/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 202, in _handle_events
    handler_func(fileobj, events)
  File "/home/abprime/dde2/services/dde-analytics-core/.venv/lib/python3.10/site-packages/tornado/iostream.py", line 691, in _handle_events
    self._handle_read()
  File "/home/abprime/dde2/services/dde-analytics-core/.venv/lib/python3.10/site-packages/tornado/iostream.py", line 1427, in _handle_read
    self._do_ssl_handshake()
  File "/home/abprime/dde2/services/dde-analytics-core/.venv/lib/python3.10/site-packages/tornado/iostream.py", line 1385, in _do_ssl_handshake
    return self.close(exc_info=err)
  File "/home/abprime/dde2/services/dde-analytics-core/.venv/lib/python3.10/site-packages/tornado/iostream.py", line 606, in close
    self._signal_closed()
  File "/home/abprime/dde2/services/dde-analytics-core/.venv/lib/python3.10/site-packages/tornado/iostream.py", line 636, in _signal_closed
    self._ssl_connect_future.exception()
asyncio.exceptions.CancelledError

Minimal Complete Verifiable Example:

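from dask_gateway import Gateway
from dask_gateway.auth import BasicAuth

# DASK_GATEWAY_URL and DASK_BASIC_AUTH_PASSWORD are defined elsewhere.
gateway = Gateway(
    address=DASK_GATEWAY_URL,
    auth=BasicAuth(
        password=DASK_BASIC_AUTH_PASSWORD,
    ),
    asynchronous=False,
)
cluster = gateway.new_cluster()
cluster.adapt(minimum=6, maximum=12)
client = cluster.get_client()

client.wait_for_workers(6)  # this line raises the error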

Anything else we need to know?:

Environment:

TomAugspurger commented 9 months ago

This seems to have come from https://github.com/dask/distributed/pull/6700, which now requires a Cluster.wait_for_workers method that GatewayCluster doesn't have.

We could implement that (a PR would be great). It might be worth opening an issue on dask/distributed to confirm whether that change to the cluster interface was intentional (it looks incidental to the intent of the PR, but I haven't looked closely).
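
As for an alternative way to wait in the meantime, one option is to poll the worker count directly instead of going through Client.wait_for_workers. The following is only a minimal sketch, not part of dask-gateway or distributed: the helper name wait_for_n_workers and its parameters are hypothetical, and it assumes that client.scheduler_info()["workers"] lists the currently connected workers on the affected versions.

```python
import time
from typing import Callable


def wait_for_n_workers(
    count_workers: Callable[[], int],
    n_workers: int,
    timeout: float = 60.0,
    poll_interval: float = 0.5,
) -> int:
    """Block until count_workers() reports at least n_workers.

    Hypothetical helper: returns the observed worker count, or raises
    TimeoutError if the target is not reached within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = count_workers()
        if current >= n_workers:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {current} of {n_workers} workers arrived within {timeout} s"
            )
        time.sleep(poll_interval)


# Assumed usage with the `client` from the example in this issue:
# wait_for_n_workers(lambda: len(client.scheduler_info()["workers"]), 6)
```

Passing the counter in as a callable keeps the helper independent of any particular client object. Newer distributed releases may cap how many workers scheduler_info() returns by default, so this is meant only for the affected versions.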

consideRatio commented 8 months ago

This was an upstream issue, fixed in dask/distributed#8441, which is part of distributed>=2024.1.0, so using that version should resolve this, I think.
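
To check whether an environment already has the fix, the installed version can be compared against 2024.1.0. This is a rough sketch: the function name has_wait_for_workers_fix is hypothetical, and it assumes the version string starts with three dot-separated numbers, as in the versions quoted in this thread.

```python
import re

# First distributed release said in this thread to contain the fix.
FIXED_IN = (2024, 1, 0)


def has_wait_for_workers_fix(version: str) -> bool:
    """Return True if `version` is at least 2024.1.0.

    Hypothetical helper: only the leading "X.Y.Z" part is compared,
    so local or dev suffixes after it are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups()) >= FIXED_IN


# Assumed usage:
# import distributed
# has_wait_for_workers_fix(distributed.__version__)
```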