coiled / feedback

A place to provide Coiled feedback

Joblib dask backend doesn't work through Windows firewall #86

Closed therriault closed 1 year ago

therriault commented 3 years ago

I'm running on Windows 10 and following the Dask for ML guidance to distribute the tuning of a model that I've been running locally. (It runs successfully: all the code works fine locally, and the only change I made was to set up a Dask cluster on Coiled and then swap the grid_search.fit(X, y) call for the with joblib... grid_search.fit(X, y) version.)
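For reference, the swap described above looks roughly like the sketch below. This is not the original code from the report: the dataset, the estimator, and the parameter grid are made up for illustration, and a threaded LocalCluster stands in for the remote Coiled cluster so that the example stays self-contained.

```python
import joblib
from dask.distributed import Client, LocalCluster
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def tune_with_dask(X, y):
    """Fit a small grid search through joblib's Dask backend."""
    grid_search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.1, 1.0, 10.0]},  # illustrative grid only
        cv=3,
        n_jobs=-1,
    )
    # Locally this would simply be: grid_search.fit(X, y)
    # The "dask" backend sends the work to the current default Client.
    with joblib.parallel_backend("dask"):
        grid_search.fit(X, y)
    return grid_search


def run_example():
    """Assumption: a local threaded cluster replaces the Coiled cluster."""
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    with LocalCluster(
        n_workers=2,
        threads_per_worker=1,
        processes=False,
        dashboard_address=None,  # no dashboard needed for the sketch
    ) as cluster:
        with Client(cluster):
            return tune_with_dask(X, y)
```

Calling run_example() returns the fitted GridSearchCV object. With a remote cluster, the scatter step inside the joblib backend is where the client first pushes data to the scheduler, which matches where the traceback below fails.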

When I do that, I get this error:

[Parallel(n_jobs=-1)]: Using backend DaskDistributedBackend with 40 concurrent workers.
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <zmq.eventloop.ioloop.ZMQIOLoop object at 0x000001FB06E52888>>, <Task finished coro=<DaskDistributedBackend.apply_async.<locals>.f() done, defined at C:\Anaconda\envs\Py37\lib\site-packages\joblib\_dask.py:316> exception=CommClosedError('in <closed TLS>: Stream is closed')>)
Traceback (most recent call last):
  File "C:\Anaconda\envs\Py37\lib\site-packages\distributed\comm\tcp.py", line 186, in read
    n_frames = await stream.read_bytes(8)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Anaconda\envs\Py37\lib\site-packages\tornado\ioloop.py", line 743, in _run_callback
    ret = callback()
  File "C:\Anaconda\envs\Py37\lib\site-packages\tornado\ioloop.py", line 767, in _discard_future_result
    future.result()
  File "C:\Anaconda\envs\Py37\lib\site-packages\joblib\_dask.py", line 317, in f
    batch, tasks = await self._to_func_args(func)
  File "C:\Anaconda\envs\Py37\lib\site-packages\joblib\_dask.py", line 304, in _to_func_args
    args = list(await maybe_to_futures(args))
  File "C:\Anaconda\envs\Py37\lib\site-packages\joblib\_dask.py", line 292, in maybe_to_futures
    hash=False
  File "C:\Anaconda\envs\Py37\lib\site-packages\distributed\client.py", line 2085, in _scatter
    timeout=timeout,
  File "C:\Anaconda\envs\Py37\lib\site-packages\distributed\core.py", line 883, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "C:\Anaconda\envs\Py37\lib\site-packages\distributed\core.py", line 666, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "C:\Anaconda\envs\Py37\lib\site-packages\distributed\comm\tcp.py", line 201, in read
    convert_stream_closed_error(self, e)
  File "C:\Anaconda\envs\Py37\lib\site-packages\distributed\comm\tcp.py", line 125, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc)) from exc
distributed.comm.core.CommClosedError: in <closed TLS>: Stream is closed

(Everything after the first line repeats identically 40 times.)

On a hunch, I turned off Windows Firewall, and that seemed to resolve it (it's still running, but it looks like the cluster is actually working based on the dashboard and I've received no errors, so if it fails that's a different problem). But is there guidance somewhere as to how to avoid this? It's not in the getting started docs and I don't see anything under the security or FAQ docs. And hopefully a more nuanced solution is possible - turning off the firewall entirely isn't really an ideal workaround for obvious reasons.

Thanks!

mrocklin commented 3 years ago

Hrm, interesting. Do you happen to know if your firewall blocks high random ports? We're currently moving data through port 8786, and we know that at some point this is likely to be an issue.

(also, hi @therriault !)

therriault commented 3 years ago

I don't know the specifics - I'm using the default Windows 10 Firewall and haven't done any particular rule customization, so it's pretty much whatever the stock restrictions are. This isn't my area of expertise, but if you can give me specific things to check I can look.
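One specific thing that could be checked is whether an outbound TCP connection to the scheduler port can be opened at all with the firewall enabled. The sketch below uses only the Python standard library; can_connect is a hypothetical helper name, and the host in the usage comment is a placeholder, not a real Coiled address. A successful connection would not rule out the firewall dropping the stream later, so treat the result as a hint rather than a diagnosis.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if an outbound TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example usage (placeholder host; substitute the scheduler's address):
#   for port in (8786, 443):
#       print(port, can_connect("scheduler.example.com", port))
```

Running the check once with the firewall on and once with it off, for both 8786 and 443, would show whether the port number itself makes a difference.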

mrocklin commented 3 years ago

A few minutes of googling "windows 10 default firewall ports" didn't give a clear list. I may spin up my dual boot at some point and try it out, but only after my TODO list is sufficiently low.

I've been looking for a reason to solve the high port issue though, so I may just pull the trigger on a solution regardless. I have some Dask over websockets code hidden away somewhere that it would be nice to publish.

mrocklin commented 3 years ago

(which would let us run over ports 80/443)

mrocklin commented 3 years ago

@marcosmoyano the next time you have some free time (probably in a week or two?) maybe now would be a good opportunity to move https://github.com/coiled/dask-ws over to github.com/dask/distributed (see the comms directory). I think that that would be a start to a good long-term solution to this problem.

therriault commented 3 years ago

Thanks, curious to see if that's solvable. Also, in the meantime - @mrocklin we've got a slack thread in the beta channel about other (maybe related, maybe not?) windows issues that you might have insight into.

mrocklin commented 3 years ago

> Thanks, curious to see if that's solvable

Yeah, to unpack my earlier comments, my guess is that some firewalls don't like accessing remote webservers on ports other than 80 (http) and 443 (https). That's ok, Dask can host itself on 80/443; however, in order to do so, it should try to look like an HTTP server. We can do this by switching from using tcp/tls sockets to using ws/wss websockets. This will allow the Dask scheduler to look just like any other ordinary website like google.com or wikipedia.org to a persnickety firewall.
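To make the "look like an HTTP server" point concrete: a websocket connection opens with an ordinary HTTP GET request carrying an Upgrade header, and the server answers with a key derived as specified in RFC 6455. The sketch below only builds that opening request and computes the expected accept key with the standard library. The function names are made up for illustration, and this is not how Dask's comms are implemented.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
_WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept_key(client_key):
    """Compute the Sec-WebSocket-Accept value for a client key."""
    digest = hashlib.sha1((client_key + _WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def upgrade_request(host, client_key, path="/"):
    """Build the plain HTTP request that opens a websocket connection."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {client_key}",
        "Sec-WebSocket-Version: 13",
        "",
        "",
    ]
    return "\r\n".join(lines)
```

Because the first bytes on the wire are a normal HTTP request to port 80 or 443, a firewall that only inspects the port and the start of the connection has little to distinguish this traffic from a regular web page load.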

> we've got a slack thread in the beta channel about other (maybe related, maybe not?) windows issues that you might have insight into

I'll take a look. I think that @necaris was pinging me about this earlier and I'm chatting with him in a bit

therriault commented 3 years ago

Just one postscript here: when I use the dask-ml backend instead of joblib, there's no need to drop the firewall. Not sure what's different, but the good news is that it seems like a perfectly serviceable alternative.

mrocklin commented 3 years ago

Thanks, that's actually really useful to know. Thanks Andrew for following up here.

shughes-uk commented 1 year ago

We allow people to mess with scheduler ports and such now. Seems like we have this resolved.