dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

ValueError: invalid operation on non-started TCPListener when ending script #4477

Open orioltinto opened 3 years ago

orioltinto commented 3 years ago

When I distribute the work across workers launched with SLURMCluster, I get the following error when the program exits.
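For reference, a minimal sketch of this kind of setup; the queue name, resources and job count are hypothetical placeholders, not the real configuration:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Hypothetical SLURM settings: partition name, per-job cores/memory and walltime
cluster = SLURMCluster(
    queue="compute",
    cores=4,
    memory="8GB",
    walltime="00:30:00",
)
cluster.scale(jobs=4)        # ask SLURM for 4 worker jobs
client = Client(cluster)

# ... build and compute the delayed tasks here ...
# The traceback below is printed when the script exits.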

tornado.application - ERROR - Exception in callback functools.partial(<function TCPServer._handle_connection.<locals>.<lambda> at 0x7f284335f5e0>, <Task finished name='Task-1891' coro=<BaseTCPListener._handle_stream() done, defined at .../venv/lib/python3.8/site-packages/distributed/comm/tcp.py:459> exception=ValueError('invalid operation on non-started TCPListener')>)
Traceback (most recent call last):
  File "/software/opt/bionic/x86_64/python/3.8-2020.11/lib/python3.8/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/software/opt/bionic/x86_64/python/3.8-2020.11/lib/python3.8/site-packages/tornado/tcpserver.py", line 327, in <lambda>
    gen.convert_yielded(future), lambda f: f.result()
  File ".../venv/lib/python3.8/site-packages/distributed/comm/tcp.py", line 465, in _handle_stream
    logger.debug("Incoming connection from %r to %r", address, self.contact_address)
  File ".../venv/lib/python3.8/site-packages/distributed/comm/tcp.py", line 501, in contact_address
    host, port = self.get_host_port()
  File ".../venv/lib/python3.8/site-packages/distributed/comm/tcp.py", line 482, in get_host_port
    self._check_started()
  File ".../venv/lib/python3.8/site-packages/distributed/comm/tcp.py", line 457, in _check_started
    raise ValueError("invalid operation on non-started TCPListener")
ValueError: invalid operation on non-started TCPListener

It looks like all the computation actually finishes, but I can't work out why I get this error.

As suggested in this issue, I tried using a context manager to see if it would help, but I get the same error.
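A sketch of that context-manager variant, reusing the same hypothetical cluster settings as above; both the cluster and the client are closed when their with blocks exit:

import dask
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

with SLURMCluster(queue="compute", cores=4, memory="8GB", walltime="00:30:00") as cluster:
    cluster.scale(jobs=4)
    with Client(cluster) as client:
        # placeholder workload; the real tasks read and write files with xarray
        results = dask.compute(*[dask.delayed(pow)(i, 2) for i in range(10)])
# The same ValueError still appears after the with blocks exit.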

The actual work consists of delayed tasks that read and write files using xarray, and the error does not happen if I don't use SLURMCluster to launch the workers.
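The shape of the workload is roughly the following; the paths and the processing step are placeholders, not the real code:

import dask
import xarray as xr

@dask.delayed
def process_file(in_path, out_path):
    # open one input file, apply a placeholder reduction, write the result
    with xr.open_dataset(in_path) as ds:
        ds.mean(dim="time").to_netcdf(out_path)
    return out_path

tasks = [process_file(f"input_{i}.nc", f"output_{i}.nc") for i in range(10)]
dask.compute(*tasks)   # runs on the SLURM workers while a Client is active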

pyrito commented 3 years ago

Hello,

I also seem to be getting the exact same error. My job does seem to complete, however...

Thanks!

pl-marasco commented 2 years ago

@orioltinto

I'm facing a similar situation with a PBSCluster doing almost the same thing you describe: delayed tasks reading and writing xarray data to a Zarr store. Did you manage to find the origin of this issue or a workaround? In my case I'm on Python 3.10, so it doesn't seem to be version related.
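For context, a sketch of the kind of Zarr round-trip involved, with hypothetical paths and PBS settings; to_zarr(compute=False) returns a Delayed that only runs at the final compute():

import xarray as xr
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Hypothetical PBS settings and store paths
with PBSCluster(queue="workq", cores=4, memory="8GB") as cluster:
    cluster.scale(jobs=2)
    with Client(cluster) as client:
        ds = xr.open_zarr("input.zarr")                                  # lazy, chunked read
        delayed = ds.mean(dim="time").to_zarr("output.zarr", compute=False)
        delayed.compute()
# The error appears after everything has finished, as described above.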

etejedor commented 1 year ago

I get the same error with HTCondorCluster and TLS configured.
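For completeness, a sketch of what such a TLS setup might look like; the certificate paths are placeholders, and it assumes HTCondorCluster accepts a security= argument like the other deployment classes:

from dask.distributed import Client
from distributed.security import Security
from dask_jobqueue import HTCondorCluster

# Placeholder certificate paths
security = Security(
    tls_ca_file="ca.pem",
    tls_client_cert="client-cert.pem",
    tls_client_key="client-key.pem",
    require_encryption=True,
)

# Assumes HTCondorCluster forwards security= to the scheduler and workers
with HTCondorCluster(cores=2, memory="4GB", disk="2GB", security=security) as cluster:
    cluster.scale(jobs=2)
    with Client(cluster, security=security) as client:
        pass  # ... delayed work here ...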