dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org

TCP timeouts connecting to cluster #5099

Open gjoseph92 opened 3 years ago

gjoseph92 commented 3 years ago

While running a profiling script that creates ~180 Coiled clusters (max 4 active at once), I'm hitting somewhat frequent TCP connection timeout errors during wait_for_workers, even with DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s. Anecdotally, I've noticed it most commonly happens with my 50- or 100-worker clusters, less so with the 2- or 20-worker ones.

I've also tried with https://github.com/dask/distributed/pull/5096 and I'm still seeing the errors, though I can't say statistically if they're any less frequent.

Possibly notable: I'm using 4 async Clients at once from the same process.

I’m not sure if this is a Dask problem, or a Coiled problem, or both. Wondering if anyone here has ideas on what to debug next. cc @fjetter @jacobtomlinson
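For concreteness, here's roughly what the script does (a minimal sketch, not the actual profiling script; the addresses, worker counts, and the run_one helper are placeholders):

# Minimal sketch of the setup described above (placeholders, not the real script):
# several async Clients in one process, with the comm connect timeout raised to 60s
# via dask config -- equivalent to DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s.
import asyncio

import dask
from distributed import Client


async def run_one(scheduler_address, cluster_size):
    # scheduler_address / cluster_size stand in for whatever each Coiled cluster provides.
    async with Client(scheduler_address, asynchronous=True) as client:
        await client.wait_for_workers(cluster_size)
        # ... run the trial's workload here ...


async def main():
    # In the real script these are 4 different Coiled clusters active at once.
    clusters = [("tls://<scheduler>:8786", 50)] * 4  # placeholder addresses
    with dask.config.set({"distributed.comm.timeouts.connect": "60s"}):
        await asyncio.gather(*(run_one(addr, n) for addr, n in clusters))


asyncio.run(main())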

Typical traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/distributed/comm/tcp.py", line 376, in connect
    stream = await self.client.connect(
  File "/usr/local/lib/python3.9/site-packages/tornado/tcpclient.py", line 275, in connect
    af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/distributed/comm/core.py", line 286, in connect
    comm = await asyncio.wait_for(
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "//run.py", line 153, in run_trial
    tasks, runtime, batched_send_stats = await trial(**vars)
  File "//run.py", line 119, in trial
    await client.wait_for_workers(cluster_size)
  File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1219, in _wait_for_workers
    info = await self.scheduler.identity()
  File "/usr/local/lib/python3.9/site-packages/distributed/core.py", line 863, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/usr/local/lib/python3.9/site-packages/distributed/core.py", line 1051, in connect
    raise exc
  File "/usr/local/lib/python3.9/site-packages/distributed/core.py", line 1035, in connect
    comm = await fut
  File "/usr/local/lib/python3.9/site-packages/distributed/comm/core.py", line 312, in connect
    raise OSError(
OSError: Timed out trying to connect to tls://54.188.70.54:8786 after 60 s
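Reading the traceback, the shape of the failure is roughly the following pattern (a simplified sketch, not the actual distributed/comm/core.py code): the low-level tornado connect is wrapped in asyncio.wait_for, and the resulting TimeoutError is re-raised as the OSError that the connection pool then surfaces.

# Simplified sketch of the failure path shown above (not the real
# distributed/comm/core.py implementation).
import asyncio


async def low_level_connect(addr):
    # Placeholder for the tornado-based TCP/TLS connector; sleeps a long time
    # to simulate a connection attempt that never completes.
    await asyncio.sleep(3600)


async def connect(addr, timeout=60):
    try:
        comm = await asyncio.wait_for(low_level_connect(addr), timeout=timeout)
    except asyncio.TimeoutError as exc:
        # This is where "Timed out trying to connect to ... after 60 s" comes from.
        raise OSError(f"Timed out trying to connect to {addr} after {timeout} s") from exc
    return comm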
fjetter commented 3 years ago

While there are many servers in play (#schedulers x #workers), each individual server should only be busy with a rather limited number of connections, on the order of about a hundred simultaneous connection attempts (e.g. each scheduler negotiates a connection to its workers plus your client). We actually allow many simultaneous connections at once (see the socket-backlog config option, which is set to 2K by default). This backlog size was introduced in https://github.com/dask/distributed/pull/1232
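For reference, a sketch of how to inspect or override that value (assuming the config key is distributed.comm.socket-backlog, as in distributed.yaml):

# Sketch: checking and lowering the listener backlog via dask config
# (assuming the key is distributed.comm.socket-backlog).
import dask

print(dask.config.get("distributed.comm.socket-backlog"))  # 2048 by default

# Lower it before starting the scheduler/workers; only listeners created
# while this setting is active pick it up.
with dask.config.set({"distributed.comm.socket-backlog": 128}):
    ...  # start Scheduler / Worker / Nanny here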

Are you always seeing the Timed out trying to connect to message, or do you also see Timed out during handshake while connecting to?

If there is a more or less homogeneous distribution of the two, this might genuinely be due to load on the respective schedulers. The latter is raised if there is an established connection but the metadata handshake fails. That handshake is really cheap but still requires a few cycles; if the scheduler on the other end is really busy, that might not be enough.

If it's just the first one, as you put it in the issue, I'm wondering if it's a low-level / kernel problem. I'm also wondering if we shouldn't restrict the size of the socket backlog and rather deal with connection rejections, since those are now properly retried with exponential backoff and jitter anyhow.
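By "retried with exponential backoff and jitter" I mean roughly the following pattern (a generic illustration, not distributed's actual retry code):

# Generic illustration of retrying a connect with exponential backoff and
# jitter (not distributed's actual retry implementation).
import asyncio
import random


async def connect_with_retries(do_connect, attempts=5, base_delay=0.1, max_delay=5.0):
    for attempt in range(attempts):
        try:
            return await do_connect()
        except OSError:
            if attempt == attempts - 1:
                raise
            # Exponential backoff capped at max_delay, randomized so many
            # clients don't retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            await asyncio.sleep(delay * random.uniform(0.5, 1.5))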

I would probably try lowering that backlog size and see if that helps (netutil sets a default of 128 ❗). If that doesn't help, I'd be curious to see TCP connection logs, but getting those is a bit more involved (e.g. using iptables).

fjetter commented 3 years ago

@gjoseph92 IIRC this was a Coiled issue. I vaguely remember workers not starting fast enough, so the scheduler would be dead before wait_for_workers finished.