Open gjoseph92 opened 3 years ago
While there are many servers in play (#schedulers x #workers), each server only seems to be busy with a rather limited number of connections, on the order of about a hundred simultaneous connection attempts (e.g. each scheduler will negotiate a connection to the workers plus your client). We actually allow many simultaneous connections at once (see the socket-backlog config option, which is set to 2K by default). This backlog size was introduced in https://github.com/dask/distributed/pull/1232
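To make the backlog concrete: it is the kernel-level queue of connections that have completed the TCP handshake but have not yet been `accept()`ed by the application. A minimal sketch with the standard `socket` module (the value 2048 mirrors distributed's default; this is not distributed's own code):

```python
import socket

# A listening socket only queues a finite number of not-yet-accepted
# connections; once the backlog is full, further attempts are refused
# or silently dropped by the kernel.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(2048)  # the backlog, analogous to distributed's socket-backlog
host, port = server.getsockname()

# A client connects and waits in the accept queue until the server
# gets around to accepting it.
client = socket.create_connection((host, port))
conn, addr = server.accept()
conn.close()
client.close()
server.close()
```

Note that the kernel may silently cap the requested backlog (e.g. at `net.core.somaxconn` on Linux), so the configured value is an upper bound, not a guarantee.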
Are you always seeing the `Timed out trying to connect to` message, or do you also see `Timed out during handshake while connecting to`?
If there is a more or less homogeneous distribution, this might genuinely be due to load on the respective schedulers. The latter one is raised when there is an established connection but the metadata handshake fails. That handshake is really cheap, but it still requires a few cycles; if the scheduler on the other end is really busy, that might not be sufficient.
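The two messages correspond to two distinct phases: establishing the transport, then exchanging handshake metadata over it. A minimal sketch of how the phases can be timed out separately (the helper and its arguments are hypothetical, not distributed's actual API):

```python
import asyncio

async def connect_with_handshake(open_conn, handshake, timeout=5.0):
    """Distinguish a connect timeout from a handshake timeout.

    `open_conn` is a coroutine function establishing the transport;
    `handshake` exchanges metadata over the established connection.
    Both names are illustrative placeholders.
    """
    try:
        conn = await asyncio.wait_for(open_conn(), timeout)
    except asyncio.TimeoutError:
        # No connection was ever established.
        raise OSError("Timed out trying to connect")
    try:
        await asyncio.wait_for(handshake(conn), timeout)
    except asyncio.TimeoutError:
        # Connection exists, but the peer was too busy to finish
        # the (cheap) metadata exchange in time.
        raise OSError("Timed out during handshake while connecting")
    return conn
```

Seeing only the first error suggests the problem is below the application layer; seeing the second points at a busy peer.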
If it's just the first one, as you put it in the issue, I'm wondering if it's a low-level / kernel problem. I'm also wondering if we shouldn't restrict the size of the socket backlog and rather deal with connection rejections, since they are now properly retried with exponential backoff and jitter anyhow.
I would probably try lowering that backlog size and see if this helps (netutil sets a default of 128 ❗). If that doesn't help, I'd be curious to see TCP connection logs, but getting those is a bit more involved (e.g. using iptables).
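For reference, the backoff-and-jitter retry pattern mentioned above looks roughly like this (a generic sketch; the function name and parameters here are illustrative, not distributed's internals):

```python
import random
import time

def retry_with_backoff(op, attempts=5, base=0.1, cap=5.0):
    """Call `op`, retrying on ConnectionError with capped
    exponential backoff plus jitter between attempts."""
    for n in range(attempts):
        try:
            return op()
        except ConnectionError:
            if n == attempts - 1:
                raise  # out of attempts, propagate the failure
            # Exponential delay, capped, scaled by random jitter so
            # that rejected clients don't all retry in lockstep.
            delay = min(cap, base * 2 ** n)
            time.sleep(delay * random.uniform(0.5, 1.0))
```

The jitter is what makes rejecting connections tolerable: a burst of refused clients spreads its retries out instead of hammering the scheduler again simultaneously.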
@gjoseph92 IIRC this was a Coiled issue. I vaguely remember workers not starting fast enough, so the scheduler would be dead before `wait_for_workers` finishes.
While running a profiling script that creates ~180 Coiled clusters (max 4 active at once), I'm hitting somewhat frequent TCP connection timeout errors during `wait_for_workers`, even with `DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s`. Anecdotally, I've noticed it most commonly happens with my 50- or 100-worker clusters, less so with the 2- or 20-worker ones.

I've also tried with https://github.com/dask/distributed/pull/5096 and I'm still seeing the errors, though I can't say statistically whether they're any less frequent.
It may be notable that I'm using 4 async Clients at once from the same process.
I’m not sure if this is a Dask problem, or a Coiled problem, or both. Wondering if anyone here has ideas on what to debug next. cc @fjetter @jacobtomlinson
Typical traceback: