MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

Some instantly recovering network blips can lead to source stalled for 80s. #26139

Open jubrad opened 6 months ago

jubrad commented 6 months ago

What version of Materialize are you using?

v0.92.0

What is the issue?

Because of the way we detect problematic long-lived pg connections for pg sources (and potentially Kafka and MySQL sources), a network blip that breaks the pg connection without cancelling it or sending RST packets can take 80s to recover from. This appears to be due to our keepalive interval being 10s, our keepalive retry count before closing a connection being 5, and our failed source retry delay being 30s: 10s × 5 = 50s to detect the dead connection, plus 30s before the source retries. https://github.com/MaterializeInc/materialize/blob/main/src/sql/src/session/vars/definitions.rs#L924C1-L942C3
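To make the 80s figure easy to re-check when tuning, here is a minimal sketch of the arithmetic. The class and field names are hypothetical and do not correspond to Materialize's actual variable names; only the default values (10s, 5, 30s) come from the description above. The "candidate" values are purely illustrative, not a proposal.

```python
from dataclasses import dataclass


@dataclass
class RecoveryBudget:
    """Hypothetical model of the worst-case recovery time described above."""

    keepalive_interval_s: int = 10  # seconds between keepalive probes
    keepalive_retries: int = 5      # unanswered probes before the connection is closed
    source_retry_delay_s: int = 30  # delay before a failed source is retried

    def detection_s(self) -> int:
        # Time for keepalives to declare the connection dead.
        return self.keepalive_interval_s * self.keepalive_retries

    def worst_case_s(self) -> int:
        # Detection time plus the fixed delay before the source retries.
        return self.detection_s() + self.source_retry_delay_s


current = RecoveryBudget()
# Illustrative values only: one of many combinations that fits a 30s budget.
candidate = RecoveryBudget(keepalive_interval_s=5, keepalive_retries=3, source_retry_delay_s=15)

print(current.worst_case_s())    # 80
print(candidate.worst_case_s())  # 30
```

This ignores any initial idle time before the first probe is sent, so the real worst case may be somewhat longer.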

What should we do

We should consider reducing these so that the total time to recover from such a network blip is 30s or less. Bonus points if we can get exponential backoff for source retry delays. We should also check our defaults for Kafka and MySQL. We may want to investigate TCP_USER_TIMEOUT alongside TCP keepalives if possible.
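For the exponential backoff idea, a rough sketch of what the retry schedule could look like. The function name and all parameters (1s initial delay, factor of 2, 30s cap) are assumptions chosen for illustration; nothing here reflects how source retries are implemented today.

```python
import itertools
from typing import Iterator


def backoff_delays(initial_s: float = 1.0, factor: float = 2.0, max_s: float = 30.0) -> Iterator[float]:
    """Yield retry delays that grow geometrically and are capped at max_s.

    All defaults are hypothetical values for illustration.
    """
    delay = initial_s
    while True:
        yield min(delay, max_s)
        delay = min(delay * factor, max_s)


# First seven delays with the illustrative defaults.
first = list(itertools.islice(backoff_delays(), 7))
print(first)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With a schedule like this, a blip that recovers instantly would cost roughly 1s of retry delay instead of 30s, while a persistent outage still settles at the existing 30s cadence. A real implementation would likely also want jitter.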

Concrete work items

From @benesch:

benesch commented 5 months ago

We may want to investigate TCP_USER_TIMEOUT as an alternative to tcp keepalives if possible.

Last I looked, TCP_USER_TIMEOUT is not an alternative to keepalives. From the linked article:

On its own, [TCP USER TIMEOUT] doesn't do much in the case of idle connections. The sockets will remain ESTABLISHED even if the connectivity is dropped.

So you have to enable it in conjunction with keepalives, but doing so is delicate:

This is a dangerous setting though, and if used in conjunction with TCP keepalives should be set to a value slightly lower than TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT. Otherwise it will affect, and potentially cancel out, the TCP_KEEPCNT value.
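To illustrate the rule quoted above, here is a sketch that derives TCP_USER_TIMEOUT from the keepalive settings and applies both to a socket. This is plain Python against the Linux socket options, not Materialize code. The 10s idle time and the 500ms margin are assumptions (the issue only states the 10s interval and 5 retries), and the helper names are hypothetical.

```python
import socket


def user_timeout_ms(keepidle_s: int, keepintvl_s: int, keepcnt: int, margin_ms: int = 500) -> int:
    """Return a TCP_USER_TIMEOUT slightly below KEEPIDLE + KEEPINTVL * KEEPCNT.

    margin_ms is a hypothetical safety margin; TCP_USER_TIMEOUT is in milliseconds.
    """
    budget_ms = (keepidle_s + keepintvl_s * keepcnt) * 1000
    return max(budget_ms - margin_ms, 0)


def configure(sock: socket.socket, keepidle_s: int = 10, keepintvl_s: int = 10, keepcnt: int = 5) -> int:
    """Enable keepalives and a matching user timeout; returns the timeout in ms."""
    timeout_ms = user_timeout_ms(keepidle_s, keepintvl_s, keepcnt)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The options below are platform-specific, so guard each one.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, keepidle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, keepintvl_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, keepcnt)
    if hasattr(socket, "TCP_USER_TIMEOUT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, timeout_ms)
    return timeout_ms


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    applied_ms = configure(s)
    print(applied_ms)  # 59500
```

Keeping the timeout derived from the keepalive values, rather than set independently, avoids the case where a later change to one silently cancels out TCP_KEEPCNT.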