Open jubrad opened 6 months ago
We may want to investigate TCP_USER_TIMEOUT as an alternative to tcp keepalives if possible.
Last I looked, TCP_USER_TIMEOUT is not an alternative to keepalives. From the linked article:
On its own, [TCP_USER_TIMEOUT] doesn't do much in the case of idle connections. The sockets will remain ESTABLISHED even if the connectivity is dropped.
So you have to enable it in conjunction with keepalives, but doing so is delicate:
This is a dangerous setting though, and if used in conjunction with TCP keepalives should be set to a value slightly lower than TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT. Otherwise it will affect, and potentially cancel out, the TCP_KEEPCNT value.
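To make the quoted rule concrete, here is a minimal sketch of deriving a TCP_USER_TIMEOUT value from the keepalive settings and applying both to a socket. The helper names, the 500 ms margin, and the keepidle value in the usage line are assumptions for illustration; the article only says "slightly lower", and this is not how Materialize configures its connections. The socket options used are Linux-specific, and TCP_USER_TIMEOUT is expressed in milliseconds while the keepalive options are in seconds.

```python
import socket


def user_timeout_ms(keepidle_s, keepintvl_s, keepcnt, margin_ms=500):
    """Return a TCP_USER_TIMEOUT (ms) slightly below the keepalive window.

    The window is TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT, as in the
    quoted article. margin_ms is a hypothetical choice for "slightly lower".
    """
    window_ms = (keepidle_s + keepintvl_s * keepcnt) * 1000
    return max(window_ms - margin_ms, 1)


def configure(sock, keepidle_s, keepintvl_s, keepcnt):
    """Enable keepalives plus a matching TCP_USER_TIMEOUT (Linux only)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, keepidle_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, keepintvl_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, keepcnt)
    sock.setsockopt(
        socket.IPPROTO_TCP,
        socket.TCP_USER_TIMEOUT,
        user_timeout_ms(keepidle_s, keepintvl_s, keepcnt),
    )
```

For example, with an assumed idle time of 10s, an interval of 10s, and 5 probes, `user_timeout_ms(10, 10, 5)` gives 59500 ms, just under the 60s keepalive window.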
What version of Materialize are you using?
v0.92.0
What is the issue?
Due to the way we detect problematic long-lived pg connections for pg sources (and potentially Kafka and MySQL sources), a network blip that breaks the pg connection but does not cancel the connection or send RST packets can take 80s to recover from. This appears to be due to our keepalive interval being 10s, our keepalive retry count before closing a connection being 5, and our failed source retry delay being 30s. https://github.com/MaterializeInc/materialize/blob/main/src/sql/src/session/vars/definitions.rs#L924C1-L942C3
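The 80s figure follows from adding the keepalive detection time to the source retry delay. A minimal sketch of that arithmetic, assuming the three values simply add up as described above (the function name is hypothetical, and any idle time before the first probe is ignored because the issue does not mention it):

```python
def worst_case_recovery_s(keepalive_interval_s, keepalive_retries, source_retry_delay_s):
    """Time to notice a dead connection via keepalives, plus the retry delay."""
    detection_s = keepalive_interval_s * keepalive_retries
    return detection_s + source_retry_delay_s


# Current values from the issue: 10s interval, 5 retries, 30s retry delay.
print(worst_case_recovery_s(10, 5, 30))  # 80
```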
What should we do
We should consider reducing these such that the total time to recover from such a network blip is 30s or less. Bonus points if we can get exponential backoff for source retry delays. We should also check our defaults for Kafka and MySQL. We may want to investigate TCP_USER_TIMEOUT alongside TCP keepalives if possible.
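As a rough illustration of the exponential backoff idea, the sketch below generates a capped sequence of retry delays. The function name and all parameter values (1s initial delay, doubling factor, 30s cap) are assumptions chosen for the example, not proposed defaults.

```python
def backoff_delays(initial_s=1.0, factor=2.0, cap_s=30.0, attempts=6):
    """Return retry delays that grow by `factor` and never exceed `cap_s`."""
    delays = []
    delay = initial_s
    for _ in range(attempts):
        delays.append(min(delay, cap_s))
        delay *= factor
    return delays


print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

With a schedule like this, the first retry after a detected failure happens quickly, and the delay only approaches the current fixed 30s after repeated failures.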
Concrete work items
From @benesch: