Some instantly recovering network blips can stall a source for 80s.

MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.

What version of Materialize are you using?

v0.92.0

What is the issue?

Due to the way we detect problematic long-lived pg connections for pg sources (and potentially Kafka and MySQL sources), a network blip that breaks the pg connection, but does not cancel the connection or send RST packets, can take 80s to recover from. This appears to be due to our keepalive interval being 10s, our keepalive retry count before closing a connection being 5, and our failed source retry delay being 30s (10s × 5 + 30s = 80s). https://github.com/MaterializeInc/materialize/blob/main/src/sql/src/session/vars/definitions.rs#L924C1-L942C3
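To make the arithmetic explicit, here is a minimal sketch of the worst-case estimate. The function name `worst_case_recovery` and its parameters are hypothetical and do not correspond to anything in the Materialize codebase. It also assumes the dead connection is only noticed after every keepalive probe has gone unanswered, and it ignores any idle time before the first probe is sent.

```rust
use std::time::Duration;

/// Hypothetical helper (not part of Materialize): estimates the worst-case
/// time between a silent connection break and the next reconnect attempt.
///
/// Assumption: the break is only detected after `keepalive_retries` probes,
/// sent `keepalive_interval` apart, have all gone unanswered, and the source
/// then waits a fixed `source_retry_delay` before restarting.
fn worst_case_recovery(
    keepalive_interval: Duration,
    keepalive_retries: u32,
    source_retry_delay: Duration,
) -> Duration {
    // Detection time (interval * retries) plus the fixed restart delay.
    keepalive_interval * keepalive_retries + source_retry_delay
}
```

Calling `worst_case_recovery(Duration::from_secs(10), 5, Duration::from_secs(30))` with the values quoted in this issue yields a duration of 80 seconds.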

What should we do?

We should consider reducing these so that the total time to recover from such a network blip is 30s or less. Bonus points if we can get exponential backoff for source retry delays. We should also check our defaults for Kafka and MySQL. We may want to investigate TCP_USER_TIMEOUT alongside TCP keepalives if possible.

Concrete work items

From @benesch:

[x] Decrease our TCP keepalive retry count from 5 to 3
[ ] Adjust the storage SuspendAndRestart command to use exponential backoff, capped at 30s, instead of a fixed 30s backoff
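
The second work item could look roughly like the sketch below. It is an illustration only: `backoff_delay`, `attempt`, `base`, and `cap` are hypothetical names, the doubling schedule is an assumption, and nothing here reflects how the SuspendAndRestart command is actually implemented.

```rust
use std::time::Duration;

/// Hypothetical helper (not part of Materialize): delay before restart
/// attempt number `attempt` (0-based), doubling from `base` and never
/// exceeding `cap`.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // 2^attempt, saturating at u32::MAX instead of overflowing.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    // If the multiplication itself would overflow, fall back to the cap.
    base.checked_mul(factor).map_or(cap, |delay| delay.min(cap))
}
```

With a 1s base and the 30s cap from the work item, successive attempts would wait 1s, 2s, 4s, 8s, 16s, and then 30s for every later attempt, so a source that recovers on its first restart no longer pays the full 30s. A real implementation might also add jitter so that many sources do not restart in lockstep.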
