Open ximon18 opened 2 months ago
I’m not sure if considering non-fatal errors with TCP is necessary. Both write_all
and read_exact
will already cover ErrorKind::Interrupted
. ErrorKind::TimedOut
should probably treated as fatal – it means the socket has tried re-sending the data a couple times and then has waited.
Retrying when the configured response_write_timeout
feels a bit odd, given that this means the actual timeout is response_write_timeout * (response_write_retries + 1)
, which is a bit unexpected.
Looking up ErrorKind::TimedOut
, I now even wonder if having your own write timeout makes sense or whether it is better to just rely on the TCP stack and its configuration.
This PR contains a couple of fixes for TCP connection error handling:
It also adds a metric for the number of aborted writes that occur.
These issues were discovered when dig disconnected because a large AXFR transfer was slow to start and the default dig 5 second timeout expired.
Interestingly these issues were easier to fix in the WIP
xfr
branch that I found them on as that is based on theservice-layering
branch which simplifies the response handling.