the-mikedavis closed 3 years ago
of course the first try finds us a nice `:gun_error` once the changes from #14 (see also #13) were merged. as you can see in the trace above (going bottom to top in reverse chronological order):
- `reason: :heartbeat_timeout`
- `:gun_error` about us not being able to send the heartbeat from 2 spans up (2 bullets down in this list)
- `:gun_error` on the send

I chaos-monkeyed this failure state by performing a rollout-restart (kubernetes) on the back-end to this front-end (the server to this client). (That back-end service has a `RollingUpdate` recreation strategy)
downtime for this bug would have been minimal, but leaving the above case unhandled would allow up to the heartbeat-timeout interval of 'dead' time for the connection. by fixing this we fall back from gun retry strategies to slipstream retry strategies, which is potentially faster because of the heartbeat-timeout mechanism
:point_up: that graph there comes from https://honeycomb.io, with connection telemetry shipped by our NFIBrokerage/slipstream_honeycomb adapter
starting to feel a need for #4 so we can understand how a connection gets in a state where this happens