Why are you submitting this feature request
Currently, if the connection to the RabbitMQ server fails, an error occurs, which we catch and attempt to resolve with a single reconnect. If this reconnect fails, we shut down the process. In certain situations it is entirely plausible that the first reconnect attempt fails. I'll give an example:
The L2 network experiences a failure, and a root bridge switch in STP fails. There is a 4-15 second delay until the STP network reassembles itself and traffic can flow again. If this happens with the current setup, the single reconnect attempt will fail inside that window and the process will stop.
Describe the solution you'd like
What we should do is enable a lock on all queues and APIs: if we experience a RabbitMQ disconnection, we block all new calls and attempt to reconnect to the RabbitMQ server.
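A minimal sketch of what such a lock could look like, assuming a threaded Python service. The `ConnectionGate` class and its method names are hypothetical and not part of the existing codebase; the idea is only that queue consumers and API handlers wait on a shared flag that the reconnect logic clears and sets.

```python
import threading
from typing import Optional


class ConnectionGate:
    """Hypothetical helper: holds back new calls while RabbitMQ is down."""

    def __init__(self) -> None:
        self._connected = threading.Event()
        self._connected.set()  # assume we start out connected

    def mark_disconnected(self) -> None:
        # Called when a RabbitMQ disconnection is detected.
        self._connected.clear()

    def mark_connected(self) -> None:
        # Called once a reconnect attempt succeeds; releases all waiters.
        self._connected.set()

    def is_connected(self) -> bool:
        return self._connected.is_set()

    def wait_until_connected(self, timeout: Optional[float] = None) -> bool:
        # Blocks the caller until the connection is back.
        # Returns False if the timeout expires first.
        return self._connected.wait(timeout)
```

A handler would call `gate.wait_until_connected(timeout=...)` before publishing and fail the request if it returns `False`, so callers are not blocked indefinitely.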
We will also need to modify the /v2/systems/check route so that, while RabbitMQ is disconnected, it returns a 503 Service Unavailable response instead of 200 OK. This will allow load balancing systems to route requests around the failing API instance.
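As a rough illustration of the health check behaviour, here is a framework-agnostic sketch. The function name `systems_check` and the JSON body shape are assumptions made for this example; only the 200 and 503 status codes come from the request above.

```python
import json
from http import HTTPStatus
from typing import Tuple


def systems_check(rabbitmq_connected: bool) -> Tuple[int, str]:
    """Hypothetical handler body for the /v2/systems/check route.

    Returns (status_code, json_body). The body fields are illustrative.
    """
    if rabbitmq_connected:
        return int(HTTPStatus.OK), json.dumps({"status": "ok"})
    # 503 tells load balancers to route around this instance.
    return (
        int(HTTPStatus.SERVICE_UNAVAILABLE),
        json.dumps({"status": "unavailable", "reason": "rabbitmq disconnected"}),
    )
```

The route would pass in whatever connection state the reconnect logic maintains, so the check flips back to 200 as soon as the connection is restored.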
Whenever we fail to reconnect to RabbitMQ, we should retry with some kind of exponential backoff. If we fail, say, 10 times in a row, we should stop the process completely.
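The retry policy could be sketched roughly as follows. Everything here is an assumption for illustration: the `reconnect_with_backoff` name, the 0.5 second base delay, the 30 second cap, and the use of the built-in `ConnectionError` (a real client library will raise its own exception type, which should be substituted). The `connect` callable stands in for whatever the existing reconnect code does.

```python
import time
from typing import Callable


def reconnect_with_backoff(
    connect: Callable[[], None],
    max_attempts: int = 10,      # "say 10 times in a row"
    base_delay: float = 0.5,     # assumed starting delay in seconds
    max_delay: float = 30.0,     # assumed cap on a single delay
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to reconnect, doubling the wait after each failure.

    Returns True on success, False once max_attempts have all failed.
    """
    for attempt in range(max_attempts):
        try:
            connect()
            return True
        except ConnectionError:
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            # Delays grow as 0.5, 1, 2, 4, ... and are capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    return False
```

With these assumed defaults, the waits add up to roughly two minutes before giving up, which comfortably covers the 4-15 second window in the STP example. If the function returns `False`, the caller would shut the process down as it does today.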
Describe alternatives you've considered
Restarting hosts automatically via Docker. This isn't exactly nice.
Additional context
N/A