RTradeLtd / Temporal

☄️ Temporal is an easy-to-use, enterprise-grade interface into distributed and decentralized storage
https://temporal.cloud
MIT License
227 stars 40 forks source link

Enable Exponential Blocking Queue Reconnect Backoff #435

Open bonedaddy opened 4 years ago

bonedaddy commented 4 years ago

Why are you submitting this feature request Currently if the connection to the RabbitMQ server fails, and error occurs which we catch and attempt to resolve by a single reconnect. If this reconnect fails we shutdown the process. In certain situations it is entirely plausible the first reconnect attempt fails. I'll give an example:

The L2 network experiences a failure, and a root bridge switch in STP fails. There is a 4-15 second delay until the STP network reassembles itself, and traffic can flow. If this kind of thing happens with the current setup, the queues will fail to reconnect and the process will stop.

Describe the solution you'd like

What we should do is enable a lock on all queues, and API's, where if we experience a RabbitMQ disconnection, we block all new calls, and attempt to reconnect to the RabbitMQ server.

We will also need to modify the /v2/systems/check route to return a non Status OK (http 200) response, into a 503 service unavailable response. This will alllow load balancing systems to route requests around the failing API instance.

Whenever we fail to reconnect to RabbitMQ we should do some kind of exponential backoff retry. And if we fail, say 10 times in a row just stop completely.

Describe alternatives you've considered Restarting hosts automatically via docker, this isn't exactly nice.

Additional context N/A