Closed: anand-ranganathan closed this issue 8 years ago.
Currently there is no reconnection policy other than using RabbitMQ's built-in network recovery here: https://www.rabbitmq.com/api-guide.html#recovery
If the operator loses its connection to the RabbitMQ server after a connection has already been established, the operator will not crash; it will simply retry establishing the connection every 5 seconds (if automatic recovery is enabled).
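For reference, here is a minimal sketch of enabling that client-side recovery with the plain RabbitMQ Java client; the class name and host are placeholders, and this is illustrative rather than the operator's actual code:

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class RecoveryExample {
    // Enable the client's automatic recovery; it only applies after a connection was established once.
    public static Connection connect(String host) throws IOException, TimeoutException {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost(host);                      // RabbitMQ server host (placeholder)
        factory.setAutomaticRecoveryEnabled(true);  // recover the connection after network failures
        factory.setNetworkRecoveryInterval(5000);   // wait 5 seconds between recovery attempts
        return factory.newConnection();             // the initial connect still fails if the host is unreachable
    }
}
```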
The operator currently should only crash if there is no connection to the RabbitMQ host on startup. Is this what you have been seeing?
Yes, the operator crashes at startup. Is it possible for it to not crash during initialization and instead keep trying to reconnect every 5 seconds?
It would be possible to do this, but I'm not sure how desirable it is. As a developer, I generally want to see my operator crash if it can't make the initial connection to a server. This behavior is consistent with the Kafka operators, although different from JMS in my understanding.
Does the concern center around the missing connection looking like a Streams runtime problem, when in fact it's about the server being unreachable?
I think this is a request to implement a reconnection policy in the RabbitMQ operators. The JMS, MQTT, and XMS operators all implement a consistent reconnection policy that can be configured by the end user. The fact that the operator crashes during initialization prevents it from attempting to reconnect when a connection has failed.
In MQTT, we implemented connection and reconnection in the process method, to make sure that the operator keeps trying to connect until the reconnection policy has expired. Only when the reconnection policy has expired should the operator crash.
Prior to Streams v4, the operator would crash and remain unhealthy. In v4, all operators have become restartable by default. This means the operator will crash and then be restarted by default when the reconnection policy has expired. If customers do not want this to happen, they can set the operators to be non-restartable.
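To illustrate the policy being described, here is a hedged sketch of that kind of retry loop using the RabbitMQ Java client; reconnectionBound and periodMillis are hypothetical parameters, not the toolkit's actual ones:

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class ReconnectPolicy {
    // Keep trying to connect until the reconnection policy expires; a bound < 0 means "retry forever".
    public static Connection connectWithRetry(ConnectionFactory factory,
                                              int reconnectionBound,
                                              long periodMillis) throws InterruptedException {
        int attempts = 0;
        while (reconnectionBound < 0 || attempts < reconnectionBound) {
            try {
                return factory.newConnection();
            } catch (IOException | TimeoutException e) {
                attempts++;                  // keep trying until the policy expires
                Thread.sleep(periodMillis);  // wait between attempts
            }
        }
        // Policy expired: let the operator crash (on Streams v4 the PE is then restarted by default).
        throw new IllegalStateException("Could not connect to RabbitMQ after " + attempts + " attempts");
    }
}
```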
+1 on uniform behavior for surviving transient server unavailability, be it at job startup (initial connect) or later on.
I totally understand what you would want to do as a developer :) - so that you could catch any defects ASAP.
In production though, SAs will be monitoring 50-100 applications, and they do not want to see things go yellow if it is not a faulty Streams operator/application. The server might be down, and a different set of SAs will see that in a different dashboard monitoring those servers. We will also be monitoring tuple flow rates at different sources and sinks, so we will know something is not right when these rates are abnormal. This will be fed into their enterprise application monitoring framework and the right teams will get alerted.
So when an operator shows as "unhealthy" in the Streams console and the operator/PE log shows exceptions and other dreaded messages, people start thinking that the issue is with the operator.
Anand, in that case, would an "infinite connection retry policy" satisfy your requirement? If the customer has set the operator to retry infinitely, then the operator will never crash. There will be no tuple flow, which can also signal that something is wrong.
For consistency, all messaging operators should have such connection-status metrics, as a definitive indication that something is wrong.
Added issue #241 for implementing metrics.
I would like to move forward with #241. Since that will let users know whether they are connected, developers will also have a good way of seeing the connection state.
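As a rough illustration of what such a metric could look like in a Java operator, assuming the IBM Streams Java Operator API; the class and the metric name isConnected are placeholders, not what #241 will necessarily use:

```java
import com.ibm.streams.operator.AbstractOperator;
import com.ibm.streams.operator.OperatorContext;
import com.ibm.streams.operator.metrics.Metric;

public class ConnectionMetricSketch extends AbstractOperator {

    private Metric isConnected;

    @Override
    public synchronized void initialize(OperatorContext context) throws Exception {
        super.initialize(context);
        // Expose a gauge that dashboards can watch instead of relying on the PE health state.
        isConnected = context.getMetrics().createCustomMetric(
                "isConnected",
                "1 if the operator is connected to the RabbitMQ server, 0 otherwise",
                Metric.Kind.GAUGE);
        isConnected.setValue(0);
    }

    // Would be called from the connection/recovery code whenever the connection state changes.
    protected void setConnectedState(boolean connected) {
        isConnected.setValue(connected ? 1 : 0);
    }
}
```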
I propose adding a networkRecoveryInterval parameter that allows users to configure the RabbitMQ client reconnection interval in the case of network failure: factory.setNetworkRecoveryInterval(networkRecoveryInterval);
We will use that same parameter as the timeout between attempts at setting up the initial connection. If automatic recovery is set to false, then we will not reattempt the initial connection either.
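Here is a hedged sketch of how that proposal could look, reusing networkRecoveryInterval for both the client recovery interval and the delay between initial-connection attempts; the surrounding class and method are illustrative only:

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class InitialConnectSketch {
    // autoRecovery and networkRecoveryInterval mirror the proposed operator parameters.
    public static Connection setupConnection(String host,
                                             boolean autoRecovery,
                                             long networkRecoveryInterval) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost(host);
        factory.setAutomaticRecoveryEnabled(autoRecovery);
        factory.setNetworkRecoveryInterval(networkRecoveryInterval); // used by the client after a network failure

        while (true) {
            try {
                return factory.newConnection();
            } catch (IOException | TimeoutException e) {
                if (!autoRecovery) {
                    throw e;  // autorecovery disabled: do not reattempt the initial connection
                }
                // Reuse the same interval as the timeout between initial connection attempts.
                Thread.sleep(networkRecoveryInterval);
            }
        }
    }
}
```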
Ideally, I would like to see the following behavior: