Closed: anand-ranganathan closed this issue 8 years ago.
Currently there is no reconnection policy other than using RabbitMQ's built-in network recovery here: https://www.rabbitmq.com/api-guide.html#recovery
If the operator loses its connection to the RabbitMQ server after a connection has already been established, the operator will not crash; it will simply retry establishing the connection every 5 seconds (if automatic recovery is enabled).
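For reference, here is a minimal sketch of enabling that client-side recovery with the plain RabbitMQ Java client; the class name and host are placeholders, and this is illustrative rather than the operator's actual code:

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class RecoveryExample {
    // Enable the client's automatic recovery; it only applies after a connection was established once.
    public static Connection connect(String host) throws IOException, TimeoutException {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost(host);                      // RabbitMQ server host (placeholder)
        factory.setAutomaticRecoveryEnabled(true);  // recover the connection after network failures
        factory.setNetworkRecoveryInterval(5000);   // wait 5 seconds between recovery attempts
        return factory.newConnection();             // the initial connect still fails if the host is unreachable
    }
}
```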
The operator currently should only crash if there is no connection to the RabbitMQ host on startup. Is this what you have been seeing?
Yes, the operator crashes at startup. Is it possible for it to not crash during initialization and instead keep trying to reconnect every 5 seconds?
It would be possible to do this, but I'm not sure how desirable it is. As a developer, I generally want to see my operator crash if it can't make the initial connection to a server. This behavior is consistent with the Kafka operators, although different from JMS in my understanding.
Does the concern center around the missing connection looking like a Streams runtime problem, when in fact it's about the server being unreachable?
I think this is a request to implement a reconnection policy in the RabbitMQ operators. The JMS, MQTT, and XMS operators all implement a consistent reconnection policy that can be configured by the end user. The fact that the operator crashes during initialization prevents it from attempting to reconnect when a connection has failed.
In MQTT, we implemented connection and reconnection in the process method, to make sure that the operator keeps trying to connect until the reconnection policy has expired. Only when the reconnection policy has expired should the operator crash.
Prior to Streams v4, the operator would crash and remain unhealthy. In v4, all operators have become restartable by default. This means the operator will crash and then be restarted by default when the reconnection policy has expired. If customers do not want this to happen, they can set the operators to be non-restartable.
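To illustrate the policy being described, here is a hedged sketch of that kind of retry loop using the RabbitMQ Java client; reconnectionBound and periodMillis are hypothetical parameters, not the toolkit's actual ones:

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class ReconnectPolicy {
    // Keep trying to connect until the reconnection policy expires; a bound < 0 means "retry forever".
    public static Connection connectWithRetry(ConnectionFactory factory,
                                              int reconnectionBound,
                                              long periodMillis) throws InterruptedException {
        int attempts = 0;
        while (reconnectionBound < 0 || attempts < reconnectionBound) {
            try {
                return factory.newConnection();
            } catch (IOException | TimeoutException e) {
                attempts++;                  // keep trying until the policy expires
                Thread.sleep(periodMillis);  // wait between attempts
            }
        }
        // Policy expired: let the operator crash (on Streams v4 the PE is then restarted by default).
        throw new IllegalStateException("Could not connect to RabbitMQ after " + attempts + " attempts");
    }
}
```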
+1 on uniform behavior for surviving transient server unavailability, be it at job startup (initial connect) or later on.
I totally understand what you would want to do as a developer :) - so that you could catch any defects ASAP.
In production though, SAs will be monitoring 50-100 applications, and they do not want to see things go yellow if it is not a faulty Streams operator/application. The server might be down, and a different set of SAs will see that in a different dashboard monitoring those servers. We will also be monitoring tuple flow rates at different sources and sinks, so we will know something is not right when these rates are abnormal. This will be fed into their enterprise application monitoring framework and the right teams will get alerted.
So when an operator shows as "unhealthy" in the Streams console and the operator/PE log shows exceptions and other dreaded messages, people start thinking that the issue is with the operator.
Anand, in that case, would an "infinite connection retry policy" satisfy your requirement? If the customer has set the operator to retry infinitely, then the operator will never crash. There will be no tuple flow, which can also signal that something is wrong.
For consistency, all messaging operators should have such connection-status metrics, as a definitive indication that something is wrong.
Added issue #241 for implementing metrics.
I would like to move forward with #241. Since that will let users know whether they are connected, developers will also have a good way of seeing the connection state.
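As a rough illustration of what such a metric could look like in a Java operator, assuming the IBM Streams Java Operator API; the class and the metric name isConnected are placeholders, not what #241 will necessarily use:

```java
import com.ibm.streams.operator.AbstractOperator;
import com.ibm.streams.operator.OperatorContext;
import com.ibm.streams.operator.metrics.Metric;

public class ConnectionMetricSketch extends AbstractOperator {

    private Metric isConnected;

    @Override
    public synchronized void initialize(OperatorContext context) throws Exception {
        super.initialize(context);
        // Expose a gauge that dashboards can watch instead of relying on the PE health state.
        isConnected = context.getMetrics().createCustomMetric(
                "isConnected",
                "1 if the operator is connected to the RabbitMQ server, 0 otherwise",
                Metric.Kind.GAUGE);
        isConnected.setValue(0);
    }

    // Would be called from the connection/recovery code whenever the connection state changes.
    protected void setConnectedState(boolean connected) {
        isConnected.setValue(connected ? 1 : 0);
    }
}
```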
I propose adding a networkRecoveryInterval parameter that allows users to configure the RabbitMQ client reconnection interval in the case of network failure: factory.setNetworkRecoveryInterval(networkRecoveryInterval);
We will use that same parameter as the timeout between attempts at setting up the initial connection. If automatic recovery is set to false, then we will not reattempt the initial connection either.
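Here is a hedged sketch of how that proposal could look, reusing networkRecoveryInterval for both the client recovery interval and the delay between initial-connection attempts; the surrounding class and method are illustrative only:

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class InitialConnectSketch {
    // autoRecovery and networkRecoveryInterval mirror the proposed operator parameters.
    public static Connection setupConnection(String host,
                                             boolean autoRecovery,
                                             long networkRecoveryInterval) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost(host);
        factory.setAutomaticRecoveryEnabled(autoRecovery);
        factory.setNetworkRecoveryInterval(networkRecoveryInterval); // used by the client after a network failure

        while (true) {
            try {
                return factory.newConnection();
            } catch (IOException | TimeoutException e) {
                if (!autoRecovery) {
                    throw e;  // autorecovery disabled: do not reattempt the initial connection
                }
                // Reuse the same interval as the timeout between initial connection attempts.
                Thread.sleep(networkRecoveryInterval);
            }
        }
    }
}
```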
Ideally, I would like to see the following behavior: