Closed ddebrunner closed 3 years ago
+1 to
I would say the operator should fail (throw an exception) once it cannot reconnect, thus highlighting to any monitoring tools that there is an issue with the application.
I would consider it ideal for users to have unlimited reconnection attempts and instead monitor the "isConnected" metric, but I agree that while in general we don't want Streams operators to go unhealthy because of an external system (making it look like a Streams problem), having that option is useful. Especially since once we "cannot reconnect" it has become a Streams problem.
Looking at this line, it would seem the intention is to cause the operator to fail; I need to verify what is happening.
The code in the async wrapper is slightly different, it throws a RuntimeException: https://github.com/IBMStreams/streamsx.messaging/blob/master/com.ibm.streamsx.messaging/impl/java/src/com/ibm/streamsx/messaging/mqtt/MqttAsyncClientWrapper.java#L137
which then will cause different behaviour at:
Not sure the MqttAsyncClientWrapper is used though.
If it is unused it would be good to remove it, to make understanding the operators easier.
I can reproduce the operator not being connected after 5 retries while remaining healthy.
In this case it was a user authorization error at line 262 of MqttClientWrapper.
+1 to
I would say the operator should fail (throw an exception) once it cannot reconnect, thus highlighting to any monitoring tools that there is an issue with the application.
If the exception message states the connection problem to the peer it should be obvious to the user that it is not a Streams problem, isn't it?
If the exception message states the connection problem to the peer it should be obvious to the user that it is not a Streams problem, isn't it?
Eventually yes, but in the meantime the users just see "failure in Streams". It depends on how proficient the user is before they even see the exception message unfortunately (at least in my experience at customers).
I think this behavior is acceptable, but I think of this parameter as pointless unless I want my operator to fail as a warning sign, i.e. why would I give up trying to connect? The failure is just a clear way to indicate the problem, rather than setting up more complex monitoring.
I did some more investigation: the operator is throwing an exception and expecting to fail (terminate) in this condition. I need to do some more digging as to why it isn't.
(This includes comments indicating that the expected behaviour is to fail.)
I observed the expected behavior for the MQTTSource (expected = retry N times, then restart until the relaunch counter has expired; after (relaunch-count + 1) x (reconnectionBound + 1) connection attempts the PE stays unhealthy). For the MQTTSink, I observed that the operator did not make the expected number of reconnection attempts. It simply stopped retrying and remained healthy. Maybe the MQTTSink deadlocked. This behavior is not reproducible, however. This issue needs more investigation.
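To make the attempt count above concrete, here is a small sketch of the arithmetic. The class and method names are hypothetical and not part of the toolkit, and the relaunch count of 3 is only an example value.

```java
// Hypothetical helper, not toolkit code: computes the total number of
// connection attempts before the PE stays unhealthy, using the formula
// (relaunch-count + 1) x (reconnectionBound + 1) quoted above.
final class AttemptBudget {

    static long totalConnectionAttempts(int relaunchCount, int reconnectionBound) {
        if (relaunchCount < 0 || reconnectionBound < 0) {
            // reconnectionBound = -1 means infinite retry, so no finite total exists.
            throw new IllegalArgumentException("both values must be >= 0 for a finite total");
        }
        // Each PE launch makes one initial attempt plus reconnectionBound retries;
        // there is one initial launch plus relaunchCount relaunches.
        return (relaunchCount + 1L) * (reconnectionBound + 1L);
    }

    public static void main(String[] args) {
        // Example values only: 3 relaunches, reconnectionBound = 5.
        System.out.println(totalConnectionAttempts(3, 5));
    }
}
```

With a relaunch count of 3 and reconnectionBound of 5, that is 4 launches of 6 attempts each.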
Customers are already using the new toolkit https://github.com/IBMStreams/streamsx.mqtt
The MQTT operators have a parameter reconnectionBound that limits the number of connection retries. This optional parameter of type int32 specifies the number of successive connections that are attempted for an operator. Specify 0 for no retry, n for n retries, -1 for infinite retry.
However the behaviour when the connection retry limit is reached is not documented.
The current behaviour is to just stop trying to connect leaving the operator in a healthy but totally pointless state.
I would say the operator should fail (throw an exception) once it cannot reconnect, thus highlighting to any monitoring tools that there is an issue with the application.
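A minimal sketch of the behaviour proposed here, assuming a connect attempt can be modelled as a function returning true on success. The class name, method name and exception type are hypothetical; this is not the code in MqttClientWrapper, and it omits the delay between retries that a real operator would use.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration of "fail once the retry limit is reached".
final class BoundedReconnect {

    /**
     * Makes one initial attempt plus up to reconnectionBound retries
     * (0 = no retry, n = n retries, -1 = retry forever).
     * Returns the number of attempts used on success; throws once the
     * bound is exhausted so the failure is visible to monitoring.
     */
    static int connectWithBound(BooleanSupplier connectAttempt, int reconnectionBound) {
        int attempts = 0;
        RuntimeException lastError = null;
        while (reconnectionBound < 0 || attempts <= reconnectionBound) {
            attempts++;
            try {
                if (connectAttempt.getAsBoolean()) {
                    return attempts;
                }
            } catch (RuntimeException e) {
                lastError = e; // remember the cause for the final exception
            }
        }
        throw new IllegalStateException(
                "Unable to connect after " + attempts + " attempts", lastError);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated peer that accepts the third connection attempt.
        int used = connectWithBound(() -> ++calls[0] == 3, 5);
        System.out.println("connected after " + used + " attempts");
    }
}
```

Throwing rather than returning quietly is the design choice under discussion: an uncaught exception lets the operator terminate instead of staying healthy but disconnected.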