[SOL-79060] Health indicator should capture binding health statuses

GreenRover commented 2 years ago

Scenario:

Application starts up
A queue will be provisioned
Some one deleted the queues (by a mistake)

Result:

2022-06-26T08:03:52.664+0000 ERROR Unable to send request message
org.springframework.messaging.MessagingException: Unable to send message to topic tms/monitoring/monalesy/p/v1/serviceState/request; nested exception is com.solacesystems.jcsmp.JCSMPTransportException: (JCSMPTransportException) Error receiving data from underlying connection.
    at com.solace.spring.cloud.stream.binder.util.ErrorChannelSendingCorrelationKey.send(ErrorChannelSendingCorrelationKey.java:57) ~[spring-cloud-stream-binder-solace-core-3.2.1.jar!/:?]
    at com.solace.spring.cloud.stream.binder.outbound.JCSMPOutboundMessageHandler.handleMessagingException(JCSMPOutboundMessageHandler.java:142) ~[spring-cloud-stream-binder-solace-core-3.2.1.jar!/:?]
    at com.solace.spring.cloud.stream.binder.outbound.JCSMPOutboundMessageHandler.handleMessage(JCSMPOutboundMessageHandler.java:98) ~[spring-cloud-stream-binder-solace-core-3.2.1.jar!/:?]
    at org.springframework.cloud.stream.binder.AbstractMessageChannelBinder$SendingHandler.handleMessageInternal(AbstractMessageChannelBinder.java:1074) ~[spring-cloud-stream-3.2.1.jar!/:3.2.1]
    at org.springframework.integration.handler.AbstractMessageHandler.handleMessage(AbstractMessageHandler.java:56) ~[spring-integration-core-5.5.8.jar!/:5.5.8]
    at org.springframework.integration.dispatcher.AbstractDispatcher.tryOptimizedDispatch(AbstractDispatcher.java:115) ~[spring-integration-core-5.5.8.jar!/:5.5.8]
    at org.springframework.integration.dispatcher.UnicastingDispatcher.doDispatch(UnicastingDispatcher.java:133) ~[spring-integration-core-5.5.8.jar!/:5.5.8]
    at org.springframework.integration.dispatcher.UnicastingDispatcher.dispatch(UnicastingDispatcher.java:106) ~[spring-integration-core-5.5.8.jar!/:5.5.8]
    at org.springframework.integration.channel.AbstractSubscribableChannel.doSend(AbstractSubscribableChannel.java:72) ~[spring-integration-core-5.5.8.jar!/:5.5.8]
    at org.springframework.integration.channel.AbstractMessageChannel.send(AbstractMessageChannel.java:317) ~[spring-integration-core-5.5.8.jar!/:5.5.8]
    at org.springframework.integration.channel.AbstractMessageChannel.send(AbstractMessageChannel.java:272) ~[spring-integration-core-5.5.8.jar!/:5.5.8]
    at org.springframework.cloud.stream.function.StreamBridge.send(StreamBridge.java:222) ~[spring-cloud-stream-3.2.1.jar!/:3.2.1]
    at org.springframework.cloud.stream.function.StreamBridge.send(StreamBridge.java:164) ~[spring-cloud-stream-3.2.1.jar!/:3.2.1]
    at org.springframework.cloud.stream.function.StreamBridge.send(StreamBridge.java:144) ~[spring-cloud-stream-3.2.1.jar!/:3.2.1]

Expected: The SolaceBinderHealthIndicator of the application change to unhealthy.

Because mas last 10 pull requests was not merged i only create an issue

GreenRover commented 2 years ago

Ping @Mrc0113

Nephery commented 2 years ago

The solace binder health indicator currently only captures the health of the binder's PubSub+ session. It does not currently capture any binding health statuses since their flows were configured to try reconnecting forever. But it seems that some cases (like this one) aren't captured by the flow reconnect feature.

To add this, we'll need to do some digging around to see if there's a good way to add binding statuses to the binder's health indicator (since its not a composite health indicator) or if/how other SCSt binders capture binding health statuses.

If there's no good way to add this to the existing indicator, we might have to add some custom config option to change the solace binder health indicator into a composite health indicator, which when enabled, would show the health statuses for both the binder's session as well as for all its bindings.

mackenza commented 2 years ago

This has been logged in the Solace Jira and we are targeting Q1CY2023 for a fix.

Nephery commented 2 years ago

@GreenRover Just double checking, but did you post the correct stacktrace?

Looking again at the one you posted, this looks like an error on the producer side. But you shouldn't get this error by just deleting the queue. Deleting the queue should result in a different error, and it would be one on the consumer.

I was only able to reproduce a similar stacktrace by killing the session (e.g. session reconnect attempts exhausted). But the health for that is already captured by the existing health indicator (i.e. the PubSub+ session health).

GreenRover commented 2 years ago

I retestet now with:

2.3.2 2021.0.4 A "deleted queue" gives following log message: 2022-10-25T07:33:10.381+0200 INFO Client-1: Error Response (503) - Service Unavailable; subCode: 50; flowId=7 2022-10-25T07:33:10.410+0200 INFO Client-1: Got BIND ('scst/wk/last_value/plain/last_value/state/_') Error Response (503) - Unknown Queue But health is still "UP". --> _Should be "DOWN"_ Not connected but reconnecting: Health is still "UP". (sub status: RECONNECTING) --> _OK as expected_ Sending via StreamBridge is blocking as long in status=RECONNECTING --> _OK as expected_ Not connected end of reconnectes: Health is still "DOWN". (sub status: DOWN) --> _OK as expected_ org.springframework.messaging.MessagingException: Unable to send message to topic tms/monitoring/monalesy/p/v1/serviceState/request; nested exception is com.solacesystems.jcsmp.JCSMPTransportException: (JCSMPTransportException) Error receiving data from underlying connection. I am at the moment not able to reproduce. Have just seen this while a not 100% understood production issue.

Nephery commented 2 years ago

Is the queue that you're deleting the one the input binding is consuming messages from? Or do you have a queue subscribed to the output binding destination, tms/monitoring/monalesy/p/v1/serviceState/request, and that is the queue you are deleting?

GreenRover commented 1 year ago

the queue i delete is the from the input binding

GreenRover commented 1 year ago

With release 2.5.0 when a queue was deleted manually the service logs:

2023-05-05T10:46:20.725+0200 WARN Received error while trying to read message from endpoint scst/wk/sensor.XXX/plain/sensor/FOOO/_/_
com.solacesystems.jcsmp.JCSMPErrorResponseException: 503: Unknown Queue
    at com.solacesystems.jcsmp.impl.flow.BindRequestTask.execute(BindRequestTask.java:211) ~[sol-jcsmp-10.16.0.jar:?]
    at com.solacesystems.jcsmp.impl.flow.SubFlowManagerImpl.handleAssuredCtrlMessage(SubFlowManagerImpl.java:570) ~[sol-jcsmp-10.16.0.jar:?]
    at com.solacesystems.jcsmp.protocol.impl.TcpClientChannel.handleAssuredCtrlMsg(TcpClientChannel.java:1768) ~[sol-jcsmp-10.16.0.jar:?]
    at com.solacesystems.jcsmp.protocol.impl.TcpClientChannel.handleMessage(TcpClientChannel.java:1733) ~[sol-jcsmp-10.16.0.jar:?]
    at com.solacesystems.jcsmp.protocol.nio.impl.SubscriberMessageReader.processRead(SubscriberMessageReader.java:98) ~[sol-jcsmp-10.16.0.jar:?]
    at com.solacesystems.jcsmp.protocol.nio.impl.SubscriberMessageReader.read(SubscriberMessageReader.java:140) ~[sol-jcsmp-10.16.0.jar:?]
    at com.solacesystems.jcsmp.protocol.smf.SimpleSmfClient.read(SimpleSmfClient.java:1206) ~[sol-jcsmp-10.16.0.jar:?]
    at com.solacesystems.jcsmp.protocol.nio.impl.SyncEventDispatcherReactor.processReactorChannels(SyncEventDispatcherReactor.java:206) ~[sol-jcsmp-10.16.0.jar:?]
    at com.solacesystems.jcsmp.protocol.nio.impl.SyncEventDispatcherReactor.eventLoop(SyncEventDispatcherReactor.java:157) ~[sol-jcsmp-10.16.0.jar:?]
    at com.solacesystems.jcsmp.protocol.nio.impl.SyncEventDispatcherReactor$SEDReactorThread.run(SyncEventDispatcherReactor.java:338) ~[sol-jcsmp-10.16.0.jar:?]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]

But /actuator/health is still in status up. And the application is forever a zombi. Because neither the queues will be recreated nor the application will going to be killed.

mackenza commented 1 year ago

@GreenRover , apologies for taking so long to address this issue. Did you also log an RT for this? If so, do you know the ticket reference #?

GreenRover commented 1 year ago

Hello Andreaw, no there is no related RT. ReTest is succesfully.

SolaceProducts / solace-spring-cloud

[SOL-79060] Health indicator should capture binding health statuses #145