Refinitiv / Real-Time-SDK

Reactor shutdown for dead channel #46

Closed sam-truscott closed 6 years ago

sam-truscott commented 6 years ago

Hi,

We're using Elektron-SDK 1.1.1. We had a channel down event, and the ChannelGroup contained a box that happened to be down; in fact, a few boxes were down. At startup the dead box is skipped and a good box is found.

However, after that good box had an issue, the SDK moved through the ChannelGroup and found another dead box, at which point the Reactor got a failure and the library started throwing exceptions.

Here are the logs - some details (e.g. name) have been removed.

// by this point we were having a resync - was on channel 3 or 4
StatusMsg
    streamId="76433"
    domain="MarketPrice Domain"
    state="Open / Suspect / None / 'Service not available'"
    name="----"
    serviceName="ELEKTRON_AD"
StatusMsgEnd

// channel goes down or is dead
loggerMsg
    ClientName: ChannelCallbackClient
    Severity: Warning
    Text:    Received ChannelDownReconnecting event on channel Channel_5
    RsslReactor Channel is null
    Error Id 0
    Internal sysError 0
    Error Location Reactor.processWorkerEvent
    Error text Reconnection failed: java.nio.channels.UnresolvedAddressException
loggerMsgEnd

// [assumption] caused by exception above
loggerMsg
    ClientName: ChannelCallbackClient
    Severity: Error
    Text:    Received ChannelDown event on channel Channel_5
    Instance Name Consumer_1
    RsslReactor Channel is null
    Error Id -1
    Internal sysError 0
    Error Location WlItemHandler.dispatch
    Error text ReactorCallbackReturnCodes.FAILURE was returned from defaultMsgCallback(). This caused the Reactor to shutdown.
loggerMsgEnd

// Channel goes down
StatusMsg
    streamId="1"
    domain="Login Domain"
    state="Closed / Suspect / None / 'channel closed'"
    name="-----" // contained the username
    nameType="1"
StatusMsgEnd

loggerMsg
    ClientName: SingleItem
    Severity: Error
    Text:    Internal error: ReactorChannel.submit() failed in SingleItem.submit(CloseMsg)RsslChannel 0
    Error Id -1
    Internal sysError 0
    Error Location ReactorChannel.submit
    Error Text Reactor is shutdown, submit aborted.
loggerMsgEnd

This is followed by unregister/register messages throwing the following exception:

com.thomsonreuters.ema.access.OmmInvalidUsageExceptionImpl: Failed to close item request. Reason: ReactorReturnCodes.FAILURE. Error text: Reactor is shutdown, submit aborted.
    at com.thomsonreuters.ema.access.OmmBaseImpl.ommIUExcept(OmmBaseImpl.java:1172)
    at com.thomsonreuters.ema.access.OmmConsumerImpl.handleInvalidUsage(OmmConsumerImpl.java:419)
    at com.thomsonreuters.ema.access.SingleItem.rsslSubmit(ItemCallbackClient.java:3018)
    at com.thomsonreuters.ema.access.SingleItem.close(ItemCallbackClient.java:2845)
    at com.thomsonreuters.ema.access.ItemCallbackClient.unregister(ItemCallbackClient.java:2247)
    at com.thomsonreuters.ema.access.OmmBaseImpl.unregister(OmmBaseImpl.java:428)
    at com.thomsonreuters.ema.access.OmmConsumerImpl.unregister(OmmConsumerImpl.java:150)
    ....

My suspicion is that a 'java.nio.channels.UnresolvedAddressException' thrown at any point other than startup is what causes the problem.
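For context, java.nio.channels.UnresolvedAddressException is what SocketChannel.connect() throws when it is handed a socket address whose host name was never resolved. Here is a minimal standalone sketch of that behaviour, independent of the SDK. The class name, host name and port are placeholders made up for illustration, not values from our configuration:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;
import java.nio.channels.UnresolvedAddressException;

public class UnresolvedAddressDemo {

    // Returns true if connecting to the given address fails with
    // UnresolvedAddressException, i.e. the host name was never resolved.
    static boolean failsAsUnresolved(InetSocketAddress address) throws IOException {
        try (SocketChannel channel = SocketChannel.open()) {
            channel.connect(address);
            return false;
        } catch (UnresolvedAddressException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // createUnresolved() skips the DNS lookup, which mimics a box whose
        // host name cannot be resolved (placeholder host and port).
        InetSocketAddress dead = InetSocketAddress.createUnresolved("dead-box.invalid", 14002);
        System.out.println("isUnresolved = " + dead.isUnresolved());
        System.out.println("connect failed as unresolved = " + failsAsUnresolved(dead));
    }
}
```

This only shows where the exception comes from at the java.nio level; it says nothing about how the Reactor reacts to it.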

Sam

bberner commented 6 years ago

We have created a JIRA issue for this and are trying to reproduce. A fix will be in a future release.

bberner commented 6 years ago

We've been unable to reproduce this issue with either the ESDK 1.1.1 release or the ESDK 1.2.0 release. We're assuming that since the exception happened with unregister(), you're calling unregister(). The exception simply means that the consumer is being uninitialized when unregister() is called. Can you still reproduce this issue with ESDK 1.2.0? If so, can you give us very detailed instructions for reproducing?

sam-truscott commented 6 years ago

Hi Bill, at the time of the call to unregister() we weren't calling uninitialize(). The SDK was probably re-initialising itself because of the dead channel, and we have no way to synchronise on any part of the SDK to avoid the problem.

What should a (API) caller do to avoid this problem?

bberner commented 6 years ago

It's a serious internal error if the SDK is re-initializing itself. So, any problems like these need to be addressed and there's no way for the API caller to recover. We've fixed many bugs between 1.1.1 and 1.2.0 so it's possible it's already fixed. But since we haven't been able to reproduce, we cannot know for sure. Can you still reproduce this issue with ESDK 1.2.0? If so, can you give us very detailed instructions for reproducing?

bberner commented 6 years ago

We've spent some time trying to reproduce this with a ChannelSet of 4 channels. Two of these channels (1 and 3) hit the UnresolvedAddressException and the other two (2 and 4) are good. The test starts with channel 2 up. The EMA consumer gets UnresolvedAddressException for channel 1, then reconnects to channel 2 and gets data. Then channel 2 is disconnected, and the EMA consumer tries to reconnect to channel 3 and gets UnresolvedAddressException. After that, it reconnects to channel 4 and gets data. We never see the "ReactorCallbackReturnCodes.FAILURE was returned from defaultMsgCallback" error that caused the unregister() issue. Please let us know whether this is a correct test to reproduce the issue.
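The order in which the test expects channels to be tried can be modelled roughly as follows. This is only a sketch of the test sequence under stated assumptions: it does not use the SDK, the class and method names (ChannelSetFailoverSketch, nextUsable, testChannelSet) and the port are hypothetical, and the round-robin walk is an assumption about the reconnect order rather than a description of the Reactor's actual logic.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

public class ChannelSetFailoverSketch {

    // Walks the channel list round-robin, starting after 'current', and returns
    // the index of the next channel whose address is resolved, or -1 if none is.
    static int nextUsable(List<InetSocketAddress> channels, int current) {
        for (int step = 1; step <= channels.size(); step++) {
            int candidate = (current + step) % channels.size();
            if (!channels.get(candidate).isUnresolved()) {
                return candidate;
            }
        }
        return -1;
    }

    // Builds the 4-channel set from the test: channels 1 and 3 (indexes 0 and 2)
    // are unresolvable, channels 2 and 4 (indexes 1 and 3) are good.
    static List<InetSocketAddress> testChannelSet() {
        List<InetSocketAddress> channels = new ArrayList<>();
        InetAddress loopback = InetAddress.getLoopbackAddress();
        channels.add(InetSocketAddress.createUnresolved("dead-1.invalid", 14002));
        channels.add(new InetSocketAddress(loopback, 14002));
        channels.add(InetSocketAddress.createUnresolved("dead-3.invalid", 14002));
        channels.add(new InetSocketAddress(loopback, 14002));
        return channels;
    }

    public static void main(String[] args) {
        List<InetSocketAddress> channels = testChannelSet();
        // Startup: nothing connected yet, so the walk starts before index 0.
        int first = nextUsable(channels, -1);
        System.out.println("startup connects to channel " + (first + 1));
        // The connected channel goes down, so the walk continues from it.
        int second = nextUsable(channels, first);
        System.out.println("after disconnect, reconnects to channel " + (second + 1));
    }
}
```

In this model the consumer skips channel 1, lands on channel 2, and after channel 2 drops it skips channel 3 and lands on channel 4, which matches the sequence described in the test.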