[BUG] Retry sending timeout after event hub upgrade recovery

yuhaii commented 3 years ago

Describe the bug We had an Event Hubs service upgrade that ended at 11/12 23:23:37 GMT. During the upgrade, retries worked well. But after the upgrade, from 11/13 6:41 AM to 2:13 PM GMT, the client raised send timeouts like the one below, and many of the timeout exceptions share the same reference ID.

After we restarted the client application, the send timeout issue was mitigated.

I checked the client code (code.zip). It does not specify a retry policy on the EventHubClient, but it configures a Spring Retry mechanism to retry the "publishEvent" function if it raises an exception.

@Override
@TrackExecutionTime("sendDataToEventHub")
@Retryable(value = ServiceException.class, maxAttemptsExpression = "${eh-connection-max-attempts}", backoff = @Backoff(delayExpression = "${eh-connection-retry-delay}",
        multiplierExpression = "${eh-connection-retry-multiplier}", maxDelayExpression = "${eh-connection-retry-max-delay}"))
public void publishEvent(InputHeaders inputHeaders) {
    try {
        EventHubClient eventHubClient = eventHubService.getEventHubClientBasedOnUploadTypeAndSubClientType(inputHeaders.getUploadType().toUpperCase(),
                inputHeaders.getSubClientType().toUpperCase(), inputHeaders.isAppendPayloadToEventMessage());
        eventHubService.sendEvent(eventHubClient, inputHeaders);
    } catch (Exception e) {
        log.error(EVENT_HUB_ERROR_CODE + "-retrycount-" + RetrySynchronizationManager.getContext().getRetryCount());
        throw new ServiceException(EVENT_HUB_ERROR_CODE, e.getMessage(), String.valueOf(HttpStatus.INTERNAL_SERVER_ERROR.value()), EVENT_HUB_ERROR,
                ERROR_SENDING_DATA_EVENTHUB);
    }
}
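For reference, the wait times implied by the backoff attributes above can be worked out by hand. The sketch below only reproduces the arithmetic of an exponential backoff (initial delay, multiplier, capped at a maximum delay); `BackoffSchedule` is a hypothetical helper, not part of Spring Retry, and the numbers in `main` are assumed values, since the real ones come from the `eh-connection-*` properties that are not shown in this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; it is not a Spring Retry class.
class BackoffSchedule {

    // Returns the waits (in ms) between attempts. The attempt count includes the
    // first call, so maxAttempts attempts produce maxAttempts - 1 waits.
    static List<Long> delaysMillis(int maxAttempts, long delay, double multiplier, long maxDelay) {
        List<Long> waits = new ArrayList<>();
        double next = delay;
        for (int attempt = 1; attempt < maxAttempts; attempt++) {
            // Each wait is capped at maxDelay, then the interval grows by the multiplier.
            waits.add((long) Math.min(next, maxDelay));
            next = next * multiplier;
        }
        return waits;
    }

    public static void main(String[] args) {
        // Assumed values: 5 attempts, 1 s initial delay, multiplier 2, 5 s cap.
        System.out.println(delaysMillis(5, 1000, 2.0, 5000)); // prints [1000, 2000, 4000, 5000]
    }
}
```

Note that these waits come on top of the time each individual attempt spends inside the client before it throws.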

The event hub client is created from AAD auth.

    return EventHubClient.createWithAzureActiveDirectory(namespace, eventHubName, authCallback,
            authority, executorService, null).get();

Is it possible that restarting the client application cleared some cached data and created a new EventHubClient from AAD auth? Maybe it reset the ScheduledThreadPoolExecutor? If no retry policy is specified on the EventHubClient, it uses the default retry policy, so there are two retry mechanisms in this setup. Do they conflict with each other?

This is similar to the issue that we faced with blob client libraries after an AD outage. The blob storage recovered while the client library continued to fail requests due to some internal caching. Please help us fix this issue.
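If a fresh client is what the restart provided, one way to approximate that without restarting the process is to drop the cached client after repeated failures and build a new one. The following is a minimal sketch in plain Java under that assumption. `RecreatingHolder`, the failure threshold, and the factory and closer callbacks are all hypothetical names, and nothing in this thread confirms that recreating the EventHubClient would have cleared this particular problem.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical wrapper: caches a client and discards it after N consecutive failures,
// so the next caller gets a newly built instance.
class RecreatingHolder<C> {
    private final Supplier<C> factory;   // assumed: builds a new client (e.g. the AAD-based creation)
    private final Consumer<C> closer;    // assumed: releases the old client
    private final int failureThreshold;
    private C client;
    private int consecutiveFailures;

    RecreatingHolder(Supplier<C> factory, Consumer<C> closer, int failureThreshold) {
        this.factory = factory;
        this.closer = closer;
        this.failureThreshold = failureThreshold;
    }

    // Lazily creates the client on first use or after it was discarded.
    synchronized C get() {
        if (client == null) {
            client = factory.get();
        }
        return client;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    // Counts a failed send; at the threshold the cached client is closed and dropped.
    synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold && client != null) {
            closer.accept(client);
            client = null;
            consecutiveFailures = 0;
        }
    }

    public static void main(String[] args) {
        AtomicInteger created = new AtomicInteger();
        RecreatingHolder<String> holder = new RecreatingHolder<>(
                () -> "client-" + created.incrementAndGet(),
                c -> System.out.println("closing " + c),
                3);
        System.out.println(holder.get());
        for (int i = 0; i < 3; i++) {
            holder.recordFailure();
        }
        System.out.println(holder.get()); // a new instance after three failures
    }
}
```

In the code from this issue, `recordFailure()` would be called from the catch block of `publishEvent` and `recordSuccess()` after a successful send; the threshold should be high enough that a single transient timeout does not trigger a rebuild.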

Exception or Stack Trace I attached the error log code.zip for reference.

Entity(clickstream): Send operation timed out at 2020-11-13T14:13:00.180038Z[GMT]., errorContext[NS: intake-hub.servicebus.windows.net, PATH: clickstream, REFERENCE_ID: 937775f0e3fd4af585bbeb6e77b4db21_G18, LINKCREDIT: 0]

To Reproduce Steps to reproduce the behavior:

A service upgrade has to occur on the Event Hubs service side.
After the service upgrade completes, the code above raises send timeout exceptions.

Code Snippet See description section

Expected behavior Spring Retry should have made the send succeed, since Event Hubs was working well around the time of the issue.

Setup (please complete the following information):

OS: Azure Kubernetes Service. Cloud role name is “intake-service”. “scus-saprd-aks-cluster01” is the AKS cluster name under resource group “scus-saie-datahub-prd-rg” and subscription “AzD1P-SaDts-Sx01”
IDE: Java, Spring, Maven
Version of the library used: com.microsoft.azure azure-eventhubs 3.2.0

joshfree commented 3 years ago

@srnagar could you please assist?

yuhaii commented 3 years ago

Hello @srnagar, please feel free to let me know if you need any information or have any questions. Thank you.

rui-ss commented 3 years ago

Hi @srnagar, could you provide an update on this issue? We have been waiting for a long time. Thanks!

srnagar commented 3 years ago

@yuhaii and @rui-ss - this is using the older version of the Event Hubs library (3.2.0). We have a newer version of the Event Hubs library, azure-messaging-eventhubs (5.3.1), that you may want to upgrade to.

For the issue with 3.2.0 version, @JamesBirdsall may be able to provide some guidance.

JamesBirdsall commented 3 years ago

At this point it's hard to say what happened -- the continued timeouts could have been an issue on the service side. We do service upgrades routinely. Have you continued to see this behavior after more recent upgrades?

As far as stacking the built-in retry mechanism with the Spring retry mechanism, that's not a problem. The client's retry mechanism is internal: the client keeps retrying without returning (or throwing) to the caller until it succeeds, reaches the operation timeout, or gets a nonretryable error. Only at that point does the Spring retry mechanism come into play, so it is effectively a second retry loop wrapped around the first one, which runs inside the client.
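To make the stacking concrete, here is a toy model of the two loops in plain Java. None of the names come from the Event Hubs client or Spring Retry: `innerSend` stands in for the client's internal retry, with an attempt budget used as a stand-in for the operation timeout, and `outerSend` stands in for the Spring retry around `publishEvent`. Nonretryable errors are left out of this sketch.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Toy model of two stacked retry loops; all names here are hypothetical.
class NestedRetryDemo {

    // Inner loop: retries a single send until it succeeds or the budget
    // (a stand-in for the operation timeout) is used up.
    static boolean innerSend(BooleanSupplier wireSend, int innerBudget) {
        for (int i = 0; i < innerBudget; i++) {
            if (wireSend.getAsBoolean()) {
                return true;
            }
        }
        return false; // a real client would throw a timeout exception to the caller here
    }

    // Outer loop: sees only the final outcome of each inner call.
    // Returns the outer attempt number that succeeded, or -1 if all attempts failed.
    static int outerSend(BooleanSupplier wireSend, int innerBudget, int outerAttempts) {
        for (int attempt = 1; attempt <= outerAttempts; attempt++) {
            if (innerSend(wireSend, innerBudget)) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        AtomicInteger wireSends = new AtomicInteger();
        // Assumed scenario: the service starts accepting sends on the 8th try.
        int attempt = outerSend(() -> wireSends.incrementAndGet() >= 8, 3, 4);
        // prints "succeeded on outer attempt 3 after 8 wire sends"
        System.out.println("succeeded on outer attempt " + attempt + " after " + wireSends.get() + " wire sends");
    }
}
```

The two loops multiply rather than conflict: the worst case is the outer attempt count times the inner budget, which is worth keeping in mind when sizing the Spring retry settings against the client's operation timeout.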

ghost commented 3 years ago

Hi @yuhaii. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

Azure / azure-sdk-for-java

[BUG] Retry sending timeout after event hub upgrade recovery #17645