Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.
MIT License
2.36k stars 2k forks source link

[BUG] EventHub's retry method always adds 4 seconds to a configured delay #38596

Open adriannowak opened 10 months ago

adriannowak commented 10 months ago

Describe the bug EventHub SDK fails to deliver retry messages at configured rate. Seems it just ignores the delay which is set in AmqpRetryOptions.getDelay() and by some reason adds extra time SERVER_BUSY_WAIT_TIME. This property is not configurable, and is always 4 seconds. In a consequence SDK delivers messages slower than it is expected.

Exception or Stack Trace Add the exception log and stack trace if available N/A

To Reproduce Steps to reproduce the behavior:

The following unit test should pass after at most 5 seconds (1 failure + 2 retries + timeout 3x500). With the latest version of SDK following code takes 16 seconds.

    @Test
    @Timeout(value = 5)
    void withRetryMono() {
        // Arrange
        final String timeoutMessage = "Operation timed out.";
        final AmqpRetryOptions options = new AmqpRetryOptions()
            .setMaxRetries(2)
            .setTryTimeout(Duration.ofMillis(500))
            .setMode(AmqpRetryMode.FIXED);

        final AtomicInteger resubscribe = new AtomicInteger();
        final Mono<AmqpTransportType> neverFlux = TestPublisher.<AmqpTransportType>create().mono()
            .doOnSubscribe(s -> resubscribe.incrementAndGet());

        StepVerifier.create(RetryUtil.withRetry(neverFlux, options, timeoutMessage))
            .expectSubscription()
            .expectErrorSatisfies(error -> assertTrue(error.getCause() instanceof TimeoutException))
            .verify();

        assertEquals(options.getMaxRetries() + 1, resubscribe.get());
    }

Code Snippet Add the code snippet that causes the issue.

This breaking change was introduced in this PR

https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/core/azure-core-amqp/src/main/java/com/azure/core/amqp/implementation/RetryUtil.java#L109

        final Duration delay = options.getDelay().plus(SERVER_BUSY_WAIT_TIME);

Expected behavior A clear and concise description of what you expected to happen.

SDK retries the delivery at a configured pace.

        final Duration delay = options.getDelay();

Information Checklist Kindly make sure that you have added all the following information above and checkoff the required fields otherwise we will treat the issuer as an incomplete report

joshfree commented 10 months ago

Thanks for filing this issue, @adriannowak. @conniey could you take a look at this regression?

/cc @Azure/azsdk-sb-java

adriannowak commented 9 months ago

Hey @joshfree @conniey - Could you please give any update on the issue? When will be the fix available?

conniey commented 8 months ago

Thanks for reporting this. The SERVER_BUSY_WAIT_TIME should only be applied in cases where the exception is a "ServerBusyException" to align with behaviour in the legacy library.

That PR was part of a retry policy cleanup in 2021.

anuchandy commented 8 months ago

Note: given the environment is experiencing server-busy throttling from the EH service, the overall slowness may not change much since the SDK (after any fix) is still has to use back-off on server-busy.

The SDK requirement to apply back-off on server-busy is not a regression, this is a recommended pattern, which exists in all generations of Event Hubs library (including track1 3.4.x).