Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.
MIT License
2.35k stars 1.99k forks source link

ServiceBus Client is not surfacing connection errors #39154

Open youngm opened 8 months ago

youngm commented 8 months ago

Describe the bug If I create a ServiceBusClient 7.15.x with an invalid namespace (or something similar) then when I try to use the client to make a synchronous call I get a Timeout error instead of a ServiceBusException. With 7.14.x I immediately get a ServiceBusException.

Exception or Stack Trace The error I get is:

java.lang.IllegalStateException: Timeout on blocking read for 245600000000 NANOSECONDS
    at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:127)
    at reactor.core.publisher.Mono.block(Mono.java:1805)
    at com.azure.messaging.servicebus.ServiceBusSenderClient.createMessageBatch(ServiceBusSenderClient.java:207)

In the logs but it appears to be ignored by the client:

reactor.core.Exceptions$ErrorCallbackNotImplemented: com.azure.core.amqp.exception.AmqpException: errorContext[NAMESPACE: bogus. ERROR CONTEXT: N/A]
Caused by: com.azure.core.amqp.exception.AmqpException: errorContext[NAMESPACE: bogus. ERROR CONTEXT: N/A]
    at com.azure.core.amqp.implementation.ExceptionUtil.toException(ExceptionUtil.java:85)
    at com.azure.core.amqp.implementation.handler.ConnectionHandler.notifyErrorContext(ConnectionHandler.java:362)
    at com.azure.core.amqp.implementation.handler.ConnectionHandler.onTransportError(ConnectionHandler.java:259)
    at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:191)
    at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
    at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
    at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
    at com.azure.core.amqp.implementation.ReactorExecutor.run(ReactorExecutor.java:91)
    at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
    at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)

To Reproduce Try to use a client with bad connection information like a bad namespace in a synchronous call.

Code Snippet

    var client =
        new ServiceBusClientBuilder()
            .fullyQualifiedNamespace("bogus")
            .credential(new BasicAuthenticationCredential("bogus", "auth"))
            .sender()
            .queueName("bogus-queue")
            .buildClient();
    client.createMessageBatch();

Expected behavior A quick ServiceBusException thrown

github-actions[bot] commented 8 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @EldertGrootenboer.

joshfree commented 8 months ago

@anuchandy please take a look

youngm commented 5 months ago

@anuchandy I tried this with the release in bom 1.2.23 and I experienced the same issue.

youngm commented 5 months ago

Digging a little deeper it seems that the command retries forever until it times out. For some reason, retry count isn't being respected.

youngm commented 3 months ago

bump please @anuchandy?

anuchandy commented 3 months ago

Hi @youngm, thank you for reaching out and I'm sorry for the late response.

To share some background, in 7.14.x and earlier, certain race conditions in connection-processor (the connection cache) didn't converge the downstream connection requests, that result in some (not all) downstream clients to observe the error in connection.

Those same races in the 7.14.x connection processor were affecting clients (sender, receiver, processor etc.) leading to serious recovery failures and incidents. Hence, in version 7.15.x, we revamped connection caching. In the new connection caching, we picked a simple design where it will keep retry on "retriable AMQP-Connection error", without propagating it to downstream clients. While service endpoint not reachable can happen either due to temporary network issues / service temporarily unavailable (more likely) or due to app using non-existent endpoint (less likely), the underlying stack cannot distinguish these two, so become retriable.

There can be many downstream client's AMQP-Links hosted in the shared AMQP-Connection, propagating AMQP-Connection error downstream means all clients backoff, retry requiring more coordination at the expense of more scheduling/timers bringing additional overhead. So, we had to make this trade-off of localizing the AMQP-Connection retriable-error.

Currently, we lack a solid design that address your concerns without adversely affecting common use cases, which were impacted in 7.14.x :(

youngm commented 3 months ago

Thanks for the details response @anuchandy this mostly is a problem for me when I'm testing credentials. Is there a way I can shorten this timeout when I do my connection test? Even if the settings may be sub-optimal for regular use?

anuchandy commented 3 months ago

Hi @youngm, do you mean verifying whether the credential has sufficient permission to perform an operation? If the fully qualified namespace of the Service Bus is valid and we only want to check if credential has certain permission (say "Send" permission) then we should be able do that, for example –

final String namespace = "<a-valid-namespace-name>.servicebus.windows.net";
final String policyName = "policy0"; // Policy with only "Listen" permission NO "Send".
final String key = "<key>" // key for the policy, "policy0" with only "Listen" permission NO "Send".
ServiceBusSenderClient client =
        new ServiceBusClientBuilder()
                .fullyQualifiedNamespace(namespace)
                .credential(new ServiceBusSharedKeyCredential(policyName, key))
                .sender()
                .queueName("queue0")
                .buildClient();
client.createMessageBatch();

This will throw the error -

com.azure.messaging.servicebus.ServiceBusException: Unauthorized access. 'Send' claim(s) are required to perform this operation. Resource: 'sb://<a-valid-namespace-name>.servicebus.windows.net/queue0'. TrackingId:<tracking-id>, Timestamp:<time-stamp>, errorContext[NAMESPACE: .servicebus.windows.net. ERROR CONTEXT: N/A, PATH: queue0, REFERENCE_ID: queue0, LINK_CREDIT: 0]

youngm commented 3 months ago

Thanks @anuchandy I'd also like to be able to validate the namespace. Let me know if you know some way to do that as well.

anuchandy commented 3 months ago

Hi @youngm, we can attempt namespace resolution, for example

try {
    final java.net.InetAddress ignored = java.net.InetAddress.getByName("<non-existing-namespace>.servicebus.windows.net");
} catch (java.net.UnknownHostException e) {
    System.out.println("Host resolution failed: " + e.getMessage());
}
youngm commented 3 months ago

Thanks @anuchandy for confirming that's as good as I can do at this point. I hope you will be able to fix this eventually. Thanks!