apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0
14.23k stars 3.58k forks source link

A host can't connect and won't try other hosts again in multi-hosts service url configuration #4510

Open codelipenghui opened 5 years ago

codelipenghui commented 5 years ago

Describe the bug

Use "pulsar://host1:6650,host2:6650,127.0.0.1:6650" as service url to create a pulsar client, "host1" and "host2" is the wrong hosts, not every time I can successfully connect to "127.0.0.1:6650".

To Reproduce

public void testRetryWithMultiHostServiceUrl() throws PulsarClientException {
        String serviceUrl = "pulsar://host1:6650,host2:6650,127.0.0.1:" + BROKER_PORT;
        PulsarClient client = PulsarClient.builder().serviceUrl(serviceUrl).build();
        Producer producer = client.newProducer().topic("persistent://my-property/my-ns/multi-host").create();
        assertNotNull(producer);
    }

Sometimes will get "java.net.UnknownHostException" and sometimes will connect success. Here is the stack trace while create producer failed:

org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: java.net.UnknownHostException: failed to resolve 'host1' after 6 queries 

    at org.apache.pulsar.client.impl.ConnectionPool.lambda$null$9(ConnectionPool.java:202)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute$$$capture(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:474)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: java.net.UnknownHostException: failed to resolve 'host1' after 6 queries 
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
    at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.pulsar.client.impl.ConnectionPool.lambda$resolveName$16(ConnectionPool.java:259)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:485)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
    at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:856)
    at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:809)
    at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:332)
    at io.netty.resolver.dns.DnsResolveContext.onResponse(DnsResolveContext.java:495)
    at io.netty.resolver.dns.DnsResolveContext.access$400(DnsResolveContext.java:62)
    at io.netty.resolver.dns.DnsResolveContext$3.operationComplete(DnsResolveContext.java:376)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
    at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:103)
    at io.netty.resolver.dns.DnsQueryContext.setSuccess(DnsQueryContext.java:196)
    at io.netty.resolver.dns.DnsQueryContext.finish(DnsQueryContext.java:188)
    at io.netty.resolver.dns.DnsNameResolver$DnsResponseHandler.channelRead(DnsNameResolver.java:1149)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
    at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:93)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
    ... 3 more
Caused by: java.net.UnknownHostException: failed to resolve 'host1' after 6 queries 
    at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:848)
    ... 32 more

Expected behavior

Pulsar client should try to connect to each hosts, only all the hosts can't connect successfully return the error message to users.

jahidakhtargit commented 5 years ago

Same issue I'm facing:

2019-07-30 18:24:11.813 INFO 20248 --- [r-client-io-1-1] o.a.p.c.impl.ConsumerStatsRecorderImpl : Starting Pulsar consumer perf with config: { "topicNames" : [ "device-twin-topic" ], "topicsPattern" : null, "subscriptionName" : "device-twin-subscription", "subscriptionType" : "Shared", "receiverQueueSize" : 1000, "acknowledgementsGroupTimeMicros" : 100000, "negativeAckRedeliveryDelayMicros" : 60000000, "maxTotalReceiverQueueSizeAcrossPartitions" : 50000, "consumerName" : null, "ackTimeoutMillis" : 0, "tickDurationMillis" : 1000, "priorityLevel" : 0, "cryptoFailureAction" : "FAIL", "properties" : { }, "readCompacted" : false, "subscriptionInitialPosition" : "Latest", "patternAutoDiscoveryPeriod" : 1, "regexSubscriptionMode" : "PersistentOnly", "deadLetterPolicy" : null, "autoUpdatePartitions" : true, "replicateSubscriptionState" : false, "resetIncludeHead" : false } 2019-07-30 18:24:11.818 INFO 20248 --- [r-client-io-1-1] o.a.p.c.impl.ConsumerStatsRecorderImpl : Pulsar client config: { "serviceUrl" : "http://localhost:8087", "authPluginClassName" : null, "authParams" : null, "operationTimeoutMs" : 30000, "statsIntervalSeconds" : 60, "numIoThreads" : 1, "numListenerThreads" : 1, "connectionsPerBroker" : 1, "useTcpNoDelay" : true, "useTls" : false, "tlsTrustCertsFilePath" : "", "tlsAllowInsecureConnection" : false, "tlsHostnameVerificationEnable" : false, "concurrentLookupRequest" : 5000, "maxLookupRequest" : 50000, "maxNumberOfRejectedRequestPerConnection" : 50, "keepAliveIntervalSeconds" : 30, "connectionTimeoutMs" : 10000, "requestTimeoutMs" : 60000, "defaultBackoffIntervalNanos" : 100000000, "maxBackoffIntervalNanos" : 30000000000 } 2019-07-30 18:24:13.018 WARN 20248 --- [r-client-io-1-1] o.a.pulsar.client.impl.ConnectionPool : Failed to open connection to 20b59f0ed6eb:6650 : java.net.UnknownHostException: failed to resolve '20b59f0ed6eb' after 7 queries 2019-07-30 18:24:13.021 WARN 20248 --- [r-client-io-1-1] o.a.p.client.impl.ConnectionHandler : [device-twin-topic] [device-twin-subscription] Error connecting to broker: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: java.net.UnknownHostException: failed to resolve '20b59f0ed6eb' after 7 queries 2019-07-30 18:24:13.021 WARN 20248 --- [r-client-io-1-1] o.a.p.client.impl.ConnectionHandler : [device-twin-topic] [device-twin-subscription] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: java.net.UnknownHostException: failed to resolve '20b59f0ed6eb' after 7 queries -- Will try again in 0.1 s 2019-07-30 18:24:13.122 INFO 20248 --- [ulsar-timer-6-1] o.a.p.client.impl.ConnectionHandler : [device-twin-topic] [device-twin-subscription] Reconnecting after connection was closed 2019-07-30 18:24:14.266 WARN 20248 --- [r-client-io-1-1] o.a.pulsar.client.impl.ConnectionPool : Failed to open connection to 20b59f0ed6eb:6650 : java.net.UnknownHostException: failed to resolve '20b59f0ed6eb' after 7 queries 2019-07-30 18:24:14.266 WARN 20248 --- [r-client-io-1-1] o.a.p.client.impl.ConnectionHandler : [device-twin-topic] [device-twin-subscription] Error connecting to broker: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: java.net.UnknownHostException: failed to resolve '20b59f0ed6eb' after 7 queries 2019-07-30 18:24:14.266 WARN 20248 --- [r-client-io-1-1] o.a.p.client.impl.ConnectionHandler : [device-twin-topic] [device-twin-subscription] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: java.net.UnknownHostException: failed to resolve '20b59f0ed6eb' after 7 queries -- Will try again in 0.186 s 2019-07-30 18:24:14.453 INFO 20248 --- [ulsar-timer-6-1] o.a.p.client.impl.ConnectionHandler : [device-twin-topic] [device-twin-subscription] Reconnecting after connection was closed 2019-07-30 18:24:15.637 WARN 20248 --- [r-client-io-1-1] o.a.pulsar.client.impl.ConnectionPool : Failed to open connection to 20b59f0ed6eb:6650 : java.net.UnknownHostException: failed to resolve '20b59f0ed6eb' after 7 queries 2019-07-30 18:24:15.637 WARN 20248 --- [r-client-io-1-1] o.a.p.client.impl.ConnectionHandler : [device-twin-topic] [device-twin-subscription] Error connecting to broker: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: java.net.UnknownHostException: failed to resolve '20b59f0ed6eb' after 7 queries 2019-07-30 18:24:15.637 WARN 20248 --- [r-client-io-1-1] o.a.p.client.impl.ConnectionHandler : [device-twin-topic] [device-twin-subscription] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: java.net.UnknownHostException: failed to resolve '20b59f0ed6eb' after 7 queries -- Will try again in 0.4 s 2019-07-30 18:24:16.037 INFO 20248 --- [ulsar-timer-6-1] o.a.p.client.impl.ConnectionHandler : [device-twin-topic] [device-twin-subscription] Reconnecting after connection was closed 2019-07-30 18:24:17.276 WARN 20248 --- [r-client-io-1-1] o.a.pulsar.client.impl.ConnectionPool : Failed to open connection to 20b59f0ed6eb:6650 : java.net.UnknownHostException: failed to resolve '20b59f0ed6eb' after 7 queries 2019-07-30 18:24:17.276 WARN 20248 --- [r-client-io-1-1] o.a.p.client.impl.ConnectionHandler : [device-twin-topic] [device-twin-subscription] Error connecting to broker: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: java.net.UnknownHostException: failed to resolve '20b59f0ed6eb' after 7 queries 2019-07-30 18:24:17.276 WARN 20248 --- [r-client-io-1-1] o.a.p.client.impl.ConnectionHandler : [device-twin-topic] [device-twin-subscription] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: java.net.UnknownHostException: failed to resolve '20b59f0ed6eb' after 7 queries -- Will try again in 0.76 s 2019-07-30 18:24:18.037 INFO 20248 --- [ulsar-timer-6-1] o.a.p.client.impl.ConnectionHandler : [device-twin-topic] [device-twin-subscription] Reconnecting after connection was closed 2019-07-30 18:24:19.224 WARN 20248 --- [r-client-io-1-1] o.a.pulsar.client.impl.ConnectionPool : Failed to open connection to 20b59f0ed6eb:6650 : java.net.UnknownHostException: failed to resolve '20b59f0ed6eb' after 7 queries

Ranganaths commented 5 years ago

i am having a same problem- I am Just trying to start pulsar using bin/pulsar standalone on mac book, It throws up unknownhostexception for my mac host- any fix for this is appreciated..

yuanjingshi commented 5 years ago

I have the same problem. I have a pulsar cluster with 3 brokers, no DNS and no proxy. My producer is using multi-hosts service url to connect to the cluster, one broker is down and it doesn't try to connect to the other 2 brokers that are live. When will it be fixed?

jiazhai commented 4 years ago

I would like to contribute to this issue.

jiazhai commented 4 years ago

Thanks for @kaynewu for working on this issue

tisonkun commented 1 year ago

Here is a possible fix https://github.com/apache/pulsar/pull/18838.

Since our tests depend heavily on the unresolved logic, I write a best effort solution so that prioritize the reachable hosts in candidates.

batilak commented 1 year ago

Hi @tisonkun : can you please specify in which version of Pulsar this fix is avilable ? I dont see that in release notes , thanks

tisonkun commented 1 year ago

@batilak The PR isn't merged so there is no fixed version available yet. You may test it out and see if we can nudge it forward.