EventStore / EventStoreDB-Client-Java

Official Asynchronous Java 8+ Client Library for EventStoreDB 20.6+
https://eventstore.com
Apache License 2.0

Should DNS discovery use all IPs in a multi-address DNS name as cluster seeds? #273

Open lbodor opened 2 months ago

lbodor commented 2 months ago

My experience is that DNS discovery fails when 1 node out of 3 is down, and discovery spends all of maxDiscoverAttempts trying to get gossip from the node that is down, instead of also considering the other 2 nodes' IPs registered with a multi-address DNS name.

I was able to implement the behaviour I expect like this:

# ClusterDiscovery.java:

    void discover(ConnectionState state) {
-       List<InetSocketAddress> candidates = new ArrayList<>(this.seeds);
+       List<InetSocketAddress> candidates = new ArrayList<>();
+
+       if (state.getSettings().isDnsDiscover()) {
+           try {
+               InetSocketAddress dnsSeed = this.seeds.get(0);
+
+               // Resolve cluster DNS name
+               candidates = Arrays.stream(InetAddress.getAllByName(dnsSeed.getHostName()))
+                   .map(addr -> new InetSocketAddress(addr, dnsSeed.getPort()))
+                   .collect(Collectors.toList());
+                  
+           } catch (UnknownHostException e) {
+               throw new UncheckedIOException(e);
+           }
+       } else {
+           candidates = new ArrayList<>(this.seeds);
+       }

        if (candidates.size() > 1) {

I'm not sure, however, whether you'd prefer to delegate this behaviour to the gRPC client, since it's the gRPC client that currently does the lookup.

YoEight commented 2 months ago

Hey @lbodor,

If what you describe is true then that's a bug on our part. You shouldn't have to do all this. Let me get back to you after I conduct some investigation.

Thanks for taking the time to reach out.

YoEight commented 1 month ago

Hey @lbodor

I investigated the matter and I'd like you to confirm a few things first. Did you set your connection string with esdb+discover://? Or, if you use the builder configuration, did you set dnsDiscover(true) and submit more than one endpoint/seed as well?

We used to support DNS A queries a long time ago, when there were only TCP clients. We stopped doing it because not everybody is able to configure DNS properly. A suggestion in your case would be to register all your nodes in your DNS like you did, but to have your DNS pick a node randomly or round-robin when the main domain is queried.

lbodor commented 1 month ago

Thanks for getting back to me. Here is how I connect.

EventStoreDBClient.create(
     EventStoreDBClientSettings.builder()
        .dnsDiscover(true)
        .addHost(hostname, 443)
        .tls(true)
        .buildConnectionSettings()
);

Since DNS discovery is true and hostname resolves to 3 IP addresses, I'd expect all 3 to be used as gossip seeds. This is documented for cluster-side node discovery (https://developers.eventstore.com/server/v24.2/cluster.html#cluster-with-dns), and it seems fair for it to also work in the client.

Thanks for suggesting round-robin, but I think it would be generally unreliable, since it would require a short TTL, which recursive DNS servers can ignore if they consider it too short. I have tried it in Route53, and for the same DNS configuration I'm getting very different results between running the client at work and at home, probably due to different caching behaviour of the recursive DNS servers between the client and the authoritative DNS. It would work if users were to configure DNS resolution on the client to go straight to the authoritative DNS server.

What is the downside of applying something like the above patch to ClusterDiscovery.java? If multivalue records are meant to work during cluster-side node discovery, then that is for the benefit of users who have already managed to configure DNS correctly. It seems a smaller requirement of users' DNS skills, than getting round-robin, TTL, discovery timeouts, and path to authoritative DNS all to line up.
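For readers who want to see the idea in isolation, here is a minimal standalone sketch of expanding one DNS name into one candidate address per resolved IP. It is not the client's code: the class name DnsSeedResolver and the method resolveAll are hypothetical, and the shuffle is an extra assumption that the patch above does not include. Only InetAddress.getAllByName and InetSocketAddress come from the JDK.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the EventStoreDB client.
final class DnsSeedResolver {

    /** Returns one socket address per IP that the given name resolves to. */
    static List<InetSocketAddress> resolveAll(String hostname, int port) {
        try {
            List<InetSocketAddress> seeds = new ArrayList<>();
            for (InetAddress addr : InetAddress.getAllByName(hostname)) {
                seeds.add(new InetSocketAddress(addr, port));
            }
            // Assumption: shuffle so that an address listed first by the
            // resolver is not always the first one tried.
            Collections.shuffle(seeds);
            return seeds;
        } catch (UnknownHostException e) {
            // Same wrapping as in the patch above.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        for (InetSocketAddress seed : resolveAll(host, 443)) {
            System.out.println(seed);
        }
    }
}
```

Each returned address could then be treated as a separate gossip seed. Whether connecting by IP passes TLS hostname verification depends on the certificate setup, so that part would need checking in a real deployment.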

YoEight commented 3 weeks ago

Hey @lbodor

> Since DNS discovery is true and hostname resolves to 3 IP addresses, I'd expect all 3 to be used as gossip seeds. This is documented for cluster-side node discovery (https://developers.eventstore.com/server/v24.2/cluster.html#cluster-with-dns), and it seems fair for it to also work in the client.

And that should be the case. Could you provide some logs showing the client not going for other members of the cluster if it fails to connect to the first seed? By logs, I mean those emitted by the Java client.

lbodor commented 3 weeks ago

$ host nodes.dev-xxx
nodes.dev-xxx has address 13.55.106.44
nodes.dev-xxx has address 3.104.208.149
nodes.dev-xxx has address 52.64.13.213

Node 13.55.106.44 is down, the others are up. The issue is present when DNS resolves with 13.55.106.44 as the first item in the list.

2024-07-08 10:39:35,963 [DEBUG] [] [] [] [esdb-client-4ff4b4e0-38a4-4030-a0ef-be30fde11ae6] [com.eventstore.dbclient.ConnectionService] Start connection attempt (1/3)
2024-07-08 10:39:35,963 [DEBUG] [] [] [] [ForkJoinPool.commonPool-worker-1] [com.eventstore.dbclient.ClusterDiscovery] Using seed node [nodes.dev-xxx/13.55.106.44:443] for cluster node discovery.
2024-07-08 10:39:38,054 [ERROR] [] [] [] [ForkJoinPool.commonPool-worker-1] [com.eventstore.dbclient.ClusterDiscovery] java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 1.941476417s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[buffered_nanos=1944060611, waiting_for_connection]]]
    at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096)
    at com.eventstore.dbclient.ClusterDiscovery.discover(ClusterDiscovery.java:60)
    at com.eventstore.dbclient.ClusterDiscovery.lambda$run$2(ClusterDiscovery.java:42)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1796)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:507)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1491)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:2073)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2035)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:187)
Caused by: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 1.941476417s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[buffered_nanos=1944060611, waiting_for_connection]]]
    at io.grpc.Status.asRuntimeException(Status.java:533)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:481)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:574)
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:72)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1570)
Exception during the node selection process
2024-07-08 10:39:38,059 [ERROR] [] [] [] [esdb-client-4ff4b4e0-38a4-4030-a0ef-be30fde11ae6] [com.eventstore.dbclient.ConnectionService] java.util.concurrent.ExecutionException: com.eventstore.dbclient.NoClusterNodeFoundException
    at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
    at com.eventstore.dbclient.ConnectionService.createChannel(ConnectionService.java:130)
    at com.eventstore.dbclient.ConnectionService.process(ConnectionService.java:170)
    at com.eventstore.dbclient.RunWorkItem.accept(RunWorkItem.java:30)
    at com.eventstore.dbclient.ConnectionService.run(ConnectionService.java:46)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: com.eventstore.dbclient.NoClusterNodeFoundException: null
    at com.eventstore.dbclient.ClusterDiscovery.discover(ClusterDiscovery.java:79)
    at com.eventstore.dbclient.ClusterDiscovery.lambda$run$2(ClusterDiscovery.java:42)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1796)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:507)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1491)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:2073)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2035)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:187)
Error when running discovery process
2024-07-08 10:39:38,060 [DEBUG] [] [] [] [esdb-client-4ff4b4e0-38a4-4030-a0ef-be30fde11ae6] [com.eventstore.dbclient.ConnectionService] Start connection attempt (2/3)
2024-07-08 10:39:38,061 [DEBUG] [] [] [] [ForkJoinPool.commonPool-worker-1] [com.eventstore.dbclient.ClusterDiscovery] Using seed node [nodes.dev-xxx/13.55.106.44:443] for cluster node discovery.
2024-07-08 10:39:40,071 [ERROR] [] [] [] [ForkJoinPool.commonPool-worker-1] [com.eventstore.dbclient.ClusterDiscovery] java.util.concurrent.TimeoutException: null
    at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
    at com.eventstore.dbclient.ClusterDiscovery.discover(ClusterDiscovery.java:60)
    at com.eventstore.dbclient.ClusterDiscovery.lambda$run$2(ClusterDiscovery.java:42)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1796)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:507)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1491)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:2073)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2035)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:187)
Exception during the node selection process
2024-07-08 10:39:40,072 [ERROR] [] [] [] [esdb-client-4ff4b4e0-38a4-4030-a0ef-be30fde11ae6] [com.eventstore.dbclient.ConnectionService] java.util.concurrent.ExecutionException: com.eventstore.dbclient.NoClusterNodeFoundException
    at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
    at com.eventstore.dbclient.ConnectionService.createChannel(ConnectionService.java:130)
    at com.eventstore.dbclient.ConnectionService.process(ConnectionService.java:170)
    at com.eventstore.dbclient.RunWorkItem.accept(RunWorkItem.java:30)
    at com.eventstore.dbclient.ConnectionService.run(ConnectionService.java:46)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: com.eventstore.dbclient.NoClusterNodeFoundException: null
    at com.eventstore.dbclient.ClusterDiscovery.discover(ClusterDiscovery.java:79)
    at com.eventstore.dbclient.ClusterDiscovery.lambda$run$2(ClusterDiscovery.java:42)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1796)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:507)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1491)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:2073)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2035)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:187)
Error when running discovery process
2024-07-08 10:39:40,074 [DEBUG] [] [] [] [esdb-client-4ff4b4e0-38a4-4030-a0ef-be30fde11ae6] [com.eventstore.dbclient.ConnectionService] Start connection attempt (3/3)
2024-07-08 10:39:40,074 [DEBUG] [] [] [] [ForkJoinPool.commonPool-worker-1] [com.eventstore.dbclient.ClusterDiscovery] Using seed node [nodes.dev-xxx/13.55.106.44:443] for cluster node discovery.
2024-07-08 10:39:42,078 [ERROR] [] [] [] [ForkJoinPool.commonPool-worker-1] [com.eventstore.dbclient.ClusterDiscovery] java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 1.997705374s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[buffered_nanos=1999385336, waiting_for_connection]]]
    at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096)
    at com.eventstore.dbclient.ClusterDiscovery.discover(ClusterDiscovery.java:60)
    at com.eventstore.dbclient.ClusterDiscovery.lambda$run$2(ClusterDiscovery.java:42)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1796)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:507)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1491)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:2073)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2035)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:187)
Caused by: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 1.997705374s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[buffered_nanos=1999385336, waiting_for_connection]]]
    at io.grpc.Status.asRuntimeException(Status.java:533)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:481)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:574)
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:72)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1570)
Exception during the node selection process
2024-07-08 10:39:42,080 [ERROR] [] [] [] [esdb-client-4ff4b4e0-38a4-4030-a0ef-be30fde11ae6] [com.eventstore.dbclient.ConnectionService] java.util.concurrent.ExecutionException: com.eventstore.dbclient.NoClusterNodeFoundException
    at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
    at com.eventstore.dbclient.ConnectionService.createChannel(ConnectionService.java:130)
    at com.eventstore.dbclient.ConnectionService.process(ConnectionService.java:170)
    at com.eventstore.dbclient.RunWorkItem.accept(RunWorkItem.java:30)
    at com.eventstore.dbclient.ConnectionService.run(ConnectionService.java:46)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: com.eventstore.dbclient.NoClusterNodeFoundException: null
    at com.eventstore.dbclient.ClusterDiscovery.discover(ClusterDiscovery.java:79)
    at com.eventstore.dbclient.ClusterDiscovery.lambda$run$2(ClusterDiscovery.java:42)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1796)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:507)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1491)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:2073)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2035)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:187)
Error when running discovery process
2024-07-08 10:39:42,081 [ERROR] [] [] [] [esdb-client-4ff4b4e0-38a4-4030-a0ef-be30fde11ae6] [com.eventstore.dbclient.ConnectionService] Maximum discovery attempt count reached: 3
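
In the log above, all three attempts go to the same address, 13.55.106.44. The behaviour requested in this issue would instead move on to the next resolved address when one does not answer in time. A rough sketch of such a loop follows. It is an illustration only, not how the client is implemented: the class SeedFailover, the method firstResponding, and the probe function (a stand-in for a gossip read against a single seed) are all hypothetical names.

```java
import java.net.InetSocketAddress;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical helper, not part of the EventStoreDB client.
final class SeedFailover {

    /**
     * Probes each candidate in order and returns the first successful answer,
     * or an empty Optional if none answers within the timeout.
     * The probe is assumed to complete with a non-null value on success.
     */
    static <T> Optional<T> firstResponding(
            List<InetSocketAddress> candidates,
            Function<InetSocketAddress, CompletableFuture<T>> probe,
            long timeoutMillis) {
        for (InetSocketAddress candidate : candidates) {
            CompletableFuture<T> pending = probe.apply(candidate);
            try {
                return Optional.of(pending.get(timeoutMillis, TimeUnit.MILLISECONDS));
            } catch (TimeoutException | ExecutionException e) {
                // This seed is down or failed: give up on it and try the next.
                pending.cancel(true);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop probing.
                pending.cancel(true);
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
        return Optional.empty();
    }
}
```

With the three addresses from the host output above as candidates, a loop of this shape would spend one timeout on the node that is down and then reach one of the two that are up, rather than using every attempt on the same address.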