camunda / camunda-platform-helm

Camunda Platform 8 Self-Managed Helm charts
https://docs.camunda.io/docs/self-managed/overview/
Apache License 2.0

[ISSUE] Zeebe shows DNS errors in benchmarks #1754

Open npepinpe opened 8 months ago

npepinpe commented 8 months ago

Describe the bug

Our benchmarks regularly report DNS errors, roughly 2-3 per second. There doesn't seem to be a major impact yet, but we should investigate what's happening. The current hypothesis is that it's a setup issue, not an issue in Zeebe itself.

To Reproduce

Look at any recent benchmark, and lower the log level for the DNS resolver:

```shell
# Forward the broker's management port to localhost.
kubectl port-forward pod/zeebe-0 9600:9600

# Enable debug logging for Netty's DNS query observer via the Spring Boot actuator.
curl 'http://localhost:9600/actuator/loggers/io.netty.resolver.dns.LoggingDnsQueryLifeCycleObserverFactory' \
  -i -X POST -H 'Content-Type: application/json' \
  -d '{"configuredLevel":"debug"}'
```
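When you're done, the logger can be reset the same way; a quick sketch, relying on Spring Boot's documented actuator behavior where a null `configuredLevel` reverts the logger to its inherited level:

```shell
# Reset the logger once the investigation is done (a null configuredLevel
# reverts to the level inherited from the parent logger).
curl 'http://localhost:9600/actuator/loggers/io.netty.resolver.dns.LoggingDnsQueryLifeCycleObserverFactory' \
  -i -X POST -H 'Content-Type: application/json' \
  -d '{"configuredLevel":null}'
```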

Expected behavior

No DNS errors during a normal run.

Log/Stacktrace

Full Stacktrace

```json
{
  "severity": "DEBUG",
  "logging.googleapis.com/sourceLocation": {
    "function": "log",
    "file": "AbstractInternalLogger.java",
    "line": 214
  },
  "message": "from /10.4.0.10:53 : DefaultDnsQuestion(np-dns-fix-zeebe-2.np-dns-fix-zeebe.np-dns-fix.svc.np-dns-fix.svc.cluster.local. IN CNAME) failure",
  "serviceContext": { "service": "zeebe", "version": "np-dns-fix" },
  "context": {
    "threadId": 43,
    "threadPriority": 5,
    "loggerName": "io.netty.resolver.dns.LoggingDnsQueryLifeCycleObserverFactory",
    "threadName": "netty-messaging-event-epoll-client-2"
  },
  "exception": "io.netty.resolver.dns.DnsResolveContext$DnsResolveContextException: No answer found and NXDOMAIN response code returned\n\tat io.netty.resolver.dns.DnsResolveContext.onResponse(..)(Unknown Source) ~[netty-resolver-dns-4.1.100.Final.jar:4.1.100.Final]\n",
  "timestampSeconds": 1698153473,
  "timestampNanos": 921992259
}
{
  "severity": "DEBUG",
  "logging.googleapis.com/sourceLocation": {
    "function": "log",
    "file": "AbstractInternalLogger.java",
    "line": 191
  },
  "message": "DefaultDnsQuestion(np-dns-fix-zeebe-2.np-dns-fix-zeebe.np-dns-fix.svc.np-dns-fix.svc.cluster.local. IN CNAME) query never written and failed",
  "serviceContext": { "service": "zeebe", "version": "np-dns-fix" },
  "context": {
    "threadId": 43,
    "threadPriority": 5,
    "loggerName": "io.netty.resolver.dns.LoggingDnsQueryLifeCycleObserverFactory",
    "threadName": "netty-messaging-event-epoll-client-2"
  },
  "exception": "io.netty.resolver.dns.DnsResolveContext$DnsResolveContextException: No name servers returned an answer\n\tat io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(..)(Unknown Source) ~[netty-resolver-dns-4.1.100.Final.jar:4.1.100.Final]\n",
  "timestampSeconds": 1698153473,
  "timestampNanos": 923597615
}
```

Environment:

megglos commented 8 months ago

ZDP-Triage:

lenaschoenburg commented 8 months ago

In my opinion this is caused by `initialContactPoints` (and `advertisedHost`) looking like this: `$(K8S_NAME).$(K8S_SERVICE_NAME).$(K8S_NAMESPACE).svc`, while the resolver configuration is this: `search <namespace>.svc.cluster.local svc.cluster.local cluster.local`.

To me this sounds like the addresses are resolved by first looking up `$(K8S_NAME).$(K8S_SERVICE_NAME).$(K8S_NAMESPACE).svc.<namespace>.svc.cluster.local`, then `$(K8S_NAME).$(K8S_SERVICE_NAME).$(K8S_NAMESPACE).svc.svc.cluster.local`, and only then the correct `$(K8S_NAME).$(K8S_SERVICE_NAME).$(K8S_NAMESPACE).svc.cluster.local`.

If I understand correctly, the fix is to simplify the way the Helm charts generate addresses. Just using `$(K8S_NAME).$(K8S_SERVICE_NAME)` would be ideal, because it immediately expands to a resolvable address: `$(K8S_NAME).$(K8S_SERVICE_NAME).<namespace>.svc.cluster.local`.
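This search-path behavior is easy to confirm from inside a pod; a minimal sketch, with the pod name taken from the reproduce steps above but otherwise illustrative:

```shell
# Kubernetes injects search domains and ndots:5 into each pod's resolver
# configuration; any name with fewer than five dots is expanded with each
# search domain in order before being tried as-is.
kubectl exec zeebe-0 -- cat /etc/resolv.conf
# search <namespace>.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5

# A contact point ending in ".svc" has only three dots, so each resolution
# walks the search list, matching the NXDOMAIN failures in the logs above:
#   $(K8S_NAME).$(K8S_SERVICE_NAME).$(K8S_NAMESPACE).svc.<namespace>.svc.cluster.local  -> NXDOMAIN
#   $(K8S_NAME).$(K8S_SERVICE_NAME).$(K8S_NAMESPACE).svc.svc.cluster.local              -> NXDOMAIN
#   $(K8S_NAME).$(K8S_SERVICE_NAME).$(K8S_NAMESPACE).svc.cluster.local                  -> resolves
```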

npepinpe commented 8 months ago

Seems easy enough to test out :+1:

Once we confirm this works, we can open a PR to the Helm repo and ping the controller team about it (I don't know yet whether this affects them).

Zelldon commented 8 months ago

I guess this is my fault: https://github.com/camunda/camunda-platform-helm/commit/7a75a777e10c7da6a55c6803217a4a8966886b2a

I'm pretty sure I checked the DNS names at some point, and also this page: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#a-aaaa-records 🤷🏼

npepinpe commented 8 months ago

By the way, Ole's suggestion works. Setting the following fixes all DNS errors:

```yaml
- name: ZEEBE_BROKER_NETWORK_ADVERTISEDHOST
  value: $(K8S_NAME).$(K8S_SERVICE_NAME)
- name: ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS
  value: $(K8S_SERVICE_NAME)-0.$(K8S_SERVICE_NAME):26502, $(K8S_SERVICE_NAME)-1.$(K8S_SERVICE_NAME):26502, $(K8S_SERVICE_NAME)-2.$(K8S_SERVICE_NAME):26502
```
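A quick way to sanity-check the short names from inside a broker pod (a sketch: the pod and service names are illustrative, and getent availability depends on the container base image):

```shell
# The short name contains a single dot, so the first search domain
# (<namespace>.svc.cluster.local) resolves it on the first attempt and the
# Netty resolver logs no NXDOMAIN round-trips.
kubectl exec zeebe-0 -- getent hosts zeebe-0.zeebe
```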
npepinpe commented 1 month ago

Hi all, I'm transferring this issue here as customers have also reported noticing DNS errors. While the impact is quite low, it is alarming for them to see elevated error counts in their metrics, so it would be great if we could fix the initial contact points so these errors no longer show up.

aabouzaid commented 3 weeks ago

@npepinpe Thanks for reporting this :raised_hands: Has the solution above already been tested, or should we investigate the issue from scratch?