Open npepinpe opened 8 months ago
ZDP-Triage:
`np-dns-fix-zeebe-2.np-dns-fix-zeebe.np-dns-fix.svc.np-dns-fix.svc.cluster.local` looks suspicious.
In my opinion this is caused by `initialContactPoints` (and `advertisedHost`) looking like this: `$(K8S_NAME).$(K8S_SERVICE_NAME).$(K8S_NAMESPACE).svc`, while the resolver configuration is this: `search <namespace>.svc.cluster.local svc.cluster.local cluster.local`
To me this sounds like the addresses are resolved by first looking up `$(K8S_NAME).$(K8S_SERVICE_NAME).$(K8S_NAMESPACE).svc.<namespace>.svc.cluster.local`, then `$(K8S_NAME).$(K8S_SERVICE_NAME).$(K8S_NAMESPACE).svc.svc.cluster.local`, and then finally the correct `$(K8S_NAME).$(K8S_SERVICE_NAME).$(K8S_NAMESPACE).svc.cluster.local`.
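To make the lookup order concrete, here is a rough sketch of how a resolv.conf-style resolver expands a name through the search list. It is a simplified model, not Netty's actual implementation: the function name `candidate_names` is made up for illustration, and `ndots=5` is an assumption (it is what Kubernetes normally writes into a pod's resolv.conf, but it is not stated in this issue). The namespace and host name are taken from the log in the issue description.

```python
def candidate_names(name, search_domains, ndots=5):
    """Return the names a resolv.conf-style resolver would try, in order.

    Simplified model (assumption): a name with fewer than `ndots` dots goes
    through the search list first and is tried as-is last.
    """
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name[:-1]]
    expanded = [f"{name}.{domain}" for domain in search_domains]
    if name.count(".") >= ndots:
        return [name] + expanded
    return expanded + [name]


namespace = "np-dns-fix"
search = [f"{namespace}.svc.cluster.local", "svc.cluster.local", "cluster.local"]

# Address shape currently generated by the chart: 3 dots, below ndots=5.
current = "np-dns-fix-zeebe-2.np-dns-fix-zeebe.np-dns-fix.svc"
for attempt in candidate_names(current, search):
    print(attempt)

# Shorter form: the first search domain already yields the right FQDN.
proposed = "np-dns-fix-zeebe-2.np-dns-fix-zeebe"
print(candidate_names(proposed, search)[0])
```

Under these assumptions the first two attempts for the current address are names that cannot exist (which would explain the NXDOMAIN responses), and only the third attempt is the real FQDN.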
If I understand correctly, the fix is to simplify the way the helm charts generate addresses. I think just using `$(K8S_NAME).$(K8S_SERVICE_NAME)` would be ideal because it immediately results in a resolvable address: `$(K8S_NAME).$(K8S_SERVICE_NAME).<namespace>.svc.cluster.local`
Seems easy enough to test out :+1:
Once we confirm this works, we can open a PR to the Helm repo and ping the controller team about it (idk if this affects them yet)
I guess my fault https://github.com/camunda/camunda-platform-helm/commit/7a75a777e10c7da6a55c6803217a4a8966886b2a
I'm pretty sure I checked the dns names once and also this page https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#a-aaaa-records 🤷🏼
By the way, Ole's suggestion works. Setting the following fixes all DNS errors:
- name: ZEEBE_BROKER_NETWORK_ADVERTISEDHOST
value: $(K8S_NAME).$(K8S_SERVICE_NAME)
- name: ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS
value: $(K8S_SERVICE_NAME)-0.$(K8S_SERVICE_NAME):26502, $(K8S_SERVICE_NAME)-1.$(K8S_SERVICE_NAME):26502,
$(K8S_SERVICE_NAME)-2.$(K8S_SERVICE_NAME):26502
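For reference, a small sketch of how the shortened contact point list above is put together for a given broker count. The helper name `initial_contact_points` is hypothetical (it is not part of the Helm chart), the service name is taken from the log in the issue description, and the `<service>-<ordinal>` naming and port 26502 simply mirror the value shown above.

```python
def initial_contact_points(service_name, broker_count, port=26502):
    """Build '<service>-<i>.<service>:<port>' entries, one per broker."""
    return ", ".join(
        f"{service_name}-{i}.{service_name}:{port}" for i in range(broker_count)
    )


# Service name as it appears in the benchmark log (np-dns-fix namespace).
print(initial_contact_points("np-dns-fix-zeebe", 3))
```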
Hi all, I'm transferring this issue here as customers have also reported DNS errors. While the impact is quite low, it is alarming for them to see a higher number of errors in their metrics, so it would be great if we could fix the initial contact points so these errors no longer show up.
@npepinpe Thanks for reporting this :raised_hands: Is the solution above already tested? Or should we investigate the issue from scratch?
Describe the bug
Our benchmarks regularly report DNS errors, about 2-3 per second. There doesn't seem to be a major impact from that yet, but we should investigate what's happening. The hypothesis right now is that it's a setup issue, not an issue in Zeebe itself.
To Reproduce
Look at any recent benchmark, and lower the log level for the DNS resolver:
Expected behavior
No DNS errors during a normal run.
Log/Stacktrace
Full Stacktrace
```json
{
  "severity": "DEBUG",
  "logging.googleapis.com/sourceLocation": {
    "function": "log",
    "file": "AbstractInternalLogger.java",
    "line": 214
  },
  "message": "from /10.4.0.10:53 : DefaultDnsQuestion(np-dns-fix-zeebe-2.np-dns-fix-zeebe.np-dns-fix.svc.np-dns-fix.svc.cluster.local. IN CNAME) failure",
  "serviceContext": {
    "service": "zeebe",
    "version": "np-dns-fix"
  },
  "context": {
    "threadId": 43,
    "threadPriority": 5,
    "loggerName": "io.netty.resolver.dns.LoggingDnsQueryLifeCycleObserverFactory",
    "threadName": "netty-messaging-event-epoll-client-2"
  },
  "exception": "io.netty.resolver.dns.DnsResolveContext$DnsResolveContextException: No answer found and NXDOMAIN response code returned\n\tat io.netty.resolver.dns.DnsResolveContext.onResponse(..)(Unknown Source) ~[netty-resolver-dns-4.1.100.Final.jar:4.1.100.Final]\n",
  "timestampSeconds": 1698153473,
  "timestampNanos": 921992259
}
{
  "severity": "DEBUG",
  "logging.googleapis.com/sourceLocation": {
    "function": "log",
    "file": "AbstractInternalLogger.java",
    "line": 191
  },
  "message": "DefaultDnsQuestion(np-dns-fix-zeebe-2.np-dns-fix-zeebe.np-dns-fix.svc.np-dns-fix.svc.cluster.local. IN CNAME) query never written and failed",
  "serviceContext": {
    "service": "zeebe",
    "version": "np-dns-fix"
  },
  "context": {
    "threadId": 43,
    "threadPriority": 5,
    "loggerName": "io.netty.resolver.dns.LoggingDnsQueryLifeCycleObserverFactory",
    "threadName": "netty-messaging-event-epoll-client-2"
  },
  "exception": "io.netty.resolver.dns.DnsResolveContext$DnsResolveContextException: No name servers returned an answer\n\tat io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(..)(Unknown Source) ~[netty-resolver-dns-4.1.100.Final.jar:4.1.100.Final]\n",
  "timestampSeconds": 1698153473,
  "timestampNanos": 923597615
}
```
Environment: