Open damienwebdev opened 1 year ago
This could be closed; it's possible for users to implement this themselves, but it would be nice to have the AKS team document this specifically for AKS.
Hi @damienwebdev, have you been able to deploy Node Local DNS to AKS? I tried both the official solution from Kubernetes and the suggested AKS solution here; neither of them works on AKS.
Hi @timja, it has been a few months since your comment and I want to ask if you are still using it without any problems or if you have found something more suitable? Thanks
I just implemented this today, seems to be working. A few notes for someone new to this.
In @timja's example, the DNS IP is 10.0.0.10; this may not be the case for you. This command can be used to query the IP:
kubectl get svc kube-dns -n kube-system -o jsonpath={.spec.clusterIP}
(source: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/)
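For anyone scripting this step, here is a minimal Python sketch of looking up the kube-dns IP and filling it into the manifest. It assumes the upstream nodelocaldns.yaml uses the `__PILLAR__DNS__SERVER__`, `__PILLAR__LOCAL__DNS__` and `__PILLAR__DNS__DOMAIN__` placeholders (as in the Kubernetes page linked above), that 169.254.20.10 is the link-local address you want, and that your cluster domain is cluster.local. The function names are hypothetical, and any other placeholders in the manifest are left untouched.

```python
import subprocess


def kube_dns_cluster_ip() -> str:
    """Run the kubectl query quoted in this thread and return the service IP."""
    result = subprocess.run(
        ["kubectl", "get", "svc", "kube-dns", "-n", "kube-system",
         "-o", "jsonpath={.spec.clusterIP}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def render_manifest(template: str, kube_dns_ip: str,
                    local_dns_ip: str = "169.254.20.10",   # assumed link-local address
                    domain: str = "cluster.local") -> str:  # assumed cluster domain
    """Substitute the three placeholders; everything else is passed through."""
    values = {
        "__PILLAR__DNS__SERVER__": kube_dns_ip,
        "__PILLAR__LOCAL__DNS__": local_dns_ip,
        "__PILLAR__DNS__DOMAIN__": domain,
    }
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template
```

Typical use would be `render_manifest(open("nodelocaldns.yaml").read(), kube_dns_cluster_ip())`, then writing the result to a file and applying it with kubectl.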
The default memory requests were way too small for my use case: I was seeing 25 MB+ of memory used by the node-local caches. Make sure you set that correctly, as having a nodelocal pod get OOM-killed will result in DNS downtime on that node.
It shouldn't get OOM-killed if there's no limit set, but yes, it could probably request more than that if it's needed.
If no limit is set and the node comes under memory pressure, the pod will be a higher-priority candidate for being killed if it is using more than its requested memory.
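To make the requests-versus-limits point concrete, here is a rough Python sketch of how Kubernetes assigns a QoS class to a pod. A pod with requests but no limits ends up Burstable, and under node memory pressure the kubelet considers pods that use more than their requests for eviction first. The function name and the dict layout are hypothetical, and quantities are compared as plain strings here, which is a simplification ("1" and "1000m" would not match).

```python
from typing import Dict, List

# Each container is described as {"requests": {...}, "limits": {...}} (hypothetical layout).
Container = Dict[str, Dict[str, str]]


def qos_class(containers: List[Container]) -> str:
    """Approximate the Kubernetes QoS class for a pod's containers."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A request that is omitted defaults to the limit, if one is set.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

For example, `qos_class([{"requests": {"cpu": "25m", "memory": "64Mi"}}])` gives "Burstable"; the values are illustrative only. Raising the memory request so it covers real usage keeps the pod from being an early eviction candidate.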
@artificial-aidan @timja
Have you guys figured it out? We also run an AKS cluster, and after I installed nodelocalcache as per the Kubernetes docs (except I had to remove the addonmanager.kubernetes.io/mode: Reconcile label), I don't think it is working as I would expect.
When creating a new pod in that cluster based on the dnsutils image, for example, and running nslookup google.com, we get Server: 10.0.0.10 instead of Server: 169.254.20.10. I would expect to get 10.0.0.10 on the first call and 169.254.20.10 on all calls after that, since it should be cached by node-local-dns.
I am curious whether it is even supposed to work with AKS, or is there anything else that has to be done in order for it to work in an AKS managed cluster? Or am I testing it wrongly altogether?
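A note for anyone testing the same way: according to the upstream Kubernetes nodelocaldns page, when kube-proxy runs in iptables mode the node-local-dns pods listen on the kube-dns service IP as well as the link-local address, so seeing Server: 10.0.0.10 does not by itself show that the cache is being bypassed. Whether that applies to a given AKS cluster is an assumption to verify. The small Python sketch below just pulls the Server line out of nslookup output so the check can be scripted; the function name is hypothetical.

```python
import re
from typing import Optional


def nslookup_server(output: str) -> Optional[str]:
    """Return the address on the first 'Server:' line of nslookup output, if any."""
    for line in output.splitlines():
        match = re.match(r"\s*Server:\s*(\S+)", line)
        if match:
            return match.group(1)
    return None
```

Feed it the captured output of nslookup google.com from inside the pod and compare the result with the kube-dns service IP and with 169.254.20.10.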
I think the way I tested it was to look at DNS queries on the kube-dns metrics. They went way down once nodelocal was working.
Thanks for the response.
Can you please elaborate on this a little bit? How did you install nodelocaldns in your AKS?
Did you use curl <coredns-pod-xxx>:9153/metrics to get different metrics? If so, which one did you pay attention to?
I use Prometheus to scrape all the metrics; I don't remember where they came from. But both nodelocal and CoreDNS export the cache hit metric, and you should be able to see the nodelocal metric increasing. I followed the same steps timja did.
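If you don't have Prometheus set up, the same check can be approximated by hand from the text returned by the metrics endpoint. The Python sketch below sums a counter across its label sets and computes a hit ratio. It assumes the counters are named coredns_cache_hits_total and coredns_cache_misses_total, which may differ between CoreDNS versions, and the function names are hypothetical.

```python
def sum_counter(metrics_text: str, name: str) -> float:
    """Sum every sample of `name` in Prometheus text-format output."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            metric = line[:line.index("{")]
            rest = line[line.rindex("}") + 1:].split()
        else:
            parts = line.split()
            metric, rest = parts[0], parts[1:]
        if metric == name and rest:
            total += float(rest[0])  # rest[1], if present, is a timestamp
    return total


def cache_hit_ratio(metrics_text: str) -> float:
    """Hits / (hits + misses), or 0.0 when no lookups have been recorded."""
    hits = sum_counter(metrics_text, "coredns_cache_hits_total")      # assumed name
    misses = sum_counter(metrics_text, "coredns_cache_misses_total")  # assumed name
    lookups = hits + misses
    return hits / lookups if lookups else 0.0
```

Run it against the nodelocal pods' metrics output a few minutes apart: the hit counter there should keep climbing if the cache is in the path, while the query rate on kube-dns should drop.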
Hey @lomboboo, have you figured out whether it is working as expected or not?
Is your feature request related to a problem? Please describe.
I'm a developer using NodeJS to server-side render frontend applications. I'm attempting to improve the TTFB of my renders, and in the course of doing so I'm seeing ~8ms of DNS latency when doing DNS lookups. The important thing to know here is that NodeJS does not cache DNS lookups either in-process or between processes (it relies on OS-specific functions and caching like getaddrinfo), leading to a higher than expected volume of DNS requests. There are many articles on the topic:
Describe the solution you'd like
I would like to leverage Node Local DNS Cache as described by the Kubernetes team.
Describe alternatives you've considered
It looks like (in https://github.com/Azure/AKS/issues/1492) the AKS team has already considered this and has already done some intense work to improve network capabilities, but I'm confused (and concerned) about why https://github.com/Azure/AKS/issues/1492 was closed without implementing the original feature. It looks (from the outside) like this feature was used as a "placeholder" as a fix for a completely different issue.
Can someone clarify why https://github.com/Azure/AKS/issues/1492 was closed?
Additionally, @jnoller points out that user-driven attempts to remedy this problem are also subverted by AKS. Can you explain why? Could we get a flag that allows us to switch this to a daemonset? Otherwise, I'm left quite confused and left with slow HTTP requests for a reason that seems beyond me.