Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[Feature] Support Node Local DNS Cache #3673

Open damienwebdev opened 1 year ago

damienwebdev commented 1 year ago

Is your feature request related to a problem? Please describe.
I'm a developer using NodeJS to server-side render frontend applications. I'm attempting to improve the TTFB of my renders, and in the course of doing so I'm seeing ~8ms of latency on DNS lookups. The important thing to know here is that NodeJS does not cache DNS lookups, either in-process or between processes (it relies on OS-specific functions such as getaddrinfo and whatever caching the OS provides), leading to a higher-than-expected volume of DNS requests (a small measurement sketch follows the article list below). There are many articles on the topic:

  1. https://httptoolkit.com/blog/configuring-nodejs-dns/
  2. https://adambrodziak.pl/dns-performance-issues-in-kubernetes-cluster
  3. A video by one of the creators of Libuv
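
For what it's worth, here is a minimal measurement sketch (my own illustration, using only Node core) showing the behavior; on a typical container image with no OS-level DNS cache, every iteration pays roughly the same resolver round-trip:

```js
// Repeated dns.lookup() calls go through getaddrinfo every time; Node itself
// keeps no cache, so each lookup costs a resolver round-trip unless the OS
// caches for you (typical container base images do not).
const { lookup } = require('node:dns/promises');

async function main() {
  for (let i = 0; i < 5; i++) {
    const start = process.hrtime.bigint();
    await lookup('example.com'); // hostname is illustrative
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(`lookup ${i + 1}: ${ms.toFixed(2)} ms`);
  }
}

main();
```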

Describe the solution you'd like
I would like to leverage Node Local DNS Cache as described by the Kubernetes team.

Describe alternatives you've considered

  1. https://github.com/Azure/AKS/issues/1642
  2. https://github.com/Azure/AKS/issues/1435
  3. I've also considered implementing keep-alive connections in SSR (see the sketch after this list).
  4. https://github.com/Azure/AKS/issues/1492
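
To illustrate alternative 3, this is roughly what I have in mind; a minimal sketch using only Node core (the host and pool size are placeholders):

```js
// Keep-alive lets the SSR server reuse sockets to its upstream APIs, so
// repeated requests skip both the DNS lookup and the TCP/TLS handshake.
const https = require('node:https');

const keepAliveAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 50, // placeholder pool size
});

https.get('https://api.example.com/render-data', { agent: keepAliveAgent }, (res) => {
  res.resume();
  console.log('status:', res.statusCode);
});
```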

It looks like (in https://github.com/Azure/AKS/issues/1492) the AKS team has already considered this and has done some intense work to improve network capabilities, but I'm confused (and concerned) about why https://github.com/Azure/AKS/issues/1492 was closed without the original feature being implemented. From the outside, it looks like that issue ended up being used as a placeholder for a fix to a completely different problem.

Can someone clarify why https://github.com/Azure/AKS/issues/1492 was closed?

Additionally, @jnoller points out that user-driven attempts to remedy this problem are also subverted by AKS. Can you explain why? Could we get a flag that allows us to switch this to a daemonset? Otherwise I'm left quite confused, and stuck with slow HTTP requests for reasons that seem beyond my control.

damienwebdev commented 1 year ago

This could be closed; it's possible for users to implement this themselves, but it would be nice to have the AKS team document this specifically for AKS.

dengliu commented 1 year ago

Hi @damienwebdev, have you been able to deploy Node Local DNS to AKS? I tried both the official solution from k8s and the suggested AKS solution here; neither of them works on AKS.

timja commented 1 year ago

We have it working on AKS: https://github.com/hmcts/cnp-flux-config/blob/master/apps/kube-system/nodelocaldns/nodelocaldns.yaml

ref: https://github.com/Azure/AKS/issues/1492#issuecomment-640630349 😂
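
For anyone who doesn't want to click through, the interesting part is the Corefile in the ConfigMap. Roughly, a trimmed sketch based on my reading of the upstream nodelocaldns template (10.0.0.10 is the usual AKS kube-dns ClusterIP and may differ in your cluster):

```
cluster.local:53 {
    errors
    cache {
        success 9984 30
        denial 9984 5
    }
    reload
    loop
    bind 169.254.20.10 10.0.0.10   # link-local listen address + kube-dns ClusterIP
    forward . __PILLAR__CLUSTER__DNS__ {   # populated at runtime by node-cache
        force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
}
```

Because the cache also binds the kube-dns ClusterIP, pods keep using their existing resolv.conf and (as far as I understand) the interception is transparent to them.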

Neurobion commented 8 months ago

Hi @timja, it has been a few months since your comment, so I want to ask: are you still using it without any problems, or have you found something more suitable? Thanks

artificial-aidan commented 8 months ago

I just implemented this today and it seems to be working. A few notes for someone new to this:

In @timja's example the DNS IP is 10.0.0.10; this may not be the case for you. You can query it with kubectl get svc kube-dns -n kube-system -o jsonpath={.spec.clusterIP}; the full substitution step is sketched below.

(source: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/)
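
For completeness, the install step I followed is roughly the one from that page; a sketch assuming kube-proxy in iptables mode (the AKS default, as far as I know), with nodelocaldns.yaml being the upstream template linked from those docs:

```sh
# Fill in the placeholders in the upstream nodelocaldns.yaml template
# (iptables mode; the remaining __PILLAR__* values are set at runtime).
kubedns=$(kubectl get svc kube-dns -n kube-system -o jsonpath={.spec.clusterIP})
domain=cluster.local          # default cluster domain
localdns=169.254.20.10        # link-local address node-local-dns listens on

sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml

kubectl apply -f nodelocaldns.yaml
```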

The default memory requests were way too small for my use case; I was seeing 25MB+ of memory used by the node-local caches. Make sure you set that correctly, as having a node-local pod get OOM-killed will result in DNS downtime on that node.
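
Something like this in the DaemonSet's node-cache container spec; the figures are just what fits my usage and are illustrative:

```yaml
# node-local-dns DaemonSet, node-cache container -- illustrative values only
resources:
  requests:
    cpu: 25m
    memory: 64Mi   # upstream default request is much smaller; size this from observed usage
```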

timja commented 8 months ago

It shouldn't get OOM-killed if there's no limit set, but yes, it could probably request more than that if needed.

artificial-aidan commented 8 months ago

If no limit is set, then when a node comes under memory pressure the pod is a higher-priority candidate to be killed if it is using more than its requested memory.

lomboboo commented 3 months ago

@artificial-aidan @timja Have you guys figured it out? We also run an AKS cluster, and after I installed node-local DNS cache as per the Kubernetes docs (except that I had to remove the addonmanager.kubernetes.io/mode: Reconcile label), I don't think it is working as I would expect.

When creating a new pod in that cluster (based on the dnsutils image, for example) and running nslookup google.com, we get Server: 10.0.0.10 instead of Server: 169.254.20.10. I would expect to get 10.0.0.10 on the first call and 169.254.20.10 on all calls after that, since it should be cached by node-local-dns.

I am curious whether it is even supposed to work with AKS, or whether there is anything else that has to be done for it to work in an AKS-managed cluster. Or am I testing it wrongly altogether?

artificial-aidan commented 3 months ago

I think the way I tested it was to look at the DNS query counts in the kube-dns metrics. They went way down once nodelocal was working.
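
A quicker spot check that should also work (assuming the iptables-mode setup, where node-local-dns binds the link-local address on every node) is to query 169.254.20.10 directly from a test pod:

```sh
# A response from 169.254.20.10 means the per-node cache is up and answering
# (the dnsutils pod name is just an example).
kubectl exec -it dnsutils -- nslookup kubernetes.default.svc.cluster.local 169.254.20.10
```

Also, as far as I understand the iptables-mode manifests, the cache binds the kube-dns ClusterIP too and intercepts that traffic transparently, so nslookup still printing Server: 10.0.0.10 doesn't by itself mean the cache is being bypassed.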

lomboboo commented 3 months ago

Thanks for the response.

Can you please elaborate on this a little bit? How did you install nodelocaldns in your AKS?

Did you use curl <coredns-pod-xxx>:9153/metrics to get different metrics? If so, which one did you pay attention to?

artificial-aidan commented 3 months ago

I use Prometheus to scrape all the metrics, so I don't remember exactly where they came from. But both nodelocal and coredns export the cache-hit metric, and you should be able to see the nodelocal metric increasing. I followed the same steps timja did.
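
Roughly like this; the label selector and the 9253 metrics port come from the upstream node-local-dns manifest, and the exact metric names may differ slightly across CoreDNS versions:

```sh
# Port-forward one node-local-dns pod and look at its CoreDNS metrics.
POD=$(kubectl -n kube-system get pods -l k8s-app=node-local-dns \
      -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system port-forward "$POD" 9253:9253 &

# Cache hits and total requests served by the node-local cache; these
# counters should climb while workloads on that node resolve names.
curl -s localhost:9253/metrics | grep -E 'coredns_(cache_hits|dns_requests)_total'
```

For comparison, coredns (kube-dns) itself serves metrics on :9153, and its request rate should drop noticeably once the per-node caching kicks in.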

muadnan commented 1 month ago

> Thanks for the response.
>
> Can you please elaborate on this a little bit? How did you install nodelocaldns in your AKS?
>
> Did you use curl <coredns-pod-xxx>:9153/metrics to get different metrics? If so, which one did you pay attention to?

Hey @lomboboo, have you figured out whether it is working as expected or not?