:+1: I've been running into this a lot on my acs-engine-generated clusters. I've also run Rancher-generated clusters and hit the same issue. I do not see the same behavior outside of Azure.
Just an FYI if anybody is looking for a way to work around this issue: we've decided to port our containers to Debian, which fixes the issue. Another workaround we tested was installing unbound, which also works, but caching is obviously not what you really want in a dynamic cluster environment. You also don't want multiple processes running in your containers, but... meh.
Don't even try to fix AKS or acs-engine and save yourself the pain - let's just wait for the kernel patches to land. If you're running ACS you can install Calico to fix this issue, as Calico doesn't require iptables for DNS resolution.
We fixed this by switching DNS to TCP.
Just be careful switching to TCP for your DNS. We have built 4 new AKS clusters in the last 24 hours and none of them seem to be able to connect to the internal Azure DNS forwarders via TCP. We're having to update our VNet DNS to a custom resolver (Cloudflare / Google).
Also, does this mean that, counter-intuitively, which image you're running inside a container matters? I.e. does Alpine still not work with this "fix"?
It does. The issue with using Alpine is that the musl library doesn't support the single-request or TCP options for DNS requests.
We also encountered this issue and have identified a fix by adding the single-request-reopen config value to our DNS policy.
We had been using a curl command like the one below within a loop to test for the presence of the issue and observed consistent (every 5-10 calls) DNS lookup delays:
curl -k -o /dev/null -s -w "DNS-Lookup [%{time_namelookup}] Time-Connect [%{time_connect}] Time-PreTransfer [%{time_pretransfer}] Time-StartTransfer [%{time_starttransfer}] Total-Time [%{time_total}] Response-Code [%{http_code}]\n" https://www.microsoft.com
Our solution to this was adding the single-request-reopen flag to /etc/resolv.conf, which seemed to solve the issue - we experimented with TCP but that seemed to incur a consistent DNS latency of around 1s.
We have added pod DNS policies such as the one below to resolve this in our setup (which in turn modifies the /etc/resolv.conf within our containers):
dnsConfig:
options:
- name: single-request-reopen
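For anyone who wants the full context, here is a minimal sketch of where that dnsConfig block sits in a complete Pod spec (the name, image, and command below are placeholders for illustration, not taken from our setup):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-workaround-example      # placeholder name
spec:
  dnsPolicy: ClusterFirst           # keep cluster DNS; dnsConfig is merged on top of it
  dnsConfig:
    options:
    - name: single-request-reopen   # written into the container's /etc/resolv.conf
  containers:
  - name: app
    image: debian:stretch-slim      # needs a glibc-based image; musl (Alpine) ignores this option
    command: ["sleep", "infinity"]
```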
I should mention this is likely to be the underlying cause... https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts
We can confirm this is indeed an issue and that the above resolution is a confirmed workaround.
It is the issue, and the above is a workaround, not the final fix.
To say this is the fix is to say AKS can’t support Alpine.
Thank you for the correction @bhicks329. You are correct, it is not a final fix but a workaround. We're seeing this on multiple distros, not just Alpine.
Hi guys, I've been struggling with the same issue. Some of our images are built upon Alpine, and adding single-request-reopen doesn't solve the delays. Were you able to make it work on an Alpine image?
@Nilubkal - Alpine doesn't support those options. You will have to switch to a non-musl image like Debian for now.
@bhicks329 Thanks! I'll go for some of the "slim" images (jessie-slim).
Any timeframe for this being fixed?
Ran into this today on a cluster running 1.11.3. The dnsConfig workaround didn't change the results, and using debian:7.11-slim has the error as well, so switching to Debian doesn't seem to be a fix currently.
We don't have the issue on a 1.9.9 cluster (but that cluster has routing issues due to using the kubenet plugin rather than Azure CNI, so we can't use it for any scaled-out loads).
@jleechp-occm are you sure you applied the dnsConfig fix and migrated the Docker image to slim on the Pod making the requests rather than the one receiving them?
I applied the workaround yesterday to a Pod of mine running on AKS West US 2, K8s version 1.11.3 as well, and it fixed the issue.
Make sure your Pod is also pulling the new debian:7.11-slim image; review its imagePullPolicy.
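Roughly, the relevant part of the Pod template should end up looking like this (a sketch only; the container name is arbitrary):

```yaml
spec:
  containers:
  - name: debian
    image: debian:7.11-slim       # glibc-based image so the resolver option is honoured
    imagePullPolicy: Always       # force the node to pull the tag instead of reusing a cached image
  dnsConfig:
    options:
    - name: single-request-reopen
```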
@jleechp-occm Yep, @Urik is right. I can confirm that's the way we solved it with our images, which were Alpine-based.
@Nilubkal, @Urik, still getting those results, and I destroyed the replica set to make sure it was pulling the correct config (just in case I had somehow made changes and they didn't take effect).
Edit: We're on East US and using Advanced Networking (set up using Terraform with the Azure network plugin). The DNS issues also impact querying DNS for records on our peered VNets.
Deployment config (kubectl run debian and then edited in the dnsConfig):
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "4"
creationTimestamp: 2018-11-09T15:15:15Z
generation: 4
labels:
run: debian
name: debian
namespace: default
resourceVersion: "547077"
selfLink: /apis/extensions/v1beta1/namespaces/default/deployments/debian
uid: 42fa737c-e432-11e8-8369-d2e233eae55b
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
run: debian
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 1
type: RollingUpdate
template:
metadata:
creationTimestamp: null
labels:
run: debian
spec:
containers:
- args:
- bash
image: debian:7.11-slim
imagePullPolicy: Always
name: debian
resources: {}
stdin: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
tty: true
dnsConfig:
options:
- name: single-request-reopen
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
status:
availableReplicas: 1
conditions:
- lastTransitionTime: 2018-11-09T15:15:15Z
lastUpdateTime: 2018-11-09T15:15:15Z
message: Deployment has minimum availability.
reason: MinimumReplicasAvailable
status: "True"
type: Available
- lastTransitionTime: 2018-11-09T15:15:15Z
lastUpdateTime: 2018-11-09T15:23:49Z
message: ReplicaSet "debian-7f47487679" has successfully progressed.
reason: NewReplicaSetAvailable
status: "True"
type: Progressing
observedGeneration: 4
readyReplicas: 1
replicas: 1
updatedReplicas: 1
resolv.conf
root@debian-7f47487679-n4lpn:/# cat /etc/resolv.conf
nameserver 10.0.0.10
search default.svc.cluster.local svc.cluster.local cluster.local 5x3tq5lz12au5atnsatgdwdc3h.bx.internal.cloudapp.net
options ndots:5 single-request-reopen
Curl results:
root@debian-7f47487679-n4lpn:/# for i in {1..10}; do curl -k -o /dev/null -s -w "DNS-Lookup [%{time_namelookup}] Time-Connect [%{time_connect}] Time-PreTransfer [%{time_pretransfer}] Time-StartTransfer [%{time_starttransfer}] Total-Time [%{time_total}] Response-Code [%{http_code}]\n" https://www.microsoft.com; done
DNS-Lookup [10.051] Time-Connect [10.054] Time-PreTransfer [10.062] Time-StartTransfer [10.073] Total-Time [10.073] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.013] Time-PreTransfer [0.021] Time-StartTransfer [0.033] Total-Time [0.033] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.012] Time-PreTransfer [0.021] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
DNS-Lookup [0.011] Time-Connect [0.014] Time-PreTransfer [0.024] Time-StartTransfer [0.036] Total-Time [0.036] Response-Code [200]
DNS-Lookup [0.011] Time-Connect [0.014] Time-PreTransfer [0.022] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
DNS-Lookup [0.006] Time-Connect [0.009] Time-PreTransfer [0.018] Time-StartTransfer [0.031] Total-Time [0.031] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.012] Time-PreTransfer [0.022] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.013] Time-PreTransfer [0.030] Time-StartTransfer [0.043] Total-Time [0.043] Response-Code [200]
DNS-Lookup [0.058] Time-Connect [0.061] Time-PreTransfer [0.070] Time-StartTransfer [0.082] Total-Time [0.082] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.012] Time-PreTransfer [0.022] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
root@debian-7f47487679-n4lpn:/# for i in {1..10}; do curl -k -o /dev/null -s -w "DNS-Lookup [%{time_namelookup}] Time-Connect [%{time_connect}] Time-PreTransfer [%{time_pretransfer}] Time-StartTransfer [%{time_starttransfer}] Total-Time [%{time_total}] Response-Code [%{http_code}]\n" https://www.microsoft.com; done
DNS-Lookup [8.043] Time-Connect [8.045] Time-PreTransfer [8.054] Time-StartTransfer [8.067] Total-Time [8.067] Response-Code [200]
DNS-Lookup [0.008] Time-Connect [0.010] Time-PreTransfer [0.022] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.013] Time-PreTransfer [0.023] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
DNS-Lookup [0.008] Time-Connect [0.011] Time-PreTransfer [0.021] Time-StartTransfer [0.034] Total-Time [0.035] Response-Code [200]
DNS-Lookup [0.007] Time-Connect [0.009] Time-PreTransfer [0.018] Time-StartTransfer [0.032] Total-Time [0.033] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.008] Time-PreTransfer [0.019] Time-StartTransfer [0.033] Total-Time [0.033] Response-Code [200]
DNS-Lookup [0.007] Time-Connect [0.010] Time-PreTransfer [0.020] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.008] Time-PreTransfer [0.017] Time-StartTransfer [0.029] Total-Time [0.029] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.007] Time-PreTransfer [0.016] Time-StartTransfer [0.028] Total-Time [0.028] Response-Code [200]
DNS-Lookup [0.006] Time-Connect [0.010] Time-PreTransfer [0.020] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
Hmmm I just tried running your deployment, running bash on the container and running the curl command and had no abnormal results at all.
DNS-Lookup [0.016] Time-Connect [0.021] Time-PreTransfer [0.034] Time-StartTransfer [0.046] Total-Time [0.046] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.015] Time-PreTransfer [0.028] Time-StartTransfer [0.042] Total-Time [0.042] Response-Code [200]
DNS-Lookup [0.014] Time-Connect [0.019] Time-PreTransfer [0.033] Time-StartTransfer [0.048] Total-Time [0.048] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.014] Time-PreTransfer [0.027] Time-StartTransfer [0.040] Total-Time [0.040] Response-Code [200]
DNS-Lookup [0.011] Time-Connect [0.016] Time-PreTransfer [0.029] Time-StartTransfer [0.046] Total-Time [0.046] Response-Code [200]
DNS-Lookup [0.011] Time-Connect [0.016] Time-PreTransfer [0.029] Time-StartTransfer [0.045] Total-Time [0.045] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.016] Time-PreTransfer [0.030] Time-StartTransfer [0.044] Total-Time [0.044] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.015] Time-PreTransfer [0.027] Time-StartTransfer [0.040] Total-Time [0.040] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.015] Time-PreTransfer [0.029] Time-StartTransfer [0.042] Total-Time [0.042] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.015] Time-PreTransfer [0.027] Time-StartTransfer [0.038] Total-Time [0.039] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.014] Time-PreTransfer [0.031] Time-StartTransfer [0.044] Total-Time [0.044] Response-Code [200]
DNS-Lookup [0.012] Time-Connect [0.017] Time-PreTransfer [0.030] Time-StartTransfer [0.044] Total-Time [0.044] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.015] Time-PreTransfer [0.031] Time-StartTransfer [0.044] Total-Time [0.044] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.015] Time-PreTransfer [0.027] Time-StartTransfer [0.039] Total-Time [0.039] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.015] Time-PreTransfer [0.029] Time-StartTransfer [0.042] Total-Time [0.042] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.015] Time-PreTransfer [0.028] Time-StartTransfer [0.042] Total-Time [0.042] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.014] Time-PreTransfer [0.027] Time-StartTransfer [0.041] Total-Time [0.041] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.014] Time-PreTransfer [0.027] Time-StartTransfer [0.040] Total-Time [0.040] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.014] Time-PreTransfer [0.027] Time-StartTransfer [0.042] Total-Time [0.042] Response-Code [200]
root@debian-7f47487679-c92vk:/#
Then I commented out the dnsConfig object, deleted and relaunched the deployment, and these are my results:
DNS-Lookup [0.009] Time-Connect [0.014] Time-PreTransfer [0.036] Time-StartTransfer [0.055] Total-Time [0.055] Response-Code [200]
DNS-Lookup [5.017] Time-Connect [5.022] Time-PreTransfer [5.037] Time-StartTransfer [5.050] Total-Time [5.050] Response-Code [200]
DNS-Lookup [0.007] Time-Connect [0.012] Time-PreTransfer [0.026] Time-StartTransfer [0.039] Total-Time [0.039] Response-Code [200]
DNS-Lookup [0.008] Time-Connect [0.013] Time-PreTransfer [0.027] Time-StartTransfer [0.039] Total-Time [0.039] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.010] Time-PreTransfer [0.029] Time-StartTransfer [0.041] Total-Time [0.041] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.010] Time-PreTransfer [0.023] Time-StartTransfer [0.037] Total-Time [0.037] Response-Code [200]
DNS-Lookup [5.015] Time-Connect [5.020] Time-PreTransfer [5.037] Time-StartTransfer [5.049] Total-Time [5.049] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.010] Time-PreTransfer [0.023] Time-StartTransfer [0.035] Total-Time [0.035] Response-Code [200]
DNS-Lookup [5.015] Time-Connect [5.020] Time-PreTransfer [5.033] Time-StartTransfer [5.048] Total-Time [5.048] Response-Code [200]
DNS-Lookup [5.016] Time-Connect [5.021] Time-PreTransfer [5.034] Time-StartTransfer [5.048] Total-Time [5.048] Response-Code [200]
DNS-Lookup [0.006] Time-Connect [0.011] Time-PreTransfer [0.028] Time-StartTransfer [0.043] Total-Time [0.043] Response-Code [200]
DNS-Lookup [5.018] Time-Connect [5.023] Time-PreTransfer [5.039] Time-StartTransfer [5.051] Total-Time [5.051] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.010] Time-PreTransfer [0.030] Time-StartTransfer [0.043] Total-Time [0.043] Response-Code [200]
DNS-Lookup [5.012] Time-Connect [5.017] Time-PreTransfer [5.030] Time-StartTransfer [5.045] Total-Time [5.045] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.010] Time-PreTransfer [0.023] Time-StartTransfer [0.037] Total-Time [0.037] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.010] Time-PreTransfer [0.023] Time-StartTransfer [0.037] Total-Time [0.037] Response-Code [200]
DNS-Lookup [0.013] Time-Connect [0.018] Time-PreTransfer [0.032] Time-StartTransfer [0.046] Total-Time [0.046] Response-Code [200]
DNS-Lookup [0.006] Time-Connect [0.011] Time-PreTransfer [0.024] Time-StartTransfer [0.039] Total-Time [0.039] Response-Code [200]
DNS-Lookup [5.010] Time-Connect [5.016] Time-PreTransfer [5.030] Time-StartTransfer [5.043] Total-Time [5.043] Response-Code [200]
What region is your cluster on? I'll try executing it there.
Tried it in East US, same results. Consistent 5-second lookup times without dnsConfig, no delays with dnsConfig set to single-request-reopen.
We just encountered this on debian:jessie - our symptom was a 10x performance degradation on our Apache Storm clusters. We worked around this with single-request-reopen in resolv.conf as described above.
As all AKS regions now deliver clusters backed by VMs running the 4.15.0-1030-azure version of the Ubuntu kernel (which, in tl;dr terms, "fixes 5 second lookups for alpine-based images"), could we get any stories from folks running 4.15.0-1030-azure? Do we still have issues on that kernel?
For folks running pre-existing clusters (the kernel patch was released just last week), an upgrade/scale will be required to get that kernel version on your nodes, FYI.
Can we remove the workaround that was provided earlier in this thread?
> Our solution to this was adding the single-request-reopen flag to /etc/resolv.conf
@jackfrancis Can you explain what was done in this image?
Only 1 of 3 possible races has been fixed in the kernel.
@joekohlsdorf According to the author of the kernel fix, he has submitted 2 fixes, one of which has been accepted and is present in that kernel version; the other is still in review. See the "Kernel Fix" section here:
https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts
Upgraded AKS cluster from 1.11.3 to 1.11.4 and it seems to have solved the DNS problem.
@jackfrancis, the second fix was accepted - any way to get it into the next VHD?
Hi @shaikatz Which component received the fix above?
Looks to still be an issue. This is running on an Alpine Linux pod (.NET Core runtime) on an AKS cluster running 1.14.6 that I just created yesterday.
Also occurs on the 1.14.3 cluster I created a month or so ago, but not on the 1.14.3 cluster I created a couple of weeks after it.
root@xxxx-5c9d8f946c-qjckv:/app# for i in {1..10}; do curl -k -o /dev/null -s -w "DNS-Lookup [%{time_namelookup}] Time-Connect [%{time_connect}] Time-PreTransfer [%{time_pretransfer}] Time-StartTransfer [%{time_starttransfer}] Total-Time [%{time_total}] Response-Code [%{http_code}]\n" https://www.microsoft.com; done
DNS-Lookup [5.514728] Time-Connect [5.553433] Time-PreTransfer [5.639539] Time-StartTransfer [5.900052] Total-Time [5.900158] Response-Code [200]
DNS-Lookup [0.012360] Time-Connect [0.048973] Time-PreTransfer [0.130598] Time-StartTransfer [0.172433] Total-Time [0.172515] Response-Code [200]
DNS-Lookup [10.519981] Time-Connect [10.556604] Time-PreTransfer [10.638080] Time-StartTransfer [10.680471] Total-Time [10.680590] Response-Code [200]
DNS-Lookup [0.012308] Time-Connect [0.051043] Time-PreTransfer [0.136957] Time-StartTransfer [0.182661] Total-Time [0.182709] Response-Code [200]
DNS-Lookup [0.012330] Time-Connect [0.053787] Time-PreTransfer [0.144233] Time-StartTransfer [0.190418] Total-Time [0.190478] Response-Code [200]
DNS-Lookup [0.012337] Time-Connect [0.051147] Time-PreTransfer [0.136368] Time-StartTransfer [0.180897] Total-Time [0.180985] Response-Code [200]
DNS-Lookup [0.012311] Time-Connect [0.051006] Time-PreTransfer [0.137882] Time-StartTransfer [0.182414] Total-Time [0.182584] Response-Code [200]
DNS-Lookup [0.012246] Time-Connect [0.053550] Time-PreTransfer [0.145398] Time-StartTransfer [0.192282] Total-Time [0.192329] Response-Code [200]
DNS-Lookup [0.012312] Time-Connect [0.048700] Time-PreTransfer [0.130747] Time-StartTransfer [0.172230] Total-Time [0.172287] Response-Code [200]
DNS-Lookup [0.012296] Time-Connect [0.051079] Time-PreTransfer [0.137897] Time-StartTransfer [0.181873] Total-Time [0.181924] Response-Code [200]
Seems it was fixed and is now present again. We faced the problem, researched it, and could not reproduce it. Now it occurs again on our clusters with 4.15.0-1060-azure kernel nodes.
I am still facing the same issue. The DNS is throwing intermittent 403 errors.
# for i in {1..100}; do curl -k -o /dev/null -s -w "DNS-Lookup [%{time_namelookup}] Time-Connect [%{time_connect}] Time-PreTransfer [%{time_pretransfer}] Time-StartTransfer [%{time_starttransfer}] Total-Time [%{time_total}] Response-Code [%{http_code}]\n" https://www.microsoft.com; done
DNS-Lookup [0.004265] Time-Connect [0.009595] Time-PreTransfer [0.030827] Time-StartTransfer [0.042685] Total-Time [0.042716] Response-Code [403]
DNS-Lookup [0.004171] Time-Connect [0.009412] Time-PreTransfer [0.029765] Time-StartTransfer [0.041380] Total-Time [0.041429] Response-Code [403]
DNS-Lookup [0.004227] Time-Connect [0.009382] Time-PreTransfer [0.030158] Time-StartTransfer [0.041474] Total-Time [0.041520] Response-Code [403]
DNS-Lookup [0.004184] Time-Connect [0.009496] Time-PreTransfer [0.031301] Time-StartTransfer [0.044301] Total-Time [0.044332] Response-Code [403]
DNS-Lookup [0.004644] Time-Connect [0.009940] Time-PreTransfer [0.036492] Time-StartTransfer [0.063031] Total-Time [0.063060] Response-Code [403]
DNS-Lookup [0.004192] Time-Connect [0.009530] Time-PreTransfer [0.030280] Time-StartTransfer [0.042634] Total-Time [0.042704] Response-Code [403]
DNS-Lookup [0.004211] Time-Connect [0.009625] Time-PreTransfer [0.030305] Time-StartTransfer [0.041616] Total-Time [0.041708] Response-Code [403]
DNS-Lookup [0.004229] Time-Connect [0.009775] Time-PreTransfer [0.030433] Time-StartTransfer [0.043312] Total-Time [0.043340] Response-Code [403]
DNS-Lookup [0.004217] Time-Connect [0.009817] Time-PreTransfer [0.032401] Time-StartTransfer [0.045907] Total-Time [0.045938] Response-Code [403]
DNS-Lookup [0.004250] Time-Connect [0.009633] Time-PreTransfer [0.030357] Time-StartTransfer [0.043249] Total-Time [0.043294] Response-Code [403]
DNS-Lookup [0.004432] Time-Connect [0.009969] Time-PreTransfer [0.031895] Time-StartTransfer [0.043061] Total-Time [0.043091] Response-Code [403]
DNS-Lookup [0.004172] Time-Connect [0.009507] Time-PreTransfer [0.030370] Time-StartTransfer [0.042121] Total-Time [0.042150] Response-Code [403]
DNS-Lookup [0.004211] Time-Connect [0.009604] Time-PreTransfer [0.030194] Time-StartTransfer [0.045623] Total-Time [0.045664] Response-Code [403]
DNS-Lookup [0.004172] Time-Connect [0.009656] Time-PreTransfer [0.029532] Time-StartTransfer [0.041521] Total-Time [0.041548] Response-Code [403]
DNS-Lookup [0.004207] Time-Connect [0.009613] Time-PreTransfer [0.033558] Time-StartTransfer [0.050680] Total-Time [0.050757] Response-Code [403]
DNS-Lookup [0.004263] Time-Connect [0.009582] Time-PreTransfer [0.030867] Time-StartTransfer [0.043268] Total-Time [0.043296] Response-Code [403]
DNS-Lookup [0.004188] Time-Connect [0.009518] Time-PreTransfer [0.030763] Time-StartTransfer [0.043596] Total-Time [0.043623] Response-Code [403]
DNS-Lookup [0.004191] Time-Connect [0.009637] Time-PreTransfer [0.030120] Time-StartTransfer [0.041989] Total-Time [0.042026] Response-Code [403]
DNS-Lookup [0.004154] Time-Connect [0.009565] Time-PreTransfer [0.030270] Time-StartTransfer [0.043543] Total-Time [0.043572] Response-Code [403]
DNS-Lookup [0.004216] Time-Connect [0.009702] Time-PreTransfer [0.029827] Time-StartTransfer [0.042480] Total-Time [0.042509] Response-Code [403]
> Seems it was fixed and is present again. We faced the problem. Researched and could not reproduce it. Now it occurs again on our clusters with 4.15.0-1060-azure kernel nodes.
Same observed with nodes running on 4.15.0-1061-azure. Intermittent 2.5-second DNS lookup delays. Using Alpine images almost exclusively, so we had to move our traffic to GKE for the time being. Works fine there.
As moving our workloads to another cloud provider is not an option for us, we have added this workaround for now:
dnsConfig:
  options:
  - name: single-request-reopen
added in the spec of the deployment. See also https://github.com/Azure/AKS/issues/667
The DNS issue was a real problem for us, as some services had connection timeouts configured lower than 5 seconds, which caused a lot of trouble. With the workaround in place it works fine now. We also suffered the issue not only on Alpine-based images but also on Debian-based ones.
Seeing similar issues that we've worked around (temporarily, for only a few services) with a dnsPolicy: Default setting (rough sketch at the end of this comment). Will also try the single-request-reopen option specified above.
We're running Debian-based .NET Core 3 images, and our impacted hosts are showing kernel version 4.15.0-1063-azure.
I should also note that organizationally we have 20+ clusters across multiple business units and environments, and only the two clusters which have been built in the past week show the issue currently.
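For reference, the dnsPolicy part of that temporary workaround looks roughly like this (a sketch; the name is a placeholder and the image is just an example of a Debian-based .NET Core 3 runtime image). Note that dnsPolicy: Default makes the Pod inherit the node's /etc/resolv.conf, so cluster-internal service names will no longer resolve via the cluster DNS service, which limits where it can be applied:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: external-only-workload                        # placeholder name
spec:
  dnsPolicy: Default                                   # use the node's resolv.conf, bypassing cluster DNS
  containers:
  - name: app
    image: mcr.microsoft.com/dotnet/core/runtime:3.0  # example Debian-based image
```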
Unable to recreate with the same image on clusters with nodes running either the 4.15.0-1055-azure or 4.15.0-1060-azure kernel.
I can replicate this issue on a cluster with kernel version 4.15.0-1060-azure
Circling back to close the loop on the occurrence I reported. Our root cause was an NSG rule that denied UDP traffic on the cluster subnet, which caused inter-node requests to time out.
Still seeing this in our AKS cluster running Kubernetes 1.15.7. Applying the dnsConfig change to the deployment works around the problem for us. This has caused quite a bit of end-user frustration at the slowness though, sadly.
Where can we follow the status of these kernel changes? Is this fixed?
Has anyone tried 1.16.7 already? It is available on Azure. We are going to test it this evening (in around 2-3 hours).
@btastic just retested today with 1.16.7 and so far I wasn't able to replicate the issue, whereas it was definitely there a couple of versions ago.
Thanks for reporting back. @jemag do you have the dashboard issue described here? https://github.com/Azure/AKS/issues/1534
@btastic we do not use the dashboard, so I couldn't say.
@btastic @jemag I just tested it on a cluster with k8s 1.16.7 (using Azure CNI). The issue still exists.
What happened: Out-of-the-box AKS clusters that are fully standalone on a subscription have intermittent DNS faults as described in kubernetes #56903. This breaks tooling that has short timeouts and significantly impacts application performance.
What you expected to happen: DNS calls and egress to complete in a timely manner with default settings
How to reproduce it (as minimally and precisely as possible): Create a fresh AKS cluster using default settings in any region, run a busybox or ubuntu shell inside the cluster using something like this:
kubectl run my-shell --rm -i --tty --image ubuntu -- bash
Once that shell launches, run something like this until a 5s delay is observed:
apt update; apt install -y curl
for i in `seq 1 50`; do time curl -s google.com > /dev/null; done
Anything else we need to know?: acs-engine examples are also impacted; the issue seems to span multiple versions of Kubernetes. GKE does not seem to be impacted by these issues, but we're still digging into how they've worked around it to see if we can mimic their solution with ARM templates.
These are high-quality write-ups on the issue: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02
Environment:
Kubernetes version (kubectl version): 1.11.2, but others are also impacted; we have not assessed the full scope of impact.