Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

AKS clusters are impacted by kube dns issue #56903 #632

Closed. invidious9000 closed this issue 5 years ago

invidious9000 commented 6 years ago

What happened: Out-of-the-box AKS clusters that are fully standalone on a subscription have intermittent DNS faults as described in Kubernetes #56903. This breaks tooling that has short timeouts and significantly impacts application performance.

What you expected to happen: DNS calls and egress to complete in a timely manner with default settings.

How to reproduce it (as minimally and precisely as possible): Create a fresh AKS cluster using default settings in any region, then run a busybox or ubuntu shell inside the cluster with something like:

kubectl run my-shell --rm -i --tty --image ubuntu -- bash

Once that shell launches, run the following until a 5s delay is observed:

apt update && apt install -y curl
for i in `seq 1 50`; do time curl -s google.com > /dev/null; done

Anything else we need to know?: acs-engine examples are also impacted; the issue seems to span multiple versions of Kubernetes. GKE does not seem to be impacted, but we're still digging into how they've worked around it to see whether we can mimic their solution with ARM templates.

These are high-quality write-ups on the issue:

https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts
https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

Environment:

segfault commented 6 years ago

:+1: I've been running into this a lot on my acs-engine-generated clusters. I've also run Rancher-generated clusters and hit the same issue. I do not see the same behavior outside of Azure.

zyrill commented 6 years ago

Just an FYI if anybody is looking for a way to work around this issue: we've decided to port our containers to Debian, which fixes it. Another workaround we tested was installing unbound, which also works, but caching is obviously not what you really want in a dynamic cluster environment. You also don't want multiple processes running in your containers, but... meh.

Don't even try to fix AKS or acs-engine and save yourself the pain; let's just wait for the kernel patches to land. If you're running ACS you can install Calico to fix this issue, as Calico doesn't require iptables for DNS resolution.

posix4e commented 6 years ago

We fixed this by switching DNS to TCP.
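
For anyone wanting to try this route: with the glibc resolver, DNS over TCP can be forced via the use-vc option, which can be injected through a pod dnsConfig. A minimal sketch (standard Kubernetes dnsConfig fields, not something from this thread; note that musl-based images such as Alpine ignore use-vc):

dnsConfig:
  options:
    - name: use-vc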

bhicks329 commented 6 years ago

Just be careful when switching to TCP for your DNS. We have built 4 new AKS clusters in the last 24 hours and none of them seem to be able to connect to the internal Azure DNS forwarders via TCP. We're having to update our VNet DNS to a custom resolver (Cloudflare / Google).
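
For reference, pointing the VNet at a custom resolver can be done with the Azure CLI; a rough sketch with placeholder resource names (nodes generally need a restart or reimage before they pick up new VNet DNS settings):

az network vnet update \
  --resource-group my-aks-rg \
  --name my-aks-vnet \
  --dns-servers 1.1.1.1 8.8.8.8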

zyrill commented 6 years ago

Also, does this mean that, counter-intuitively, the kind of image you're running inside a container matters? I.e., does Alpine still not work with this "fix"?

bhicks329 commented 6 years ago

It does. The issue with using Alpine is that the musl C library doesn't support the single-request or TCP options for DNS requests.
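
A quick way to tell whether an image is musl-based (and therefore won't honour these resolver options) is to look for the musl loader inside a running container; a hedged sketch, with my-pod as a placeholder:

kubectl exec my-pod -- sh -c 'ls /lib/ld-musl-* 2>/dev/null && echo musl || echo "probably glibc"'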

moomzni commented 6 years ago

We also encountered this issue and have identified a fix by adding the single-request-reopen config value to our DNS policy.

We had been using a curl command like the one below within a loop to test for the issue, and observed DNS lookup delays consistently every 5-10 calls:

curl -k -o /dev/null -s -w "DNS-Lookup [%{time_namelookup}] Time-Connect [%{time_connect}] Time-PreTransfer [%{time_pretransfer}] Time-StartTransfer [%{time_starttransfer}] Total-Time [%{time_total}] Response-Code [%{http_code}]\n" https://www.microsoft.com

Our solution was adding the single-request-reopen flag to /etc/resolv.conf, which seemed to solve the issue; we experimented with TCP, but that seemed to incur a consistent DNS latency of around 1s.

We have added a pod dnsConfig such as the one below to resolve this in our setup (which in turn modifies /etc/resolv.conf within our containers).

dnsConfig:
  options:
    - name: single-request-reopen
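
For context, the dnsConfig block sits at the pod spec level (alongside containers), so a bare pod using it looks roughly like this; the name and image are only illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: dns-workaround-test
spec:
  containers:
  - name: shell
    image: debian:stretch-slim
    command: ["sleep", "infinity"]
  dnsConfig:
    options:
      - name: single-request-reopen
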
moomzni commented 6 years ago

I should mention this is likely to be the underlying cause... https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts

jskulavik commented 6 years ago

We can confirm this is indeed an issue and that the above resolution is a confirmed workaround.

bhicks329 commented 6 years ago

It is the issue, and the above is a workaround, not the final fix.

To say this is the fix is to say AKS can’t support Alpine.

jskulavik commented 6 years ago

Thank you for the correction @bhicks329. You are correct, it is not a final fix, but a workaround. We're seeing this on multiple distros, not just Alpine.

Nilubkal commented 6 years ago

Hi guys, I've been struggling with the same issue. Some of our images are built on Alpine, and adding single-request-reopen doesn't solve the delays. Were you able to make it work on an Alpine image?

bhicks329 commented 6 years ago

@Nilubkal - Alpine doesn't support those options. You will have to switch to a non-musl image like Debian for now.

Nilubkal commented 6 years ago

@bhicks329 Thanks! I'll go for some of the "slim" containers (jessie-slim).

jleechp-occm commented 5 years ago

Any timeframe for this being fixed?

Ran into this today on a cluster running 1.11.3. The dnsConfig workaround didn't change the results, and debian:7.11-slim has the error as well, so switching to Debian doesn't seem to be a fix currently.

We don't have the issue on a 1.9.9 cluster (but it has routing issues due to using the kubenet plugin rather than Azure CNI, so we can't use it for any scaled-out loads).

Urik commented 5 years ago

@jleechp-occm are you sure you applied the dnsConfig fix and migrated the Docker image to slim on the Pod making the requests rather than the one receiving them?
I applied the workaround yesterday to a Pod of mine running on AKS West US 2, K8s version 1.11.3 as well, and it fixed the issue.
Make sure your Pod is also pulling the new debian:7.11-slim image, review its imagePullPolicy.
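
One quick way to confirm which image and pull policy a deployment is actually configured with (assuming the deployment is named debian, as in the config below):

kubectl get deployment debian -o jsonpath='{.spec.template.spec.containers[0].image}{"  "}{.spec.template.spec.containers[0].imagePullPolicy}{"\n"}'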

Nilubkal commented 5 years ago

@jleechp-occm Yep, @Urik is right; I can confirm that's the way we solved it with our images, which were Alpine-based.

jleechp-occm commented 5 years ago

@Nilubkal, @Urik, still getting those results, and I destroyed the replica set to make sure it was pulling the correct config (just in case I had somehow made changes that didn't take effect).

Edit: We're on East US and using Advanced Networking (set up using Terraform with the Azure network plugin). The DNS issues also impact querying DNS records on our peered VNets.


Deployment Config (kubectl run debian and then edited in the dnsConfig):

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "4"
  creationTimestamp: 2018-11-09T15:15:15Z
  generation: 4
  labels:
    run: debian
  name: debian
  namespace: default
  resourceVersion: "547077"
  selfLink: /apis/extensions/v1beta1/namespaces/default/deployments/debian
  uid: 42fa737c-e432-11e8-8369-d2e233eae55b
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      run: debian
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        run: debian
    spec:
      containers:
      - args:
        - bash
        image: debian:7.11-slim
        imagePullPolicy: Always
        name: debian
        resources: {}
        stdin: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        tty: true
      dnsConfig:
        options:
        - name: single-request-reopen
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: 2018-11-09T15:15:15Z
    lastUpdateTime: 2018-11-09T15:15:15Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: 2018-11-09T15:15:15Z
    lastUpdateTime: 2018-11-09T15:23:49Z
    message: ReplicaSet "debian-7f47487679" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 4
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

resolv.conf

root@debian-7f47487679-n4lpn:/# cat /etc/resolv.conf
nameserver 10.0.0.10
search default.svc.cluster.local svc.cluster.local cluster.local 5x3tq5lz12au5atnsatgdwdc3h.bx.internal.cloudapp.net
options ndots:5 single-request-reopen

Curl results:

root@debian-7f47487679-n4lpn:/# for i in {1..10}; do curl -k -o /dev/null -s -w "DNS-Lookup [%{time_namelookup}] Time-Connect [%{time_connect}] Time-PreTransfer [%{time_pretransfer}] Time-StartTransfer [%{time_starttransfer}] Total-Time [%{time_total}] Response-Code [%{http_code}]\n" https://www.microsoft.com; done
DNS-Lookup [10.051] Time-Connect [10.054] Time-PreTransfer [10.062] Time-StartTransfer [10.073] Total-Time [10.073] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.013] Time-PreTransfer [0.021] Time-StartTransfer [0.033] Total-Time [0.033] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.012] Time-PreTransfer [0.021] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
DNS-Lookup [0.011] Time-Connect [0.014] Time-PreTransfer [0.024] Time-StartTransfer [0.036] Total-Time [0.036] Response-Code [200]
DNS-Lookup [0.011] Time-Connect [0.014] Time-PreTransfer [0.022] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
DNS-Lookup [0.006] Time-Connect [0.009] Time-PreTransfer [0.018] Time-StartTransfer [0.031] Total-Time [0.031] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.012] Time-PreTransfer [0.022] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.013] Time-PreTransfer [0.030] Time-StartTransfer [0.043] Total-Time [0.043] Response-Code [200]
DNS-Lookup [0.058] Time-Connect [0.061] Time-PreTransfer [0.070] Time-StartTransfer [0.082] Total-Time [0.082] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.012] Time-PreTransfer [0.022] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
root@debian-7f47487679-n4lpn:/# for i in {1..10}; do curl -k -o /dev/null -s -w "DNS-Lookup [%{time_namelookup}] Time-Connect [%{time_connect}] Time-PreTransfer [%{time_pretransfer}] Time-StartTransfer [%{time_starttransfer}] Total-Time [%{time_total}] Response-Code [%{http_code}]\n" https://www.microsoft.com; done
DNS-Lookup [8.043] Time-Connect [8.045] Time-PreTransfer [8.054] Time-StartTransfer [8.067] Total-Time [8.067] Response-Code [200]
DNS-Lookup [0.008] Time-Connect [0.010] Time-PreTransfer [0.022] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.013] Time-PreTransfer [0.023] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
DNS-Lookup [0.008] Time-Connect [0.011] Time-PreTransfer [0.021] Time-StartTransfer [0.034] Total-Time [0.035] Response-Code [200]
DNS-Lookup [0.007] Time-Connect [0.009] Time-PreTransfer [0.018] Time-StartTransfer [0.032] Total-Time [0.033] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.008] Time-PreTransfer [0.019] Time-StartTransfer [0.033] Total-Time [0.033] Response-Code [200]
DNS-Lookup [0.007] Time-Connect [0.010] Time-PreTransfer [0.020] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.008] Time-PreTransfer [0.017] Time-StartTransfer [0.029] Total-Time [0.029] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.007] Time-PreTransfer [0.016] Time-StartTransfer [0.028] Total-Time [0.028] Response-Code [200]
DNS-Lookup [0.006] Time-Connect [0.010] Time-PreTransfer [0.020] Time-StartTransfer [0.034] Total-Time [0.034] Response-Code [200]
Urik commented 5 years ago

Hmm, I just tried running your deployment, ran bash in the container, and ran the curl command, with no abnormal results at all.

DNS-Lookup [0.016] Time-Connect [0.021] Time-PreTransfer [0.034] Time-StartTransfer [0.046] Total-Time [0.046] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.015] Time-PreTransfer [0.028] Time-StartTransfer [0.042] Total-Time [0.042] Response-Code [200]
DNS-Lookup [0.014] Time-Connect [0.019] Time-PreTransfer [0.033] Time-StartTransfer [0.048] Total-Time [0.048] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.014] Time-PreTransfer [0.027] Time-StartTransfer [0.040] Total-Time [0.040] Response-Code [200]
DNS-Lookup [0.011] Time-Connect [0.016] Time-PreTransfer [0.029] Time-StartTransfer [0.046] Total-Time [0.046] Response-Code [200]
DNS-Lookup [0.011] Time-Connect [0.016] Time-PreTransfer [0.029] Time-StartTransfer [0.045] Total-Time [0.045] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.016] Time-PreTransfer [0.030] Time-StartTransfer [0.044] Total-Time [0.044] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.015] Time-PreTransfer [0.027] Time-StartTransfer [0.040] Total-Time [0.040] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.015] Time-PreTransfer [0.029] Time-StartTransfer [0.042] Total-Time [0.042] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.015] Time-PreTransfer [0.027] Time-StartTransfer [0.038] Total-Time [0.039] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.014] Time-PreTransfer [0.031] Time-StartTransfer [0.044] Total-Time [0.044] Response-Code [200]
DNS-Lookup [0.012] Time-Connect [0.017] Time-PreTransfer [0.030] Time-StartTransfer [0.044] Total-Time [0.044] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.015] Time-PreTransfer [0.031] Time-StartTransfer [0.044] Total-Time [0.044] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.015] Time-PreTransfer [0.027] Time-StartTransfer [0.039] Total-Time [0.039] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.015] Time-PreTransfer [0.029] Time-StartTransfer [0.042] Total-Time [0.042] Response-Code [200]
DNS-Lookup [0.010] Time-Connect [0.015] Time-PreTransfer [0.028] Time-StartTransfer [0.042] Total-Time [0.042] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.014] Time-PreTransfer [0.027] Time-StartTransfer [0.041] Total-Time [0.041] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.014] Time-PreTransfer [0.027] Time-StartTransfer [0.040] Total-Time [0.040] Response-Code [200]
DNS-Lookup [0.009] Time-Connect [0.014] Time-PreTransfer [0.027] Time-StartTransfer [0.042] Total-Time [0.042] Response-Code [200]
root@debian-7f47487679-c92vk:/#

Then I commented out the dnsConfig object, deleted and relaunched the deployment, and these are my results:

DNS-Lookup [0.009] Time-Connect [0.014] Time-PreTransfer [0.036] Time-StartTransfer [0.055] Total-Time [0.055] Response-Code [200]
DNS-Lookup [5.017] Time-Connect [5.022] Time-PreTransfer [5.037] Time-StartTransfer [5.050] Total-Time [5.050] Response-Code [200]
DNS-Lookup [0.007] Time-Connect [0.012] Time-PreTransfer [0.026] Time-StartTransfer [0.039] Total-Time [0.039] Response-Code [200]
DNS-Lookup [0.008] Time-Connect [0.013] Time-PreTransfer [0.027] Time-StartTransfer [0.039] Total-Time [0.039] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.010] Time-PreTransfer [0.029] Time-StartTransfer [0.041] Total-Time [0.041] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.010] Time-PreTransfer [0.023] Time-StartTransfer [0.037] Total-Time [0.037] Response-Code [200]
DNS-Lookup [5.015] Time-Connect [5.020] Time-PreTransfer [5.037] Time-StartTransfer [5.049] Total-Time [5.049] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.010] Time-PreTransfer [0.023] Time-StartTransfer [0.035] Total-Time [0.035] Response-Code [200]
DNS-Lookup [5.015] Time-Connect [5.020] Time-PreTransfer [5.033] Time-StartTransfer [5.048] Total-Time [5.048] Response-Code [200]
DNS-Lookup [5.016] Time-Connect [5.021] Time-PreTransfer [5.034] Time-StartTransfer [5.048] Total-Time [5.048] Response-Code [200]
DNS-Lookup [0.006] Time-Connect [0.011] Time-PreTransfer [0.028] Time-StartTransfer [0.043] Total-Time [0.043] Response-Code [200]
DNS-Lookup [5.018] Time-Connect [5.023] Time-PreTransfer [5.039] Time-StartTransfer [5.051] Total-Time [5.051] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.010] Time-PreTransfer [0.030] Time-StartTransfer [0.043] Total-Time [0.043] Response-Code [200]
DNS-Lookup [5.012] Time-Connect [5.017] Time-PreTransfer [5.030] Time-StartTransfer [5.045] Total-Time [5.045] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.010] Time-PreTransfer [0.023] Time-StartTransfer [0.037] Total-Time [0.037] Response-Code [200]
DNS-Lookup [0.005] Time-Connect [0.010] Time-PreTransfer [0.023] Time-StartTransfer [0.037] Total-Time [0.037] Response-Code [200]
DNS-Lookup [0.013] Time-Connect [0.018] Time-PreTransfer [0.032] Time-StartTransfer [0.046] Total-Time [0.046] Response-Code [200]
DNS-Lookup [0.006] Time-Connect [0.011] Time-PreTransfer [0.024] Time-StartTransfer [0.039] Total-Time [0.039] Response-Code [200]
DNS-Lookup [5.010] Time-Connect [5.016] Time-PreTransfer [5.030] Time-StartTransfer [5.043] Total-Time [5.043] Response-Code [200]

What region is your cluster in? I'll try executing it there.

Urik commented 5 years ago

Tried it in East US, same results: consistent 5-second lookup times without dnsConfig, no delays with dnsConfig set to single-request-reopen.

MaurGi commented 5 years ago

We just encountered this on debian:jessie; our symptom was a 10x performance degradation on our Apache Storm clusters. We worked around it with single-request-reopen in resolv.conf as described above.

jackfrancis commented 5 years ago

As all AKS regions now deliver clusters backed by VMs running the 4.15.0-1030-azure version of the Ubuntu kernel (which, in tl;dr terms, "fixes 5-second lookups for Alpine-based images"), could we get any reports from folks running 4.15.0-1030-azure? Do we still have issues on that kernel?

For folks running pre-existing clusters (the kernel patch was released just last week), an upgrade or scale operation will be required to get that kernel version on your nodes, FYI.
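
For anyone checking where their nodes stand: kubectl already exposes the kernel version, and the upgrade or scale can be driven from the Azure CLI. A rough sketch with placeholder names and versions (exact flags may vary by CLI version):

kubectl get nodes -o wide   # KERNEL-VERSION column should show 4.15.0-1030-azure on patched nodes
az aks upgrade --resource-group my-rg --name my-aks --kubernetes-version 1.11.5
# or scale out so new VMs come up with the patched image:
az aks scale --resource-group my-rg --name my-aks --node-count 4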

evillgenius75 commented 5 years ago

Can we remove the workaround that was provided earlier in this thread?

Our solution was adding the single-request-reopen flag to /etc/resolv.conf

joekohlsdorf commented 5 years ago

@jackfrancis Can you explain what was done in this image?

Only 1 of 3 possible races has been fixed in the kernel.

jackfrancis commented 5 years ago

@joekohlsdorf According to the author of the kernel fix, he has submitted 2 fixes, one of which has been accepted and is present in that kernel version; the other is still in review. See the "Kernel Fix" section here:

https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts

ysijason commented 5 years ago

Upgraded our AKS cluster from 1.11.3 to 1.11.4 and it seems to have solved the DNS problem.

shaikatz commented 5 years ago

@jackfrancis, the second fix was accepted; any way to get it into the next VHD?

jackfrancis commented 5 years ago

Hi @shaikatz, which component received the fix above?

graemechristie commented 5 years ago

This still looks to be an issue. This is running in an Alpine Linux pod (.NET Core runtime) on an AKS cluster running 1.14.6 that I just created yesterday.

Also occurs on the 1.14.3 cluster I created a month or so ago, but not on the 1.14.3 cluster I created a couple of weeks after it.

root@xxxx-5c9d8f946c-qjckv:/app# for i in {1..10}; do curl -k -o /dev/null -s -w "DNS-Lookup [%{time_namelookup}] Time-Connect [%{time_connect}] Time-PreTransfer [%{time_pretransfer}] Time-StartTransfer [%{time_starttransfer}] Total-Time [%{time_total}] Response-Code [%{http_code}]\n" https://www.microsoft.com; done
DNS-Lookup [5.514728] Time-Connect [5.553433] Time-PreTransfer [5.639539] Time-StartTransfer [5.900052] Total-Time [5.900158] Response-Code [200]
DNS-Lookup [0.012360] Time-Connect [0.048973] Time-PreTransfer [0.130598] Time-StartTransfer [0.172433] Total-Time [0.172515] Response-Code [200]
DNS-Lookup [10.519981] Time-Connect [10.556604] Time-PreTransfer [10.638080] Time-StartTransfer [10.680471] Total-Time [10.680590] Response-Code [200]
DNS-Lookup [0.012308] Time-Connect [0.051043] Time-PreTransfer [0.136957] Time-StartTransfer [0.182661] Total-Time [0.182709] Response-Code [200]
DNS-Lookup [0.012330] Time-Connect [0.053787] Time-PreTransfer [0.144233] Time-StartTransfer [0.190418] Total-Time [0.190478] Response-Code [200]
DNS-Lookup [0.012337] Time-Connect [0.051147] Time-PreTransfer [0.136368] Time-StartTransfer [0.180897] Total-Time [0.180985] Response-Code [200]
DNS-Lookup [0.012311] Time-Connect [0.051006] Time-PreTransfer [0.137882] Time-StartTransfer [0.182414] Total-Time [0.182584] Response-Code [200]
DNS-Lookup [0.012246] Time-Connect [0.053550] Time-PreTransfer [0.145398] Time-StartTransfer [0.192282] Total-Time [0.192329] Response-Code [200]
DNS-Lookup [0.012312] Time-Connect [0.048700] Time-PreTransfer [0.130747] Time-StartTransfer [0.172230] Total-Time [0.172287] Response-Code [200]
DNS-Lookup [0.012296] Time-Connect [0.051079] Time-PreTransfer [0.137897] Time-StartTransfer [0.181873] Total-Time [0.181924] Response-Code [200]
zuehlke-dea commented 5 years ago

It seems this was fixed and is present again. We hit the problem, researched it, and could not reproduce it; now it occurs again on our clusters with 4.15.0-1060-azure kernel nodes.

akhan4u commented 4 years ago

I am still facing the same issue. The DNS is throwing intermittent 403 errors.

# for i in {1..100}; do curl -k -o /dev/null -s -w "DNS-Lookup [%{time_namelookup}] Time-Connect [%{time_connect}] Time-PreTransfer [%{time_pretransfer}] Time-StartTransfer [%{time_starttransfer}] Total-Time [%{time_total}] Response-Code [%{http_code}]\n" https://www.microsoft.com; done
DNS-Lookup [0.004265] Time-Connect [0.009595] Time-PreTransfer [0.030827] Time-StartTransfer [0.042685] Total-Time [0.042716] Response-Code [403]
DNS-Lookup [0.004171] Time-Connect [0.009412] Time-PreTransfer [0.029765] Time-StartTransfer [0.041380] Total-Time [0.041429] Response-Code [403]
DNS-Lookup [0.004227] Time-Connect [0.009382] Time-PreTransfer [0.030158] Time-StartTransfer [0.041474] Total-Time [0.041520] Response-Code [403]
DNS-Lookup [0.004184] Time-Connect [0.009496] Time-PreTransfer [0.031301] Time-StartTransfer [0.044301] Total-Time [0.044332] Response-Code [403]
DNS-Lookup [0.004644] Time-Connect [0.009940] Time-PreTransfer [0.036492] Time-StartTransfer [0.063031] Total-Time [0.063060] Response-Code [403]
DNS-Lookup [0.004192] Time-Connect [0.009530] Time-PreTransfer [0.030280] Time-StartTransfer [0.042634] Total-Time [0.042704] Response-Code [403]
DNS-Lookup [0.004211] Time-Connect [0.009625] Time-PreTransfer [0.030305] Time-StartTransfer [0.041616] Total-Time [0.041708] Response-Code [403]
DNS-Lookup [0.004229] Time-Connect [0.009775] Time-PreTransfer [0.030433] Time-StartTransfer [0.043312] Total-Time [0.043340] Response-Code [403]
DNS-Lookup [0.004217] Time-Connect [0.009817] Time-PreTransfer [0.032401] Time-StartTransfer [0.045907] Total-Time [0.045938] Response-Code [403]
DNS-Lookup [0.004250] Time-Connect [0.009633] Time-PreTransfer [0.030357] Time-StartTransfer [0.043249] Total-Time [0.043294] Response-Code [403]
DNS-Lookup [0.004432] Time-Connect [0.009969] Time-PreTransfer [0.031895] Time-StartTransfer [0.043061] Total-Time [0.043091] Response-Code [403]
DNS-Lookup [0.004172] Time-Connect [0.009507] Time-PreTransfer [0.030370] Time-StartTransfer [0.042121] Total-Time [0.042150] Response-Code [403]
DNS-Lookup [0.004211] Time-Connect [0.009604] Time-PreTransfer [0.030194] Time-StartTransfer [0.045623] Total-Time [0.045664] Response-Code [403]
DNS-Lookup [0.004172] Time-Connect [0.009656] Time-PreTransfer [0.029532] Time-StartTransfer [0.041521] Total-Time [0.041548] Response-Code [403]
DNS-Lookup [0.004207] Time-Connect [0.009613] Time-PreTransfer [0.033558] Time-StartTransfer [0.050680] Total-Time [0.050757] Response-Code [403]
DNS-Lookup [0.004263] Time-Connect [0.009582] Time-PreTransfer [0.030867] Time-StartTransfer [0.043268] Total-Time [0.043296] Response-Code [403]
DNS-Lookup [0.004188] Time-Connect [0.009518] Time-PreTransfer [0.030763] Time-StartTransfer [0.043596] Total-Time [0.043623] Response-Code [403]
DNS-Lookup [0.004191] Time-Connect [0.009637] Time-PreTransfer [0.030120] Time-StartTransfer [0.041989] Total-Time [0.042026] Response-Code [403]
DNS-Lookup [0.004154] Time-Connect [0.009565] Time-PreTransfer [0.030270] Time-StartTransfer [0.043543] Total-Time [0.043572] Response-Code [403]
DNS-Lookup [0.004216] Time-Connect [0.009702] Time-PreTransfer [0.029827] Time-StartTransfer [0.042480] Total-Time [0.042509] Response-Code [403]
eivim commented 4 years ago

It seems this was fixed and is present again. We hit the problem, researched it, and could not reproduce it; now it occurs again on our clusters with 4.15.0-1060-azure kernel nodes.

Same observed with nodes running 4.15.0-1061-azure: intermittent 2.5-second DNS lookup delays. We use Alpine images almost exclusively, so we had to move our traffic to GKE for the time being. It works fine there.

zuehlke-dea commented 4 years ago

As moving our workloads to another cloud provider is not an option for us, we have added this workaround for now:

dnsConfig:
  options:
    - name: single-request-reopen

added in the spec for the deployment. See also https://github.com/Azure/AKS/issues/667

The DNS issue was a real problem for us, as some services had connection timeouts configured lower than 5 seconds, which caused a lot of trouble. With the fix it works fine now. We also hit the issue not only on Alpine-based images but also on Debian-based ones.

droessmj commented 4 years ago

Seeing similar issues, which we've worked around (temporarily, for only a few services) with a dnsPolicy: Default setting (see the sketch at the end of this comment). We'll also try the single-request-reopen option specified above.

We're running Debian-based .NET Core 3 images, and our impacted hosts are showing kernel version 4.15.0-1063-azure.

I should also note that organizationally we have 20+ clusters across multiple business units and environments, and only the two clusters that were built in the past week currently show the issue.

Unable to recreate it with the same image on clusters with nodes running the 4.15.0-1055-azure or 4.15.0-1060-azure kernel.
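
For reference, the dnsPolicy: Default workaround mentioned above makes the pod inherit the node's /etc/resolv.conf instead of using the kube-dns/CoreDNS service VIP, which avoids the DNAT path at the cost of cluster-internal service names no longer resolving. A minimal sketch (name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: node-dns-test
spec:
  dnsPolicy: Default
  containers:
  - name: shell
    image: debian:stretch-slim
    command: ["sleep", "infinity"]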

Vandersteen commented 4 years ago

I can replicate this issue on a cluster with kernel version 4.15.0-1060-azure.

droessmj commented 4 years ago

Circling back to close the loop on the occurrence I reported. Our root cause was an NSG rule that denied UDP traffic on the cluster subnet, which caused inter-node requests to time out.
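
If you suspect something similar, listing the rules on the NSG attached to the node subnet makes a UDP deny easy to spot; a rough sketch with placeholder names (AKS node resources usually live in the MC_* resource group):

az network nsg rule list \
  --resource-group MC_my-rg_my-aks_eastus \
  --nsg-name my-aks-nsg \
  --output table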

cpwood commented 4 years ago

Still seeing this in our AKS cluster running Kubernetes 1.15.7. Applying the dnsConfig change to the deployment works around the problem for us. This has caused quite a bit of end-user frustration at the slowness though, sadly.

Vandersteen commented 4 years ago

Where can we follow up on the status of these kernel changes? Is this fixed?

btastic commented 4 years ago

Has anyone tried 1.16.7 already? It is available on Azure. We are going to test it this evening (in around 2-3 hours).

jemag commented 4 years ago

@btastic I just retested today with 1.16.7 and so far wasn't able to replicate the issue, whereas it was definitely there a couple of versions ago.

btastic commented 4 years ago

Thanks for reporting back. @jemag do you have the dashboard issue described here? https://github.com/Azure/AKS/issues/1534

jemag commented 4 years ago

@btastic we do not use the dashboard, so I couldn't say.

Vandersteen commented 4 years ago

@btastic @jemag I just tested it on a cluster with k8s 1.16.7 (using CNI). The issue still exists.