awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

Nodes become unresponsive and don't recover, with soft lockup error #454

Closed: cshivashankar closed this issue 3 years ago

cshivashankar commented 4 years ago

Status Summary

This section was added by @mmerkes

The AWS EKS team has a consistent repro and has engaged with the AmazonLinux team to root cause the issue and fix it. AmazonLinux has merged a patch that solves this issue in our repro and should work for customers, and it is now available via yum update. Once you've updated your kernel and rebooted your instance, you should be running kernel.x86_64 0:4.14.203-156.332.amzn2 (or greater). All EKS optimized AMIs built from Packer version v20201112 or later include this patch. Users have two options for fixing this issue:

  1. Upgrade your nodes to use the latest EKS optimized AMI
  2. Patch your nodes with yum update

Here are the commands you need to patch your instances:

sudo yum update kernel
sudo reboot
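
After the reboot, a quick way to confirm you are on a patched kernel (the version below is the minimum mentioned above; anything newer is also fine):

uname -r
# should print 4.14.203-156.332.amzn2.x86_64 or a later version
rpm -q kernel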

Original Issue

This is the original content from @cshivashankar

What happened: A node in the cluster becomes unresponsive, and the pods running on it also become unresponsive. As per the analysis and logs provided in AWS Case 6940959821, this is observed when IOPS are high and a soft lockup happens, which causes the node to become unresponsive. Further investigation might be required.

What you expected to happen: The node should not crash or become unresponsive; if it does, the control plane should identify it and mark it as not ready. The state should be either that the node is ready and working properly, or that the node is unresponsive and not ready, in which case it should eventually be removed from the cluster.

How to reproduce it (as minimally and precisely as possible): As per the analysis in AWS case 6940959821, the issue could be reproduced by driving IOPS above the EBS volume's capacity for a sustained period of time.

Anything else we need to know?: This issue has only been observed recently, and I want to rule out whether it is due to using the 1.14 AMI, as we never observed this issue on 1.13. Is there a kernel bug that I am hitting? For building the AMI, I cloned the "amazon/aws-eks-ami" repo and made the following changes: ~~1. Installed the Zabbix agent. 2. Ran the kubelet with the "--allow-privileged=true" flag, as I was getting issues with cAdvisor. So the AMI being used is practically the same as the AWS EKS AMI.~~ Changes are listed in the following comment.

Logs can be accessed in the AWS case mentioned above.

Environment:

cshivashankar commented 4 years ago

Ever since I moved to the new kernel for 1.14, I have been facing a lot of issues with the nodes, and nodes are going down quite frequently. Sometimes a node becomes unresponsive yet stays in the Ready status, and sometimes it just flips to NotReady. Is there a kernel bug that I am hitting? If there is any customization that is not relevant to 1.14, I can make the changes and test. Any suggestions would be greatly welcome.

Following are the soft lockup messages seen:

kernel: ena 0000:00:08.0 eth3: Found a Tx that wasn't completed on time, qid 4, index 787.
kernel: watchdog: BUG: soft lockup - CPU#58 stuck for 23s! [kworker/u146:7:707658]
kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xfrm_user xfrm_algo br_netfilter bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables veth iptable_mangle xt_connmark nf_conntrack_netlink nfnetlink xt_statistic xt_recent ipt_REJECT nf_reject_ipv4 xt_addrtype xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment xt_mark iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter xt_conntrack nf_nat nf_conntrack overlay sunrpc crc32_pclmul ghash_clmulni_intel pcbc mousedev aesni_intel psmouse evdev aes_x86_64 crypto_simd glue_helper pcc_cpufreq button cryptd ena ip_tables x_tables xfs libcrc32c nvme crc32c_intel nvme_core ipv6 crc_ccitt autofs4
kernel: CPU: 58 PID: 707658 Comm: kworker/u146:7 Tainted: G             L  4.14.165-133.209.amzn2.x86_64 #1
kernel: Hardware name: Amazon EC2 c5.18xlarge/, BIOS 1.0 10/16/2017
kernel: Workqueue: writeback wb_workfn (flush-259:0)
kernel: task: ffff8893fefa0000 task.stack: ffffc9002daec000
kernel: RIP: 0010:__list_del_entry_valid+0x28/0x90
kernel: RSP: 0018:ffffc9002daefcc0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
kernel: RAX: ffff88916d19b470 RBX: ffffc9002daefce8 RCX: dead000000000200
kernel: RDX: ffff88804b1f36c8 RSI: ffff888013237e08 RDI: ffff888013237e08
kernel: RBP: ffff88916d19b470 R08: ffff889488d1eb48 R09: 0000000180400037
kernel: R10: ffffc9002daefe10 R11: 0000000000000000 R12: ffff88916d608800
kernel: R13: ffff888013237e08 R14: ffffc9002daefd78 R15: ffff889488d1eb48
kernel: FS:  0000000000000000(0000) GS:ffff88a371380000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 000010923314f000 CR3: 000000000200a002 CR4: 00000000007606e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: PKRU: 55555554
kernel: Call Trace:
kernel: move_expired_inodes+0x6a/0x230
kernel: queue_io+0x61/0xf0
kernel: wb_writeback+0x258/0x300
kernel: ? wb_workfn+0xdf/0x370
kernel: ? __local_bh_enable_ip+0x6c/0x70
kernel: wb_workfn+0xdf/0x370
kernel: ? __switch_to_asm+0x41/0x70
kernel: ? __switch_to_asm+0x35/0x70
kernel: process_one_work+0x17b/0x380
kernel: worker_thread+0x2e/0x390
kernel: ? process_one_work+0x380/0x380
kernel: kthread+0x11a/0x130
kernel: ? kthread_create_on_node+0x70/0x70
kernel: ret_from_fork+0x35/0x40

Ideally, with the change https://github.com/awslabs/amazon-eks-ami/pull/367, the kubelet should have plenty of resources, and if there are pods consuming a lot of resources, it should just evict them rather than the node crashing.

I dug into the customizations in detail; following are the customizations made to the AMI used in our environment:

Customizations :

  1. Docker daemon related

    1a. Docker environment file docker.env is used:

            # The max number of open files for the daemon itself, and all
            # running containers.  The default value of 1048576 mirrors the value
            # used by the systemd service unit.
            DAEMON_MAXFILES=1048576

            # Additional startup options for the Docker daemon, for example:
            # OPTIONS="--ip-forward=true --iptables=true"
            # By default we limit the number of open files per container
            # Options are moved to docker-daemon.json
            #OPTIONS="--default-ulimit nofile=262144:1048576 --max-concurrent-downloads=100 --max-concurrent-uploads=100 --iptables=false"

            # How many seconds the sysvinit script waits for the pidfile to appear
            # when starting the daemon.
            DAEMON_PIDFILE_TIMEOUT=10

1b. Docker daemon.json: default ulimits are changed and iptables is set to false

            {
              "bridge": "none",
              "log-driver": "json-file",
              "log-opts": {
                "max-size": "10m",
                "max-file": "10"
              },
              "live-restore": true,
              "iptables": false,
              "max-concurrent-downloads": 100,
              "max-concurrent-uploads": 100,
              "default-ulimits": {
                "nofile": {
                  "Name": "nofile",
                  "Hard": 1048576,
                  "Soft": 262144
                }
              }
            }
  2. System customizations:

2a. 20-nproc.conf: limit user processes to avoid fork bombs


            # Default limit for number of user's processes to prevent
            # accidental fork bombs.
            # See rhbz #432903 for reasoning.

            *          soft    nproc     3934939
            root       soft    nproc     unlimited

2b. sysctl.conf

            # Filesystem

            # Maximum number of allowable concurrent requests
            fs.aio-max-nr = 1048576

            # 1M max_user_watches takes extra 1Gb of kernel memory per real UID
            fs.inotify.max_user_watches = 1048576
            fs.inotify.max_user_instances = 2048

            # Kernel

            # Prevents Docker from crashing with:
            # > runtime/cgo: pthread_create failed: Resource temporarily unavailable
            kernel.pid_max = 1048576

            # Network

            # Max number of packets queued when the interface receives packets faster than
            # kernel can process them. Default is 1000
            net.core.netdev_max_backlog = 5000

            # Max rx/tx buffer sizes. Default 212k sounds too small for our 10g links, but
            # setting some very high numbers may have nasty consequences too
            net.core.rmem_max = 8388608
            net.core.wmem_max = 8388608

            # Threshold levels for ARP cache increased after neighbor table overflowed on DL1
            net.ipv4.neigh.default.gc_thresh1 = 4096
            net.ipv4.neigh.default.gc_thresh2 = 8192
            net.ipv4.neigh.default.gc_thresh3 = 8192

            # Number of memory-mapped areas per process used by elasticsearch and other
            # software that uses mmap
            vm.max_map_count = 262144

~~2c. systemd.conf

            [Manager]
            # Cgroup controllers here are joined to make system.slice appear in the common
            # cgroup hierarchy. This way kubelet is able to use "system.slice" as
            # --system-reserved to isolate system resources from pod resources
            JoinControllers=cpu,cpuacct,cpuset,net_cls,net_prio,hugetlb,memory~~
  3. Kubelet customizations

    Following is the kubelet command line as shown by ps.
    The eviction and system-reserved options might be removed, since they are set by default.

/usr/bin/kubelet --cloud-provider aws --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime docker --read-only-port=10255 --network-plugin cni --node-ip=10.77.94.123 --pod-infra-container-image=602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64:3.1 --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice --kube-api-qps=30 --kube-api-burst=50 --event-qps=30 --event-burst=50 --registry-qps=0 --enforce-node-allocatable=pods --eviction-hard=imagefs.available<15%,memory.available<5%,nodefs.available<10%,nodefs.inodesFree<5% --serialize-image-pulls=false --v=3 --pod-max-pids=8192 --node-labels=node-role.kubernetes.io/spot-worker --resolv-conf=/etc/resolv.kubelet

/usr/bin/kubelet --cloud-provider aws --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime docker --read-only-port=10255 --network-plugin cni --node-ip=10.77.94.123 --pod-infra-container-image=602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64:3.1 --eviction-hard=imagefs.available<15%,memory.available<5%,nodefs.available<10%,nodefs.inodesFree<5% --serialize-image-pulls=false --v=3 --pod-max-pids=8192 --resolv-conf=/etc/resolv.kubelet

resolv.kubelet is generated as follows: /usr/bin/grep -v "^search " /etc/resolv.conf > /etc/resolv.kubelet

EDIT: Edited this post after removing some customizations.

edubacco commented 4 years ago

same here, subscribe

colandre commented 4 years ago

We had the same problem upgrading our worker nodes to Amazon EKS 1.15 AMI. We tried:

and both had the same problem.

We have pods with initContainers copying about 1 GB of small files (a WordPress install), and during the copy, in the Init phase, the worker nodes hang and become completely unresponsive.

Syslog on the worker node reports:

[  288.053638] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [kworker/u16:2:62]
[  288.059141] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache veth iptable_mangle xt_connmark nf_conntrack_netlink nfnetlink xt_statistic xt_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nf_defrag_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment xt_mark iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay sunrpc crc32_pclmul ghash_clmulni_intel pcbc mousedev aesni_intel aes_x86_64 crypto_simd evdev glue_helper psmouse cryptd button ena ip_tables x_tables xfs libcrc32c nvme crc32c_intel nvme_core ipv6 crc_ccitt autofs4
[  288.185290] CPU: 5 PID: 62 Comm: kworker/u16:2 Tainted: G             L  4.14.177-139.253.amzn2.x86_64 #1
[  288.191527] Hardware name: Amazon EC2 m5.2xlarge/, BIOS 1.0 10/16/2017
[  288.195344] Workqueue: writeback wb_workfn (flush-259:0)
[  288.198708] task: ffff888184670000 task.stack: ffffc90003360000
[  288.202280] RIP: 0010:move_expired_inodes+0xff/0x230
[  288.205542] RSP: 0018:ffffc90003363cc8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff10
[  288.211042] RAX: 00000000ffffa056 RBX: ffffc90003363ce8 RCX: dead000000000200
[  288.215031] RDX: 0000000000000000 RSI: ffffc90003363ce8 RDI: ffff8887068963c8
[  288.219040] RBP: ffff888802273c70 R08: ffff888706896008 R09: 0000000100400010
[  288.223047] R10: ffffc90003363e10 R11: 0000000000025400 R12: ffff8888227f6800
[  288.227062] R13: ffff888706896788 R14: ffffc90003363d78 R15: ffff888706896008
[  288.231071] FS:  0000000000000000(0000) GS:ffff888822740000(0000) knlGS:0000000000000000
[  288.236761] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  288.240282] CR2: 00007f5b703af570 CR3: 000000000200a005 CR4: 00000000007606e0
[  288.244306] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  288.248328] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  288.252351] PKRU: 55555554
[  288.254821] Call Trace:
[  288.257178]  queue_io+0x61/0xf0
[  288.259798]  wb_writeback+0x258/0x300
[  288.262600]  ? wb_workfn+0xdf/0x370
[  288.265323]  ? __local_bh_enable_ip+0x6c/0x70
[  288.268370]  wb_workfn+0xdf/0x370
[  288.271040]  ? __switch_to_asm+0x41/0x70
[  288.273944]  ? __switch_to_asm+0x35/0x70
[  288.276845]  process_one_work+0x17b/0x380
[  288.279728]  worker_thread+0x2e/0x390
[  288.282509]  ? process_one_work+0x380/0x380
[  288.285482]  kthread+0x11a/0x130
[  288.288134]  ? kthread_create_on_node+0x70/0x70
[  288.291201]  ret_from_fork+0x35/0x40
[  288.293964] Code: b9 01 00 00 00 0f 44 4c 24 04 89 4c 24 04 49 89 c4 48 8b 45 00 48 39 c5 74 1a 4d 85 f6 4c 8b 6d 08 0f 84 67 ff ff ff 49 8b 45 e0 <49> 39 06 0f 89 5a ff ff ff 8b 44 24 04 85 c0 75 51 48 8b 44 24 
[  293.673669] ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 219.
[  293.679790] ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 220.

As a workaround, we took the official Ubuntu EKS 1.15 image ami-0f54d80ab3d460266, added the nfs-common package to it (to manage EFS), and rebuilt a new custom AMI from it.

Note: Using kiam, we had to change the CA certificate path, because the location in Ubuntu is different from the one in the Amazon Linux image.
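
For reference, adding nfs-common to the Ubuntu base image is a standard package install; a minimal sketch of that step (the exact provisioning mechanism, e.g. a Packer shell provisioner or userdata, will vary):

# on the Ubuntu EKS base image, during AMI baking
sudo apt-get update
sudo apt-get install -y nfs-common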

cshivashankar commented 4 years ago

@colandre Do you use the stock AMI recommended by AWS, or do you bake your own AMI? If so, what customizations are you using?

It looks like the issue might be arising due to high resource utilization. It's still not clear to me why this is happening even when the kubelet and all components like the Docker daemon have resource restrictions set. How can they overload the system and cause the kernel to panic? Maybe someone from Amazon can give a clearer picture.

colandre commented 4 years ago

@cshivashankar , I used stock Amazon AMIs:

no customizations. I also tried forcing kubeReserved and systemReserved in the eksctl configuration to higher limits than the ones auto-computed by eksctl (0.19 and above), with no luck. The suggested reservations for my instance type are 300m for the CPU and 300Mi for the memory. I tried 1 core and 1 GiB. E.g.

...
kubeletExtraConfig:
        kubeReserved:
            cpu: "1024m"
            memory: "1Gi"
            ephemeral-storage: "1Gi"
        kubeReservedCgroup: "/kube-reserved"
        systemReserved:
            cpu: "1024m"
            memory: "1Gi"
            ephemeral-storage: "1Gi"
        evictionHard:
            memory.available:  "200Mi"
            nodefs.available: "10%"
...
cshivashankar commented 4 years ago

@colandre Did you reach out to AWS support? Any input from them? I guess this thread has little visibility as it was created only a few days back. As per AWS, there should not be any issues if the stock AMI is being used. It's definitely worth following up with AWS. I am also curious about the solution to this problem.

colandre commented 4 years ago

@cshivashankar, unfortunately we do not have the Business support plan, so technical support cases cannot be opened.

cshivashankar commented 4 years ago

@colandre I think the only way to get attention on this thread is to tag a collaborator or contributor of this repo :)

eeeschwartz commented 4 years ago

We're cautiously optimistic that upgrading our CNI plugin to 1.6.1 has resolved the issue for us. HTH

jae-63 commented 4 years ago

I work with @eeeschwartz. We no longer think that the CNI upgrade fixed this for us. What did fix it was switching to instances with local SSD drives, such as r5d.2xlarge, and using that SSD drive for /var/lib/docker as suggested at https://github.com/awslabs/amazon-eks-ami/issues/349

Our actual experience with managing that filesystem was a bit different than suggested in that link. We found that dockerd needs to be stopped/started as part of the userdata script, even though we placed our code fragment in the "pre-userdata", i.e. as the very first steps. The following appears to be working for us:

# if an instance-store NVMe disk is present, format it and move /var/lib/docker onto it,
# restarting dockerd around the mount
if ( lsblk | fgrep -q nvme1n1 ); then
   mkfs.ext4 /dev/nvme1n1
   systemctl stop docker
   mkdir -p /var/lib/docker
   mount /dev/nvme1n1 /var/lib/docker
   chmod 711 /var/lib/docker
   systemctl start docker
fi
cshivashankar commented 4 years ago

@jae-63 Moving the Docker daemon to NVMe should ideally provide better performance and I/O. Based on your experience of the issue, did you see any correlation between the issue and the IOPS of the volumes? Did you face any issues with EBS volumes with very high IOPS? I'm trying to understand why the soft lockup issue is arising; ideally Docker running on an EBS volume should work well.

jae-63 commented 4 years ago

@cshivashankar for cost reasons [or our perceptions of the costs], we didn't pursue IOPS-provisioned EBS volumes, so I don't have any data to provide. I.e. we were previously just using vanilla 'gp2' drives.

I think that one could use IOPS-provisioned (i.e. "io1") drives instead of following our present solution. That would be a bit simpler.

Ephemeral SSDs paired with ephemeral instances, as in our use case, are a nice match. It seems likely that our (new) SSD drives are overprovisioned with respect to size, in light of the available CPU/RAM/SSD size ratios. Other users might use more or less disk than us.

BTW we down-sized the root EBS partition to 20GB but even 10GB would probably be sufficient.

cshivashankar commented 4 years ago

Hi @jae-63, thanks for the info. Are there any easy ways/instructions for replicating this issue?

Hi @Jeffwan, do you see any potential bug or limitation that could be causing these soft lockup issues? It looks like more people are facing them, and some are hitting them on the plain stock AMI recommended by AWS. It would be great to know your opinion on what might be causing these issues.

jae-63 commented 4 years ago

@cshivashankar I don't see an easy way to reproduce this. The two affected clusters are mostly used for code building and deployment (primarily the former) using self-hosted Gitlab. As you may know, in ".gitlab-ci.yml" files, one invokes lots of different containers for a brief period of time, e.g. corresponding to building a Docker container. We recently started using the caching capabilities of the 'kaniko-project' https://cloud.google.com/cloud-build/docs/kaniko-cache for many of our builds. I wonder whether that caching could place a heavy disk I/O load on EKS.

I'm pretty sure these are stock AMIs we're using. One cluster is still running EKS 1.15 and the other is EKS 1.16. They are 'ami-065418523a44331e5' and 'ami-0809659d79ce80260' respectively, in us-west-2.

We did open an AWS ticket for this lockup issue a few weeks ago, but not much was accomplished on it, and it's now closed. I don't know whether it's considered OK to post an AWS ticket number here, or perhaps there's a way for us to communicate privately. Perhaps you'd find some clues there.

In principle I'd be willing to revert to using regular 'r5' instances for a few hours (AFTER HOURS) rather than the 'r5d' ones we're currently using. That should reproduce the problem although the load is likely to be much lighter than during the daytime, when our developers are actively using Gitlab. In our experience we continued to observe these lockups after hours. But you'd need to tell me in advance, in detail, what diagnostics to try to collect.

cshivashankar commented 4 years ago

Hi @jae-63, my initial analysis also pointed to high IOPS. I did try to simulate the error by subjecting a node to high CPU and IOPS, but somehow the soft lockup issue doesn't get reproduced in the dev environment, so it's becoming quite tricky to understand what's causing it. Earlier I was using the following QPS flags with the kubelet for a faster rate of communication with the control plane: --kube-api-qps=30 --kube-api-burst=50 --event-qps=30 --event-burst=50 --registry-qps=0. After removing these flags, the issues were reduced a lot but not completely eliminated. We too use Jenkins, so there might be some correlation with those activities.

I had opened an AWS ticket and am working actively with an engineer, but the lack of reproduction steps has become a big hurdle. I am not from Amazon, so I don't know if it's OK to share the ticket number; I can check and confirm.

The main diagnostics that can help are a kernel crash dump, the logs collected by the log collector script, and system metrics, maybe using SAR or similar tools.
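
For the system-metrics part, a minimal sketch using sysstat (assuming the sysstat package is available from the distro repositories):

# install sysstat, then sample CPU and per-device I/O every 10 seconds, 30 times
sudo yum install -y sysstat
sar -u 10 30
iostat -xd 10 30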

jae-63 commented 4 years ago

Hi @cshivashankar . Sorry I had misunderstood ... I thought you were AWS staff and were seeking customer help to resolve this for the larger community. In light of the fact that you, like me, don't work for AWS I'm going to excuse myself from further activity here, because I don't think I have much to add beyond the workaround I have already provided.

otterley commented 4 years ago

Can those of you experiencing this issue please try the following on your nodes, reboot, and see if you're still experiencing the issue?

sudo amazon-linux-extras install kernel-ng
sudo reboot
otterley commented 4 years ago

Alternatively, for those of you experiencing this issue, if you switch to c4/r4/m4 instances instead of c5/r5/m5, does it make a difference?

cshivashankar commented 4 years ago

> Can those of you experiencing this issue please try the following on your nodes, reboot, and see if you're still experiencing the issue?

Can you please provide more details on how this solves the issue? I am OK to try this if it addresses any known issue or bug.

otterley commented 4 years ago

@cshivashankar There have been a number of cgroup-related bugs fixed in Kernel 5.3+. This is the Linux kernel that comes with the kernel-ng package, which is much later than the default kernel that comes with the EKS-optimized AMI.

cshivashankar commented 4 years ago

Hi @otterley, thanks for your feedback. I will definitely give it a try and report back. Is there any public list of bug fixes, to get an idea of whether any specific fix addresses the above issue? I am curious to know what exactly is causing this issue. Any insights on this?

otterley commented 4 years ago

Unfortunately I'm not aware of any identifiable bug fixes that directly relate to this issue. But with your assistance we can hopefully narrow this down to a root cause.

cshivashankar commented 4 years ago

Hi @otterley, I will be glad to help. Let me try to test these changes and report back. Due to the nature of the issue and the difficulty in reproducing it, it might take some time to say whether it really solved the issue or not. Meanwhile, is there any way I can check the source code of the kernel and try to debug, other than using yumdownloader for the source (https://aws.amazon.com/amazon-linux-2/faqs/)? I assume you are from the AWS kernel team.

otterley commented 4 years ago

I'm actually a Containers Specialist Solutions Architect and am not on the kernel team. I'm just volunteering my time to help out customers like you as best I can. :)

Also, it would be good to know if this issue is reproducible on 4th generation instances.

cshivashankar commented 4 years ago

Thank you, that's great to hear :). I am open to trying out 4th-gen instances in a dev environment. The biggest problem for me has been reproducing the issue. Do you have any ideas for reproducing it, so that it would be much easier for me to try out any changes? Till now I have tried everything from hammering CPU, memory, IOPS, etc.; nothing worked :(.

otterley commented 4 years ago

Is the issue still occurring randomly, but you're unable to induce it yourself? Or has the issue resolved itself spontaneously?

cshivashankar commented 4 years ago

I am still getting issues, but randomly and less frequently. All of a sudden a node will be gone with a soft lockup in the prod environment. Earlier I was using the following kubelet flags related to QPS and memory: --kube-api-qps=30 --kube-api-burst=50 --event-qps=30 --event-burst=50 --registry-qps=0 --system-reserved=cpu=1,memory=1Gi,ephemeral-storage=5Gi, which was causing nodes to go down once every few hours after the 1.14 upgrade (we didn't face any issues in 1.13). Removing the above flags definitely reduced a lot of those incidents but hasn't eliminated them completely. I am unable to induce the error myself to debug the issue.

deniss-rostkovskis commented 4 years ago

Hey!

We are running AWS EKS in production on c5.4xlarge nodes with 200 GB gp2 EBS disks and are experiencing what I think is the same issue. Here is what happens for us and how we solved it.

When there is a new application deployment, some EKS nodes start to freeze and EC2 marks the instances unhealthy due to failed health checks. We were able to catch one such node, stop it, remount its disks on a healthy instance, and do log analysis:

Jul 16 17:25:36 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 4, index 1.
[...]
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 4, index 1022.
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 4, index 1023.
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: The number of lost tx completions is above the threshold (387 > 128). Reset the device
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: Trigger reset is on
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: tx_timeout: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: suspend: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: resume: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: wd_expired: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: interface_up: 1
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: interface_down: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: admin_q_pause: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: rx_drops: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: tx_drops: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_tx_cnt: 1167208
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_tx_bytes: 2634729386
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_tx_queue_stop: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_tx_queue_wakeup: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_tx_dma_mapping_err: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_tx_linearize: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_tx_linearize_failed: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_tx_napi_comp: 1467519
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_tx_tx_poll: 1467699
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_tx_doorbells: 830768
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_tx_prepare_ctx_err: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_tx_bad_req_id: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_tx_llq_buffer_copy: 612143
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_tx_missed_tx: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_tx_unmask_interrupt: 1467518
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_rx_cnt: 1923011
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_rx_bytes: 4927238455
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_rx_rx_copybreak_pkt: 440739
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_rx_csum_good: 1828493
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_rx_refil_partial: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_rx_bad_csum: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_rx_page_alloc_fail: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_rx_skb_alloc_fail: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_rx_dma_mapping_err: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_rx_bad_desc_num: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_rx_bad_req_id: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_rx_empty_rx_ring: 0
Jul 16 17:25:38 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_0_rx_csum_unchecked: 86822
[...]
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_tx_cnt: 1582933
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_tx_bytes: 3127738301
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_tx_queue_stop: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_tx_queue_wakeup: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_tx_dma_mapping_err: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_tx_linearize: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_tx_linearize_failed: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_tx_napi_comp: 1957997
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_tx_tx_poll: 1958067
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_tx_doorbells: 954028
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_tx_prepare_ctx_err: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_tx_bad_req_id: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_tx_llq_buffer_copy: 976309
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_tx_missed_tx: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_tx_unmask_interrupt: 1957996
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_rx_cnt: 3087168
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_rx_bytes: 7305947420
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_rx_rx_copybreak_pkt: 437836
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_rx_csum_good: 3087168
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_rx_refil_partial: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_rx_bad_csum: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_rx_page_alloc_fail: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_rx_skb_alloc_fail: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_rx_dma_mapping_err: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_rx_bad_desc_num: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_rx_bad_req_id: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_rx_empty_rx_ring: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: queue_7_rx_csum_unchecked: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: ena_admin_q_aborted_cmd: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: ena_admin_q_submitted_cmd: 48
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: ena_admin_q_completed_cmd: 48
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: ena_admin_q_out_of_space: 0
Jul 16 17:25:39 ip-10-157-90-100 kernel: ena 0000:00:05.0 eth0: ena_admin_q_no_completion: 0
Jul 16 17:25:54 ip-10-157-90-100 dhclient[3773]: XMT: Solicit on eth0, interval 131300ms.
Jul 16 17:25:55 ip-10-157-90-100 kernel: watchdog: BUG: soft lockup - CPU#8 stuck for 23s! [kworker/u32:0:22153]
Jul 16 17:25:55 ip-10-157-90-100 kernel: Modules linked in: ext4 crc16 mbcache jbd2 fscrypto ip6table_nat nf_conntrack_ipv6 nf_nat_ipv6 ip6_tables rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ip_set veth iptable_mangle xt_connmark xt_statistic xt_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nf_defrag_ipv6 xt_comment xt_mark xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack br_netfilter bridge stp llc overlay sunrpc crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper evdev mousedev psmouse cryptd button ena ip_tables x_tables
Jul 16 17:25:55 ip-10-157-90-100 kernel:  xfs libcrc32c nvme crc32c_intel nvme_core ipv6 crc_ccitt autofs4
Jul 16 17:25:55 ip-10-157-90-100 kernel: CPU: 8 PID: 22153 Comm: kworker/u32:0 Tainted: G             L  4.14.181-140.257.amzn2.x86_64 #1
Jul 16 17:25:55 ip-10-157-90-100 kernel: Hardware name: Amazon EC2 c5.4xlarge/, BIOS 1.0 10/16/2017
Jul 16 17:25:55 ip-10-157-90-100 kernel: Workqueue: writeback wb_workfn (flush-259:0)
Jul 16 17:25:55 ip-10-157-90-100 kernel: task: ffff8885b8afc000 task.stack: ffffc9000c140000
Jul 16 17:25:55 ip-10-157-90-100 kernel: RIP: 0010:__list_del_entry_valid+0x28/0x90
Jul 16 17:25:55 ip-10-157-90-100 kernel: RSP: 0018:ffffc9000c143ce8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
Jul 16 17:25:55 ip-10-157-90-100 kernel: RAX: ffff8887eb638880 RBX: ffff8887eb638870 RCX: dead000000000200
Jul 16 17:25:55 ip-10-157-90-100 kernel: RDX: ffff88807728a7c8 RSI: ffff88807728ab88 RDI: ffff88807728ab88
Jul 16 17:25:55 ip-10-157-90-100 kernel: RBP: ffff88807728aa98 R08: ffff888077289148 R09: 000000018040003d
Jul 16 17:25:55 ip-10-157-90-100 kernel: R10: ffffc9000c143e10 R11: ffff88820eb37e00 R12: 0000000000000000
Jul 16 17:25:55 ip-10-157-90-100 kernel: R13: ffff8887eb638858 R14: ffff88807728ab88 R15: ffff88807728ab88
Jul 16 17:25:55 ip-10-157-90-100 kernel: FS:  0000000000000000(0000) GS:ffff8887ed400000(0000) knlGS:0000000000000000
Jul 16 17:25:55 ip-10-157-90-100 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 16 17:25:55 ip-10-157-90-100 kernel: CR2: 000000c00022c000 CR3: 000000000200a005 CR4: 00000000007606e0
Jul 16 17:25:55 ip-10-157-90-100 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 16 17:25:55 ip-10-157-90-100 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jul 16 17:25:55 ip-10-157-90-100 kernel: PKRU: 55555554
Jul 16 17:25:55 ip-10-157-90-100 kernel: Call Trace:
Jul 16 17:25:55 ip-10-157-90-100 kernel:  inode_io_list_move_locked+0x30/0xc0
Jul 16 17:25:55 ip-10-157-90-100 kernel:  __writeback_inodes_wb+0xa2/0xb0
Jul 16 17:25:55 ip-10-157-90-100 kernel:  wb_writeback+0x271/0x300
Jul 16 17:25:55 ip-10-157-90-100 kernel:  ? wb_workfn+0xdf/0x370
Jul 16 17:25:55 ip-10-157-90-100 kernel:  ? __local_bh_enable_ip+0x6c/0x70
Jul 16 17:25:55 ip-10-157-90-100 kernel:  wb_workfn+0xdf/0x370
Jul 16 17:25:55 ip-10-157-90-100 kernel:  ? try_to_wake_up+0x54/0x490
Jul 16 17:25:55 ip-10-157-90-100 kernel:  process_one_work+0x17b/0x380
Jul 16 17:25:55 ip-10-157-90-100 kernel:  worker_thread+0x2e/0x390
Jul 16 17:25:55 ip-10-157-90-100 kernel:  ? process_one_work+0x380/0x380
Jul 16 17:25:55 ip-10-157-90-100 kernel:  kthread+0x11a/0x130
Jul 16 17:25:55 ip-10-157-90-100 kernel:  ? kthread_create_on_node+0x70/0x70
Jul 16 17:25:55 ip-10-157-90-100 kernel:  ret_from_fork+0x35/0x40
Jul 16 17:25:55 ip-10-157-90-100 kernel: Code: 00 00 00 48 8b 07 48 b9 00 01 00 00 00 00 ad de 48 8b 57 08 48 39 c8 74 26 48 b9 00 02 00 00 00 00 ad de 48 39 ca 74 2b 48 8b 32 <48> 39 fe 75 34 48 8b 50 08 48 39 f2 75 3f b8 01 00 00 00 c3 48

At that point, the instance loses its network connection; as a result, EKS can't communicate with it and the EC2 health check fails.
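
For anyone who wants to repeat that kind of post-mortem (stop the frozen node, move its root volume to a healthy helper instance, read the logs offline), a rough sequence looks like the following; all IDs and device names are placeholders, and the partition/device naming depends on the instance type:

# stop the frozen node and move its root volume to a helper instance
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 detach-volume --volume-id vol-0123456789abcdef0
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
  --instance-id i-0fedcba9876543210 --device /dev/sdf

# on the helper instance: mount the volume and inspect the logs
sudo mkdir -p /mnt/debug
sudo mount /dev/nvme1n1p1 /mnt/debug    # adjust the device/partition name as needed
sudo less /mnt/debug/var/log/messages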

I found similar issues: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791 https://github.com/amzn/amzn-drivers/issues/27

As @cshivashankar mentioned, the issue can be reproduced by driving IOPS high - this is exactly what is happening during our deployment process.

How we solved it: we switched from c5.4xlarge to c5d.4xlarge nodes - nodes with a local NVMe disk - and moved the docker and kubelet directories to that disk:


mkfs.xfs /dev/nvme1n1
mkdir -p /mnt/instance-store
mount -t xfs -o noatime,inode64,noquota /dev/nvme1n1 /mnt/instance-store/

mkdir -p /mnt/instance-store/kubelet
mkdir -p /mnt/instance-store/docker

systemctl stop kubelet && systemctl stop docker

rsync -ahW /var/lib/kubelet/ /mnt/instance-store/kubelet/
rsync -ahW /var/lib/docker/ /mnt/instance-store/docker/

mount -o bind /mnt/instance-store/docker/ /var/lib/docker/
mount -o bind /mnt/instance-store/kubelet/ /var/lib/kubelet/

systemctl start kubelet && systemctl start docker

UPDATED: do not use the solution above ^^^ - it is incomplete and could lead to errors; the updated version is published a few comments below.

During deployment, we see IOPS spikes of up to 300,000 ops/s. When we were using 200 GB gp2 disks, IOPS was limited to 3,000 ops/s.
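
The 3,000 figure is just standard gp2 sizing; a rough back-of-the-envelope comparison (the gp2 numbers are the documented EBS behaviour, not something measured here):

# gp2 baseline = 3 IOPS per GiB (minimum 100), burstable up to 3,000 IOPS
# 200 GiB * 3 IOPS/GiB = 600 baseline IOPS, with a burst ceiling of 3,000 IOPS
# the ~300,000 ops/s spikes seen on instance-store NVMe are ~100x that ceiling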

This doesn't explain the root cause, but it confirms the issue is related to high IOPS and network problems. After that switch, we stopped experiencing the node freeze problem.

Regards, Deniss Rostkovskis

cshivashankar commented 4 years ago

Hi @deniss-rostkovskis, thanks for the info. Are you able to reproduce the issue consistently? I tried a lot to reproduce it in the dev environment but wasn't successful; bombarding the disk with high I/O activity was one of many such attempts. Even the Amazon support team was not able to reproduce it. Any information in this regard will be helpful.

Ideally, with high IOPS, there should be I/O wait rather than a soft lockup, which might indicate a bug in the kernel. I had raised another bug for ENA, and the Amazon team confirmed that it was not an issue with ENA.

deniss-rostkovskis commented 4 years ago

@cshivashankar, yes, we can reproduce it consistently, but we can't trace it back to the exact command causing the issue. Most probably it is happening during the file copy process - we copy ~1.5 GB of ~130k small files (1-10 KB).
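
For anyone trying to reproduce it, a hypothetical way to mimic that copy pattern (the /data paths are placeholders on the gp2-backed filesystem; this mirrors the ~130k small files described above, with no guarantee it triggers the lockup):

# generate ~130k files of roughly 1-10 KB, then copy them in bulk on the same volume
mkdir -p /data/seed
for i in $(seq 1 130000); do
  dd if=/dev/zero of=/data/seed/f$i bs=1K count=$(( (i % 10) + 1 )) status=none
done
cp -a /data/seed /data/copy-test   # the bulk copy is where the writeback load spikes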

We were communicating with AWS support for months without any radical progress in solving this. As the issue was resolved with the switch to the instance store NVMe disks, we are no longer actively debugging it, at least for now.

Deniss

jae-63 commented 4 years ago

@deniss-rostkovskis our simpler solution (from May 28th, above) is mostly working well for us, but Kubernetes Garbage Collection https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/ doesn't clean up stale images on /var/lib/docker, which for us is the only directory on the SSD drive. I wonder whether your more complex solution fixes that, by making Kubelet co-located on the SSD drive. Have you observed disks filling up? Are you able to see logs of successful Garbage Collection of stale docker images? Thanks ...

deniss-rostkovskis commented 4 years ago

Hey!

A small update - we have been running the solution in production for several weeks now and the node freeze problem has not appeared.

We made several improvements and learned from mistakes:

We are using the Terraform AWS EKS module to create the cluster. Here is what our final solution looks like now:

lib/cluster-custom-userdata.tpl:

Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii"

cloud-init-per once installreqs yum -y install lvm2

cloud-init-per once pvcreate pvcreate /dev/nvme1n1
cloud-init-per once vgcreate vgcreate eks /dev/nvme1n1

cloud-init-per once lvdocker lvcreate --name docker --size 200GB eks
cloud-init-per once lvkubelet lvcreate --name kubelet -l 100%FREE eks

cloud-init-per once format_docker mkfs.xfs /dev/eks/docker
cloud-init-per once format_kubelet mkfs.xfs /dev/eks/kubelet

cloud-init-per once create_pods_dir mkdir -p /var/lib/kubelet/pods

cloud-init-per once mount_docker mount -t xfs -o noatime,inode64,noquota /dev/eks/docker /var/lib/docker
cloud-init-per once mount_kubelet mount -t xfs -o noatime,inode64,noquota /dev/eks/kubelet /var/lib/kubelet/pods

--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash -xe

# Allow user supplied pre userdata code
${pre_userdata}

# Bootstrap and join the cluster
/etc/eks/bootstrap.sh --b64-cluster-ca '${cluster_auth_base64}' --apiserver-endpoint '${endpoint}' ${bootstrap_extra_args} --kubelet-extra-args "${kubelet_extra_args}" '${cluster_name}'

# Allow user supplied userdata code
${additional_userdata}

--==BOUNDARY==--

cluster.tf:

module "cluster" {
  source  = "terraform-aws-modules/eks/aws"
  version = "12.2.0"

  cluster_version = "1.17"

  [...]

  worker_groups = [
    {
      userdata_template_file = file("${path.module}/lib/cluster-custom-userdata.tpl")

      [...]
    }
  ]
}

Huge kudos to AWS engineer @raonitimo, who was assisting us with this issue; working with him was a pleasure and by far the most enjoyable experience with AWS support.

Regards, Deniss Rostkovskis

cshivashankar commented 4 years ago

Hi @deniss-rostkovskis, glad to hear your issue is resolved. I assume this approach only works on nodes that have local SSD volumes, for example c5d.18xlarge. Is there any solution that can work on any type of node, even without SSD volumes?

Any information on the actual RCA for this issue? Is it a known bug in the kernel?

Regards, Chetan

deniss-rostkovskis commented 4 years ago

@cshivashankar, I would love to have answers to the questions you are asking, but I don't have them.

AWS support forwarded this information to the EKS Service Team and assured me that they are currently investigating how to fix this in the AMIs. Hopefully, we will see a fix in the upcoming releases.

Deniss

cshivashankar commented 4 years ago

Thanks @deniss-rostkovskis for the information. Keeping fingers crossed and hopefully it will be resolved soon.

JacobHenner commented 4 years ago

I appear to be encountering the same issue on multiple m5.2xlarge hosts. I've opened support ticket 7271520151 which contains the dmesg log. The nodes affected are all running v1.16.13-eks-2ba888

jae-63 commented 4 years ago

@deniss-rostkovskis are you able to ascertain whether Kubernetes Garbage Collection runs in your environments? It seems in principle that you are intervening early enough in the lifetime of Docker and k8s that it might. I would adopt your approach (replacing my own, described above on May 28th) if k8s Garbage Collection is working for you.

deniss-rostkovskis commented 4 years ago

@jae-63, I'm not sure how exactly I can 100% verify that Kubernetes Garbage Collection is working, but I can't see any GC-related errors in kubectl get events, and kubectl describe node reports "kubelet has no disk pressure" for all nodes.
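
One additional low-effort check, assuming shell access to a node: the kubelet logs the outcome of image garbage collection, and disk usage under /var/lib/docker shows whether pruning is keeping up:

# on the node
journalctl -u kubelet --since "1 day ago" | grep -i "image garbage collection"
df -h /var/lib/docker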

brettplarson commented 4 years ago

Hello, we seem to be experiencing the same issue.

We have a few pods that use init containers to do an initial git clone of a large git repo to an emptyDir volume.

We are seeing that this causes the node to become completely unresponsive and lock up about 95% of the time.

We are using r5.4xlarge nodes with 1TB gp2 volumes.

Here is the trace from the logs:

Sep 08 16:58:56 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 16:58:56.701798   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 16:59:06 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 16:59:06.725821   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 16:59:16 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 16:59:16.731322   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 16:59:21 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 16:59:21.544749   19173 kubelet.go:1309] Image garbage collection succeeded
Sep 08 16:59:21 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 16:59:21.812250   19173 container_manager_linux.go:457] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Sep 08 16:59:26 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 16:59:26.736244   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 16:59:34 ip-10-XXX-XXX-XXX.aws- dhclient[3436]: XMT: Solicit on eth0, interval 124200ms.
Sep 08 16:59:36 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 16:59:36.741129   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 16:59:46 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 16:59:46.746115   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 16:59:56 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 16:59:56.751128   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:00:06 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:00:06.764650   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:00:16 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:00:16.769794   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:00:26 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:00:26.774872   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:00:36 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:00:36.779731   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:00:46 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:00:46.784730   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:00:53 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:00:53.862501   19173 reconciler.go:156] operationExecutor.RegisterPlugin started for plugin at "/var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock" (plugin details: &{/var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock true 2020-09-08 16:54:38.205247758 +0000 UTC m=+27.394358812})
Sep 08 17:00:53 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: E0908 17:00:53.863058   19173 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock" failed. No retries permitted until 2020-09-08 17:02:55.863016558 +0000 UTC m=+525.052127632 (durationBeforeRetry 2m2s). Error: "RegisterPlugin error -- failed to get plugin info using RPC GetInfo at socket /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock, err: rpc error: code = Unimplemented desc = unknown service pluginregistration.Registration"
Sep 08 17:00:56 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:00:56.789916   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:01:01 ip-10-XXX-XXX-XXX.aws- systemd[1]: Created slice User Slice of root.
Sep 08 17:01:01 ip-10-XXX-XXX-XXX.aws- systemd[1]: Starting User Slice of root.
Sep 08 17:01:01 ip-10-XXX-XXX-XXX.aws- systemd[1]: Started Session 2 of user root.
Sep 08 17:01:01 ip-10-XXX-XXX-XXX.aws- systemd[1]: Starting Session 2 of user root.
Sep 08 17:01:01 ip-10-XXX-XXX-XXX.aws- CROND[27605]: (root) CMD (run-parts /etc/cron.hourly)
Sep 08 17:01:01 ip-10-XXX-XXX-XXX.aws- run-parts(/etc/cron.hourly)[27608]: starting 0anacron
Sep 08 17:01:01 ip-10-XXX-XXX-XXX.aws- anacron[27614]: Anacron started on 2020-09-08
Sep 08 17:01:01 ip-10-XXX-XXX-XXX.aws- anacron[27614]: Will run job `cron.daily' in 22 min.
Sep 08 17:01:01 ip-10-XXX-XXX-XXX.aws- anacron[27614]: Will run job `cron.weekly' in 42 min.
Sep 08 17:01:01 ip-10-XXX-XXX-XXX.aws- anacron[27614]: Will run job `cron.monthly' in 62 min.
Sep 08 17:01:01 ip-10-XXX-XXX-XXX.aws- anacron[27614]: Jobs will be executed sequentially
Sep 08 17:01:01 ip-10-XXX-XXX-XXX.aws- run-parts(/etc/cron.hourly)[27616]: finished 0anacron
Sep 08 17:01:06 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:01:06.803601   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:01:14 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: W0908 17:01:14.347008   19173 reflector.go:302] object-"kube-system"/"kube-proxy-config": watch of *v1.ConfigMap ended with: too old resource version: 23621939 (23622891)
Sep 08 17:01:16 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:01:16.808674   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:01:19 ip-10-XXX-XXX-XXX.aws- containerd[3527]: time="2020-09-08T17:01:19.978145037Z" level=info msg="shim reaped" id=b53115f8cdabde25266096c9dc00fd1d566e6ee967ac1295aaebe6cd09bb9f44
Sep 08 17:01:19 ip-10-XXX-XXX-XXX.aws- dockerd[3884]: time="2020-09-08T17:01:19.988042078Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 08 17:01:19 ip-10-XXX-XXX-XXX.aws- dockerd[3884]: time="2020-09-08T17:01:19.988126742Z" level=warning msg="b53115f8cdabde25266096c9dc00fd1d566e6ee967ac1295aaebe6cd09bb9f44 cleanup: failed to unmount IPC: umount /var/lib/docker/containers/b53115f8cdabde25266096c9dc00fd1d566e6ee967ac1295aaebe6cd09bb9f44/mounts/shm, flags: 0x2: no such file or directory"
Sep 08 17:01:26 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:01:26.833592   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:01:36 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:01:36.838603   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:01:38 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 475.
Sep 08 17:01:39 ip-10-XXX-XXX-XXX.aws- dhclient[3436]: XMT: Solicit on eth0, interval 124030ms.
Sep 08 17:01:41 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 476.
Sep 08 17:01:44 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 477.
Sep 08 17:01:44 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 478.
Sep 08 17:01:46 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:01:46.843534   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:01:47 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 479.
Sep 08 17:01:47 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 480.
Sep 08 17:01:47 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 481.
Sep 08 17:01:47 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 482.
Sep 08 17:01:47 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 483.
Sep 08 17:01:47 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 484.
Sep 08 17:01:47 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 485.
Sep 08 17:01:50 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 486.
Sep 08 17:01:55 ip-10-XXX-XXX-XXX.aws- amazon-ssm-agent[3723]: 2020-09-08 17:01:55 INFO [MessagingDeliveryService] [Association] Schedule manager refreshed with 0 associations, 0 new associations associated
Sep 08 17:01:56 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 487.
Sep 08 17:01:56 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:01:56.848497   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:01:59 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 488.
Sep 08 17:01:59 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 489.
Sep 08 17:02:02 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 490.
Sep 08 17:02:02 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 491.
Sep 08 17:02:02 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 492.
Sep 08 17:02:02 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 493.
Sep 08 17:02:02 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 494.
Sep 08 17:02:05 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 495.
Sep 08 17:02:05 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 496.
Sep 08 17:02:05 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 497.
Sep 08 17:02:05 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 498.
Sep 08 17:02:05 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 499.
Sep 08 17:02:06 ip-10-XXX-XXX-XXX.aws- kernel: BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 45s!

Message from syslogd@ip-10-XXX-XXX-XXX at Sep  8 17:02:06 ...
 kernel:BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 45s!
Sep 08 17:02:06 ip-10-XXX-XXX-XXX.aws- kernel: Showing busy workqueues and worker pools:
Sep 08 17:02:06 ip-10-XXX-XXX-XXX.aws- kernel: workqueue mm_percpu_wq: flags=0x8
Sep 08 17:02:06 ip-10-XXX-XXX-XXX.aws- kernel:   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
Sep 08 17:02:06 ip-10-XXX-XXX-XXX.aws- kernel:     pending: vmstat_update
Sep 08 17:02:06 ip-10-XXX-XXX-XXX.aws- kernel: workqueue writeback: flags=0x4e
Sep 08 17:02:06 ip-10-XXX-XXX-XXX.aws- kernel:   pwq 32: cpus=0-15 flags=0x4 nice=0 active=2/256 refcnt=4
Sep 08 17:02:06 ip-10-XXX-XXX-XXX.aws- kernel:     in-flight: 106:wb_workfn wb_workfn
Sep 08 17:02:06 ip-10-XXX-XXX-XXX.aws- kernel: pool 32: cpus=0-15 flags=0x4 nice=0 hung=0s workers=4 idle: 1917 110 27370
Sep 08 17:02:06 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:02:06.862045   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:02:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 500.
Sep 08 17:02:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 501.
Sep 08 17:02:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 502.
Sep 08 17:02:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 503.
Sep 08 17:02:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 504.
Sep 08 17:02:11 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 505.
Sep 08 17:02:16 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: W0908 17:02:16.215402   19173 reflector.go:302] object-"dssadmin"/"gitlab": watch of *v1.ConfigMap ended with: too old resource version: 23622573 (23623057)
Sep 08 17:02:16 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:02:16.866970   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:02:17 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 506.
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel: INFO: rcu_sched self-detected stall on CPU
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:         5-...: (14749 ticks this GP) idle=512/140000000000001/0 softirq=16775/16775 fqs=6347 
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:          (t=14750 jiffies g=20977 c=20976 q=25810)
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel: NMI backtrace for cpu 5
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel: CPU: 5 PID: 106 Comm: kworker/u32:1 Not tainted 4.14.186-146.268.amzn2.x86_64 #1
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel: Hardware name: Amazon EC2 r5.4xlarge/, BIOS 1.0 10/16/2017
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel: Workqueue: writeback wb_workfn (flush-259:0)
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel: Call Trace:
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  <IRQ>
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  dump_stack+0x66/0x82
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  nmi_cpu_backtrace+0x89/0x90
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  ? irq_force_complete_move+0x120/0x120
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  nmi_trigger_cpumask_backtrace+0xda/0x120
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  rcu_dump_cpu_stacks+0x9e/0xd2
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  rcu_check_callbacks+0x696/0x8e0
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  ? update_wall_time+0x474/0x6f0
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  ? tick_sched_do_timer+0x40/0x40
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  update_process_times+0x28/0x50
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  tick_sched_handle+0x22/0x70
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  tick_sched_timer+0x34/0x70
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  __hrtimer_run_queues+0xde/0x230
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  hrtimer_interrupt+0xa6/0x1f0
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  smp_apic_timer_interrupt+0x6e/0x130
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  apic_timer_interrupt+0x89/0x90
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  </IRQ>
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel: RIP: 0010:move_expired_inodes+0xee/0x230
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel: RSP: 0018:ffffc900065ffcc8 EFLAGS: 00000212 ORIG_RAX: ffffffffffffff10
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel: RAX: ffff889f508909c8 RBX: ffffc900065ffce8 RCX: dead000000000200
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel: RDX: 0000000000000000 RSI: ffffc900065ffce8 RDI: ffff889d3416d108
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel: RBP: ffff889f615e1870 R08: ffff889d3416e008 R09: 0000000180400019
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel: R10: ffffc900065ffe10 R11: 0000000000025400 R12: ffff889f629fa800
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel: R13: ffff889d3416d108 R14: ffffc900065ffd78 R15: ffff889d3416e008
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  ? move_expired_inodes+0x90/0x230
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  queue_io+0x61/0xf0
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  wb_writeback+0x258/0x300
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  ? wb_workfn+0xdf/0x370
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  ? __local_bh_enable_ip+0x6c/0x70
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  wb_workfn+0xdf/0x370
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  ? __switch_to_asm+0x41/0x70
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  ? __switch_to_asm+0x35/0x70
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  process_one_work+0x17b/0x380
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  worker_thread+0x2e/0x390
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  ? process_one_work+0x380/0x380
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  kthread+0x11a/0x130
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  ? kthread_create_on_node+0x70/0x70
Sep 08 17:02:19 ip-10-XXX-XXX-XXX.aws- kernel:  ret_from_fork+0x35/0x40
Sep 08 17:02:20 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 507.
Sep 08 17:02:20 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 508.
Sep 08 17:02:23 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 509.
Sep 08 17:02:23 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 510.
Sep 08 17:02:26 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 511.
Sep 08 17:02:26 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 512.
Sep 08 17:02:26 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:02:26.871947   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:02:32 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 513.
Sep 08 17:02:35 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 514.
Sep 08 17:02:35 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 515.
Sep 08 17:02:35 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 516.
Sep 08 17:02:35 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 517.
Sep 08 17:02:35 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 518.
Sep 08 17:02:35 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 519.
Sep 08 17:02:36 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:02:36.876650   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:02:38 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 520.
Sep 08 17:02:41 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: W0908 17:02:41.213440   19173 reflector.go:302] object-"dssadmin"/"spark-versions": watch of *v1.ConfigMap ended with: too old resource version: 23622573 (23623134)
Sep 08 17:02:41 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 521.
Sep 08 17:02:43 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: W0908 17:02:43.547656   19173 reflector.go:302] object-"kube-system"/"dss-fluentd": watch of *v1.ConfigMap ended with: too old resource version: 23622205 (23623134)
Sep 08 17:02:44 ip-10-XXX-XXX-XXX.aws- amazon-ssm-agent[3723]: 2020-09-08 17:02:44 INFO [HealthCheck] HealthCheck reporting agent health.
Sep 08 17:02:44 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 522.
Sep 08 17:02:46 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:02:46.881262   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:02:47 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 523.
Sep 08 17:02:47 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 524.
Sep 08 17:02:47 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 525.
Sep 08 17:02:47 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 526.
Sep 08 17:02:47 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 527.
Sep 08 17:02:50 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 528.
Sep 08 17:02:50 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 529.
Sep 08 17:02:50 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 530.
Sep 08 17:02:53 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 531.
Sep 08 17:02:53 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 532.
Sep 08 17:02:53 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 533.
Sep 08 17:02:53 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 534.
Sep 08 17:02:53 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 535.
Sep 08 17:02:53 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 536.
Sep 08 17:02:53 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 537.
Sep 08 17:02:53 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 538.
Sep 08 17:02:53 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 539.
Sep 08 17:02:53 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 540.
Sep 08 17:02:53 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 541.
Sep 08 17:02:53 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 542.
Sep 08 17:02:53 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 543.
Sep 08 17:02:55 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:02:55.877681   19173 reconciler.go:156] operationExecutor.RegisterPlugin started for plugin at "/var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock" (plugin details: &{/var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock true 2020-09-08 16:54:38.205247758 +0000 UTC m=+27.394358812})
Sep 08 17:02:55 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: E0908 17:02:55.878205   19173 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock" failed. No retries permitted until 2020-09-08 17:04:57.87818206 +0000 UTC m=+647.067293123 (durationBeforeRetry 2m2s). Error: "RegisterPlugin error -- failed to get plugin info using RPC GetInfo at socket /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock, err: rpc error: code = Unimplemented desc = unknown service pluginregistration.Registration"
Sep 08 17:02:56 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 544.
Sep 08 17:02:56 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 545.
Sep 08 17:02:56 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 546.
Sep 08 17:02:56 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 547.
Sep 08 17:02:56 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:02:56.886199   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:02:59 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 548.
Sep 08 17:03:02 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 549.
Sep 08 17:03:02 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 550.
Sep 08 17:03:02 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 551.
Sep 08 17:03:02 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 552.
Sep 08 17:03:06 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: W0908 17:03:06.346456   19173 reflector.go:302] object-"kube-system"/"kube-proxy": watch of *v1.ConfigMap ended with: too old resource version: 23621939 (23623228)
Sep 08 17:03:06 ip-10-XXX-XXX-XXX.aws- kubelet[19173]: I0908 17:03:06.899844   19173 setters.go:73] Using node IP: "10.XXX.XXX.XXX"
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 107s!

Message from syslogd@ip-10-XXX-XXX-XXX at Sep  8 17:03:08 ...
 kernel:BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 107s!
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: Showing busy workqueues and worker pools:
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: workqueue mm_percpu_wq: flags=0x8
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel:   pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel:     pending: vmstat_update
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: workqueue writeback: flags=0x4e
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel:   pwq 32: cpus=0-15 flags=0x4 nice=0 active=2/256 refcnt=4
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel:     in-flight: 106:wb_workfn wb_workfn
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: pool 32: cpus=0-15 flags=0x4 nice=0 hung=0s workers=4 idle: 1917 110 27370
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 553.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 554.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 555.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 556.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 557.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 558.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 559.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 560.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 561.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 562.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 563.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 564.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 565.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 566.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 567.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 568.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 569.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 570.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 571.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 572.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 573.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 574.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 575.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 576.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 577.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 578.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 579.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 580.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 581.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 582.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 583.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 584.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 585.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 586.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 587.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 588.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 589.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 590.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 591.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 592.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 593.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 594.
Sep 08 17:03:08 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 595.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 596.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 597.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 598.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 599.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 600.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 601.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 602.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 603.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 604.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 605.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 606.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 607.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 608.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 609.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 610.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 611.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 612.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 613.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 614.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 615.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 616.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 617.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 618.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 619.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 620.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 621.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 622.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 623.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 624.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 625.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 626.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 627.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 628.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 629.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 630.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 631.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 632.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 633.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 634.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 635.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 636.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 637.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 638.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 639.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 640.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 641.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 642.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 643.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 644.
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: The number of lost tx completions is above the threshold (170 > 128). Reset the device
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: Trigger reset is on
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: tx_timeout: 0
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: suspend: 0
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: resume: 0
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: wd_expired: 0
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: interface_up: 1
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: interface_down: 0
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: admin_q_pause: 0
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: queue_0_tx_cnt: 4797
Sep 08 17:03:09 ip-10-XXX-XXX-XXX.aws- kernel: ena 0000:00:05.0 eth0: queue_0_tx_bytes: 801717

Unfortunately, our experience with AWS support has been less than stellar so far. I am happy to provide more information on this issue so that we can get to the root cause and resolve it.

brettplarson commented 4 years ago

Can those of you experiencing this issue please run the following on your nodes, reboot, and report whether the problem persists?

sudo amazon-linux-extras install kernel-ng
sudo reboot

This seems to work; I've tested it a few times.
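
To confirm what the kernel-ng extras topic will install and that the node actually came up on the new kernel after the reboot, here is a minimal sketch (assuming an Amazon Linux 2 node you can SSH into; the 5.x kernel is simply what kernel-ng resolved to at the time, not a guarantee):

# before rebooting: see which kernel version the kernel-ng topic currently provides
amazon-linux-extras list | grep kernel-ng

# after rebooting: verify the node booted the new kernel rather than 4.14
uname -r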

robertgates55 commented 4 years ago

@deniss-rostkovskis We think we're hitting this issue in our prod cluster too. Are there any telltale metrics or logs we could look for to confirm it's almost certainly the same thing? (There's quite a bit on this thread now, so I wanted to confirm.)
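
For anyone trying to confirm they're hitting the same failure mode, the telltale signatures quoted earlier in this thread can be grepped for directly on a suspect node. A minimal sketch, assuming the node is still reachable over SSH or SSM (the exact message text can vary slightly between kernel versions):

# soft lockup / workqueue lockup / RCU stall messages in the kernel log
sudo dmesg -T | grep -E 'soft lockup|workqueue lockup|rcu_sched self-detected stall'

# the ENA driver complaints that tend to accompany the lockups
sudo journalctl -k | grep "Found a Tx that wasn't completed on time"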

raravena80 commented 4 years ago

We are seeing this issue as well. Can we have a patched AMI from EKS ASAP? Thanks.

mmerkes commented 4 years ago

Upgrading the kernel from 4.14 to 5.4 seems to resolve the issue for customers. The one risk of the amazon-linux-extras install kernel-ng method is that kernel-ng refers to the next-generation kernel, so it's not guaranteed to always be the 5.4 kernel; at some point in the future, it could get bumped to 5.9. Also, EKS doesn't run conformance tests on the 5.4 kernel, so we can't provide the same guarantees, though it is officially supported by AmazonLinux.

We've been struggling to get a repro of the issue, but we've been working with the AmazonLinux team and have found a few kernel patches that we suspect may fix the issue. We're working on a plan to get those out and figuring out how to test if they resolve the issue or not. I will post an update here when I have more information.

mmerkes commented 4 years ago

The AmazonLinux team has pulled in a few patches that we suspect might resolve the soft lockup issue. However, without a successful repro on our end, we won't know for certain unless someone with a known repro is able to verify that it resolves the issue. The patches will be available via yum update kernel once the new kernel lands in yum; it will be versioned 4.14.198.*, the most recent kernel currently being 4.14.193-149.317.amzn2. I will post here when I see it's available.

Once the update is available, I will work on releasing new AMIs with the updated kernel. At that point, you'll be able to use the latest EKS AMIs to get the patches.
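
Until the patched kernel and AMIs are out, a minimal sketch for checking what a node is running today and whether the 4.14.198.* build has reached the yum repositories yet (the version string to look for is taken from the comment above):

# kernel the node is currently running
uname -r

# every kernel version the configured repos currently offer; empty output means the patch isn't available yet
sudo yum --showduplicates list available kernel | grep 4.14.198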

cshivashankar commented 4 years ago

Hi @mmerkes, is it right to assume that the Amazon Linux kernel is based on the upstream Linux kernel (https://github.com/torvalds/linux)? If so, do you think moving to version 5 would provide significant advantages? Since this issue is essentially a kernel panic, debugging the kernel could be key. Do you have any suggestions on how to go about that?

mmerkes commented 4 years ago

Kernel 4.14.198-152.320.amzn2 includes patches that we hope resolve some of the issues, and it is now available via yum update kernel. As an alternative to upgrading to 5.4, customers can try updating the kernel with yum. On the AWS side, we're planning to build new EKS AMIs with the patches and make them available ASAP, but that will take a little time. I'll post here when those AMIs are published.

sudo yum update -y kernel
sudo reboot

If anyone who has a repro of this tries updating the kernel, let me know! We're trying to figure out if this resolves the issue or not.
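
For anyone trying this, a minimal sketch of how to confirm the patched kernel actually took effect after the update and reboot (the expected version string comes from the comment above):

# kernel packages installed on the node; the new build should appear here after the yum update
rpm -q kernel

# kernel the node booted into; after the reboot this should show 4.14.198-152.320.amzn2.x86_64
uname -r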

@cshivashankar Yes and no. AmazonLinux, I believe, uses a fork of the kernel, which is the upstream kernel plus some custom patches. In this case, the patches were backported to 4.14 upstream, so AmazonLinux did a rebase with upstream to get the changes. I'm not sure what all the advantages are as I'm not a kernel expert. At some point, EKS will start using the 5.4 kernel in the AMIs we manage. No specific dates, but that's our intention.

mmerkes commented 4 years ago

EKS has published AMIs with the 4.14.198-152.320.amzn2 kernel, which includes the patches that may resolve this issue. As an alternative to the yum update, you can replace your nodes using the latest EKS AMIs. If you use EKS managed node groups, you can use the update-nodegroup-version API. These links should take you to the right AMIs, published 10/2/20, though not all of the documentation has been updated yet with the latest versions.
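
For managed node groups, a minimal sketch of rolling the nodes onto the newly published AMIs with the AWS CLI (my-cluster and my-nodegroup are placeholders; with no release version specified, the call should pick up the latest AMI for the node group's Kubernetes version):

# start a rolling update of the node group to the latest AMI release
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup

# check on the rolling update; the update id is returned by the command above
aws eks describe-update \
  --name <update-id> \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup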

JacobHenner commented 4 years ago

Continued to see this issue after updating to the newest AMI:

# uname -a
Linux ip-10-12-105-157.ec2.internal 4.14.198-152.320.amzn2.x86_64 #1 SMP Wed Sep 23 23:57:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[  467.626433] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/u16:4:103]
[  467.631670] Modules linked in: iptable_security iptable_raw xt_multiport veth iptable_mangle xt_connmark nf_conntrack_netlink nfnetlink xt_nat xt_statistic ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nf_defrag_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_conntrack xt_comment xt_mark xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack br_netfilter bridge stp llc overlay sunrpc crc32_pclmul ghash_clmulni_intel pcbc pcc_cpufreq aesni_intel aes_x86_64 crypto_simd glue_helper cryptd mousedev psmouse evdev ena button ip_tables x_tables xfs libcrc32c nvme crc32c_intel nvme_core ipv6 crc_ccitt autofs4
[  467.663395] CPU: 5 PID: 103 Comm: kworker/u16:4 Not tainted 4.14.198-152.320.amzn2.x86_64 #1
[  467.668873] Hardware name: Amazon EC2 m5.2xlarge/, BIOS 1.0 10/16/2017
[  467.672456] Workqueue: writeback wb_workfn (flush-259:0)
[  467.675605] task: ffff888802060000 task.stack: ffffc90003468000
[  467.678961] RIP: 0010:move_expired_inodes+0x55/0x210
[  467.682001] RSP: 0018:ffffc9000346bcd8 EFLAGS: 00000216 ORIG_RAX: ffffffffffffff10
[  467.687148] RAX: 000000010000a448 RBX: ffffc9000346bcf0 RCX: dead000000000200
[  467.690889] RDX: 0000000000020007 RSI: ffffc9000346bcf0 RDI: ffff88848f4bbde0
[  467.694651] RBP: ffff8888022a2070 R08: ffff88848f4bba88 R09: 0000000180400028
[  467.698437] R10: ffffc9000346be18 R11: 0000000000025400 R12: ffff888801331800
[  467.702204] R13: ffff88848f4bbde0 R14: 0000000000000000 R15: ffff88848f4bc208
[  467.706003] FS:  0000000000000000(0000) GS:ffff888822740000(0000) knlGS:0000000000000000
[  467.711371] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  467.714740] CR2: 00007f3eeffbc1d8 CR3: 000000000200a004 CR4: 00000000007606e0
[  467.718520] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  467.722318] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  467.726071] PKRU: 55555554
[  467.728387] Call Trace:
[  467.730602]  queue_io+0x66/0x110
[  467.733064]  wb_writeback+0x253/0x300
[  467.735696]  ? wb_workfn+0xdf/0x370
[  467.738242]  wb_workfn+0xdf/0x370
[  467.740745]  ? __switch_to_asm+0x35/0x70
[  467.743429]  process_one_work+0x17b/0x380
[  467.746167]  worker_thread+0x2e/0x390
[  467.748794]  ? process_one_work+0x380/0x380
[  467.751611]  kthread+0x11a/0x130
[  467.754070]  ? kthread_create_on_node+0x70/0x70
[  467.756975]  ret_from_fork+0x35/0x40
[  467.759547] Code: 48 89 0c 24 c7 44 24 08 00 00 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 48 89 5c 24 18 48 89 5c 24 20 e9 9e 00 00 00 <4c> 89 ff e8 e3 07 11 00 84 c0 74 0e 49 8b 17 49 8b 47 08 48 89 
[  476.230407] ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 5, index 150.

SQUIDwarrior commented 4 years ago

Same here @mmerkes:

Message from syslogd@ip-10-134-142-241 at Oct  6 17:30:16 ...
 kernel:watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [kworker/u66:4:1170]

[mike.deats@ip-10-134-142-241 ~]$ uname -a
Linux ip-10-134-142-241.ec2.internal 4.14.198-152.320.amzn2.x86_64 #1 SMP Wed Sep 23 23:57:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

mmerkes commented 4 years ago

@SQUIDwarrior @JacobHenner Sadness :( Thanks for trying out the latest kernel and letting us know! We'll keep at it. I will go back to the AmazonLinux team with this info, and we'll go back to the drawing board.