Flaky network with CPU soft lockups after upgrade to 31.20200323.2.0

wcurry commented 3 years ago

Describe the bug Network is flaky after upgrade from 31.20200310.3.0 to 31.20200323.2.0.

Reproduction steps Steps to reproduce the behavior:

Bump version to 31.20200323.2.0 and deploy new Kubernetes cluster.
Hit apiserver through deployment or e2e testing
Observe etcdserver timeout errors and in some cases etcd loss of quorum

Expected behavior

CPU should not lockup. Network should deliver all packets. Network interface should not reset.

Actual behavior

etcdserver timeout errors are received from clients.
etcd server logs slow responses.
rafthttp timeouts over host network to other etcd members eventually lead to loss of quorum.

System details

AWS i3.xlarge
Fedora CoreOS 31.20200323.2.0 (ami-061e1a63e880fe714)
NVME and io1 volumes attached

Ignition config

{"ignition":{"config":{"replace":{"source":null,"verification":{}}},"security":{"tls":{}},"timeouts":{},"version":"3.0.0"},"passwd":{},"storage":{"disks":[{"device":"/dev/nvme0n1","partitions":[{"label":"ETCDDATAFS"}],"wipeTable":true},{"device":"/dev/xvdp","partitions":[{"label":"IMAGEFS","sizeMiB":0}],"wipeTable":true},{"device":"/dev/xvdc","partitions":[{"label":"NODEFS","sizeMiB":0}],"wipeTable":true}],"files":[{"group":{},"path":"/etc/systemd/network/50-flannel.link","user":{},"contents":{"source":"data:,%5BMatch%5D%0AOriginalName%3Dflannel*%0A%5BLink%5D%0AMACAddressPolicy%3Dnone%0A","verification":{}}},{"group":{},"path":"/etc/docker/daemon.json","user":{},"contents":{"source":"data:,xxx","verification":{}},"mode":420},{"group":{},"path":"/root/.docker/config.json","user":{},"contents":{"source":"data:,xxx","verification":{}},"mode":493},{"group":{},"overwrite":true,"path":"/etc/sysconfig/docker","user":{},"contents":{"source":"data:,OPTIONS%3D%22--selinux-enabled%20%5C%0A--log-driver%3Djson-file%20%5C%0A--log-opt%20max-size%3D20m%20%5C%0A--log-opt%20max-file%3D10%20%5C%0A--live-restore%20%5C%0A--init-path%20%2Fusr%2Flibexec%2Fdocker%2Fdocker-init%20%5C%0A--userland-proxy-path%20%2Fusr%2Flibexec%2Fdocker%2Fdocker-proxy%20%5C%0A%22%0A","verification":{}}},{"group":{},"overwrite":true,"path":"/etc/selinux/config","user":{},"contents":{"source":"data:,SELINUX%3Ddisabled%0ASELINUXTYPE%3Dtargeted%0A","verification":{}}},{"group":{},"path":"/var/lib/torcx/store/crio:v1.14.master-906b95a-a.torcx.tgz","user":{},"contents":{"source":"s3://bucket/binaries/crio:v1.14.master-906b95a-a.torcx.tgz","verification":{}}},{"group":{},"path":"/etc/kubernetes/kubelet/kubeconfig","user":{},"contents":{"source":"data:,apiVersion%3A%20v1%0Akind%3A%20Config%0Aclusters%3A%0A-%20name%3A%20local%0A%20%20cluster%3A%0A%20%20%20%20server%3A%20https%3A%2F%2Faws-1906064%0A%20%20%20%20certificate-authority%3A%20%2Fetc%2Fkubernetes%2Fca.crt%0Ausers%3A%0A-%20name%3A%20kubelet%0A%20%20user%3A%0A%20%20%20%20client-certificate%3A%20%2Fetc%2Fkubernetes%2Ftls.crt%0A%20%20%20%20client-key%3A%20%2Fetc%2Fkubernetes%2Ftls.key%0Acontexts%3A%0A-%20context%3A%0A%20%20%20%20cluster%3A%20local%0A%20%20%20%20user%3A%20kubelet%20%20%0A","verification":{}}},{"group":{},"path":"/etc/sysctl.d/90-kubelet.conf","user":{},"contents":{"source":"data:,vm.overcommit_memory%3D1%0Akernel.panic%3D10%0Akernel.panic_on_oops%3D1%0Anet.ipv4.ip_local_port_range%3D35000%2065535%0A","verification":{}}}],"filesystems":[{"device":"/dev/nvme0n1","format":"xfs","path":"/dev/nvme0n1","wipeFilesystem":true},{"device":"/dev/xvdp","format":"xfs","path":"/dev/xvdp","wipeFilesystem":true},{"device":"/dev/xvdc","format":"xfs","path":"/dev/xvdc","wipeFilesystem":true}]},"systemd":{"units":[{"mask":true,"name":"zincati.service"},{"enabled":true,"name":"fstrim.timer"},{"contents":"[Unit]\nDescription=Mount instance storage to /var/lib/etcd\n[Mount]\nWhat=/dev/nvme0n1\nWhere=/var/lib/etcd\nType=xfs\n","name":"var-lib-etcd.mount"},{"contents":"[Unit]\nDescription=Mount ephemeral to /var/lib/docker\n[Mount]\nWhat=/dev/xvdp\nWhere=/var/lib/docker\nType=xfs\n","name":"var-lib-docker.mount"},{"contents":"[Unit]\nDescription=Mount ephemeral to /var/lib/kubelet\n[Mount]\nWhat=/dev/xvdc\nWhere=/var/lib/kubelet\nType=xfs\n","name":"var-lib-kubelet.mount"},{"contents":"[Unit]","enabled":true,"name":"generate-etcd-certs.service"},{"contents":"[Unit]","enabled":true,"name":"etcd-member-new.service"},{"contents":"[Unit]\nDescription=Container runtime slice\nBefore=slices.target\n[Slice]\nCPUShares=1024\n[Install]\nWantedBy=multi-user.target\n","enabled":true,"name":"kubeletreserved.slice"},{"contents":"[Unit]\nDescription=Container runtime slice\nBefore=slices.target\n[Slice]\n[Install]\nWantedBy=multi-user.target\n","enabled":true,"name":"kubeletreserved-kubelet.slice"},{"dropins":[{"contents":"[Unit]\nAfter=var-lib-docker.mount\nRequires=var-lib-docker.mount\n[Service]\nSlice=kubeletreserved.slice\n","name":"20-await-var-lib-docker-mount.conf"}],"enabled":true,"name":"docker.service"},{"contents":"[Unit]\nDescription=Clean up hanging runc processes\n[Service]\nExecStart=/bin/bash -c \"pids=$(ps -eo etime,pid,cmd --sort=start_time | grep -v containerd-shim | grep runc |  grep -e '[0-9]*:[0-5][0-9]:[0-5][0-9]' | awk '{ print $2 }') && if [[ ! -z \\\"$pids\\\" ]]; then /bin/echo Cleaning up the following runc processes: $pids && /bin/kill $pids; fi\"\nRestart=always\nRestartSec=300\n[Install]\nWantedBy=multi-user.target\n","enabled":true,"name":"runc-cleanup.service"},{"contents":"[Unit]\nDescription=Setup secure routing\nRequires=afterburn.service\nAfter=afterburn.service\n[Service]\nEnvironmentFile=/run/metadata/afterburn\nExecStartPre=/usr/sbin/iptables -t nat -A PREROUTING -d 169.254.169.254/32 -i cni0 -p tcp -m tcp --dport 80 -j DNAT --to-destination ${AFTERBURN_AWS_IPV4_LOCAL}:9000\nExecStart=/bin/echo \"Secure iptable rules\"\nRemainAfterExit=true\nRestart=on-failure\nRestartSec=10\n[Install]\nWantedBy=multi-user.target\n","enabled":true,"name":"security-setup.service"},{"contents":"[Unit]","enabled":true,"name":"generate-kubelet-certs.service"},{"contents":"[Unit]","enabled":true,"name":"kubelet.service"},{"contents":"[Unit]\nDescription=Enable SMT on first boot on Intel CPUs to disable MDS mitigation\nDefaultDependencies=no\nBefore=sysinit.target shutdown.target\nConflicts=shutdown.target\nConditionFirstBoot=true\n[Service]\nType=oneshot\nExecStart=/bin/bash -c 'active=\"$(cat /sys/devices/system/cpu/smt/active)\" && if [[ \"$active\" != 1 ]] && grep -q \"vendor_id.*GenuineIntel\" /proc/cpuinfo; then echo \"Enabling SMT.\" && echo on > /sys/devices/system/cpu/smt/control; fi'\n[Install]\nWantedBy=sysinit.target\n","enabled":true,"name":"enable-smt-firstboot.service"},{"contents":"[Unit]\nDescription=Enable SMT on subsequent boots by changing kernel arguments\nConditionFirstBoot=true\n[Service]\nType=oneshot\nExecStart=/usr/bin/rpm-ostree kargs --append \"mitigations=auto tsx=auto\"\n[Install]\nWantedBy=multi-user.target    \n","enabled":true,"name":"enable-smt-subsequentboot.service"}]}}

Additional information

While perfoming an upgrade from 31.20200310.3.0 to the latest FCOS 32 I tracked this issue back to 31.20200323.2.0. 31.20200310.3.0 (the next oldest AMI available) does not exhibit the issue.

I found this issue (https://github.com/amzn/amzn-drivers/issues/84) that suggested console logging could be to blame. We had selinux in permissive mode and it was spamming. I never observed the "too much work for irq..." error. I disabled selinux anyway to clean up dmesg and the problem persisted in a new cluster.

I found this issue (https://github.com/awslabs/amazon-eks-ami/issues/454) that suggests IOPS may be to blame. We have an NVME disk and had 3 GP2 volumes attached. None of the volumes had used their burst budget, but the root volume had come close. I upped all these GP2 volumes to io1 with 3000 IOPS. The problem still exists with these settings.

dmesg errors:

[  +0.004979] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 677.
[  +0.005453] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 678.
[  +0.004989] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 679.
[  +0.005288] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 680.
[  +0.004988] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 681.
[  +0.005344] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 682.
[  +0.005178] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 683.
[  +0.005375] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 684.
[  +0.005158] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 685.
[  +0.005211] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 686.
[  +0.004996] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 687.
[  +0.004923] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 688.
[  +0.004767] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 689.
[  +0.005216] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 690.
[  +0.004990] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 691.
[  +0.005130] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 692.
[  +0.005046] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 693.
[  +0.005106] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 694.
[  +0.004974] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 695.
[  +0.005466] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 696.
[  +0.004906] ena 0000:00:03.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 697.
[  +0.005115] ena 0000:00:03.0 eth0: The number of lost tx completions is above the threshold (156 > 128). Reset the device
[  +0.006585] ena 0000:00:03.0 eth0: Trigger reset is on
[  +0.003350] ena 0000:00:03.0 eth0: tx_timeout: 1
[  +0.002907] ena 0000:00:03.0 eth0: suspend: 0
[  +0.002859] ena 0000:00:03.0 eth0: resume: 0
[  +0.002708] ena 0000:00:03.0 eth0: wd_expired: 0
[  +0.003022] ena 0000:00:03.0 eth0: interface_up: 18
[  +0.002948] ena 0000:00:03.0 eth0: interface_down: 17
[  +0.002961] ena 0000:00:03.0 eth0: admin_q_pause: 0
[  +0.002797] ena 0000:00:03.0 eth0: queue_0_tx_cnt: 724709
[  +0.003149] ena 0000:00:03.0 eth0: queue_0_tx_bytes: 122518884
[  +0.003475] ena 0000:00:03.0 eth0: queue_0_tx_queue_stop: 0
[  +0.003512] ena 0000:00:03.0 eth0: queue_0_tx_queue_wakeup: 0
[  +0.003386] ena 0000:00:03.0 eth0: queue_0_tx_dma_mapping_err: 0
[  +0.003820] ena 0000:00:03.0 eth0: queue_0_tx_linearize: 0
[  +0.003509] ena 0000:00:03.0 eth0: queue_0_tx_linearize_failed: 0
[  +0.003886] ena 0000:00:03.0 eth0: queue_0_tx_napi_comp: 1283867
[  +0.003740] ena 0000:00:03.0 eth0: queue_0_tx_tx_poll: 1283869
[  +0.004275] ena 0000:00:03.0 eth0: queue_0_tx_doorbells: 705972
[  +0.004358] ena 0000:00:03.0 eth0: queue_0_tx_prepare_ctx_err: 0
[  +0.003993] ena 0000:00:03.0 eth0: queue_0_tx_bad_req_id: 0
[  +0.004087] ena 0000:00:03.0 eth0: queue_0_tx_llq_buffer_copy: 0
[  +0.004392] ena 0000:00:03.0 eth0: queue_0_tx_missed_tx: 156
[  +0.004111] ena 0000:00:03.0 eth0: queue_0_rx_cnt: 614448
[  +0.003638] ena 0000:00:03.0 eth0: queue_0_rx_bytes: 54930866
[  +0.003869] ena 0000:00:03.0 eth0: queue_0_rx_rx_copybreak_pkt: 605369
[  +0.004186] ena 0000:00:03.0 eth0: queue_0_rx_csum_good: 0
[  +0.003116] ena 0000:00:03.0 eth0: queue_0_rx_refil_partial: 0
[  +0.003511] ena 0000:00:03.0 eth0: queue_0_rx_bad_csum: 0
[  +0.003219] ena 0000:00:03.0 eth0: queue_0_rx_page_alloc_fail: 0
[  +0.003951] ena 0000:00:03.0 eth0: queue_0_rx_skb_alloc_fail: 0
[  +0.003690] ena 0000:00:03.0 eth0: queue_0_rx_dma_mapping_err: 0
[  +0.003899] ena 0000:00:03.0 eth0: queue_0_rx_bad_desc_num: 0
[  +0.003616] ena 0000:00:03.0 eth0: queue_0_rx_bad_req_id: 0
[  +0.003696] ena 0000:00:03.0 eth0: queue_0_rx_empty_rx_ring: 0
[  +0.003581] ena 0000:00:03.0 eth0: queue_0_rx_csum_unchecked: 0
[  +0.003806] ena 0000:00:03.0 eth0: queue_1_tx_cnt: 798224
[  +0.003319] ena 0000:00:03.0 eth0: queue_1_tx_bytes: 178783389
[  +0.003531] ena 0000:00:03.0 eth0: queue_1_tx_queue_stop: 0
[  +0.003206] ena 0000:00:03.0 eth0: queue_1_tx_queue_wakeup: 0
[  +0.003736] ena 0000:00:03.0 eth0: queue_1_tx_dma_mapping_err: 0
[  +0.003706] ena 0000:00:03.0 eth0: queue_1_tx_linearize: 0
[  +0.003585] ena 0000:00:03.0 eth0: queue_1_tx_linearize_failed: 0
[  +0.003793] ena 0000:00:03.0 eth0: queue_1_tx_napi_comp: 1084422
[  +0.003942] ena 0000:00:03.0 eth0: queue_1_tx_tx_poll: 1084426
[  +0.003634] ena 0000:00:03.0 eth0: queue_1_tx_doorbells: 754756
[  +0.003884] ena 0000:00:03.0 eth0: queue_1_tx_prepare_ctx_err: 0
[  +0.003721] ena 0000:00:03.0 eth0: queue_1_tx_bad_req_id: 0
[  +0.004029] ena 0000:00:03.0 eth0: queue_1_tx_llq_buffer_copy: 0
[  +0.003773] ena 0000:00:03.0 eth0: queue_1_tx_missed_tx: 0
[  +0.003678] ena 0000:00:03.0 eth0: queue_1_rx_cnt: 337173
[  +0.003360] ena 0000:00:03.0 eth0: queue_1_rx_bytes: 30450173
[  +0.003781] ena 0000:00:03.0 eth0: queue_1_rx_rx_copybreak_pkt: 328566
[  +0.004148] ena 0000:00:03.0 eth0: queue_1_rx_csum_good: 0
[  +0.003475] ena 0000:00:03.0 eth0: queue_1_rx_refil_partial: 0
[  +0.003790] ena 0000:00:03.0 eth0: queue_1_rx_bad_csum: 0
[  +0.003415] ena 0000:00:03.0 eth0: queue_1_rx_page_alloc_fail: 0
[  +0.003896] ena 0000:00:03.0 eth0: queue_1_rx_skb_alloc_fail: 0
[  +0.003722] ena 0000:00:03.0 eth0: queue_1_rx_dma_mapping_err: 0
[  +0.003865] ena 0000:00:03.0 eth0: queue_1_rx_bad_desc_num: 0
[  +0.003665] ena 0000:00:03.0 eth0: queue_1_rx_bad_req_id: 0
[  +0.003715] ena 0000:00:03.0 eth0: queue_1_rx_empty_rx_ring: 0
[  +0.003726] ena 0000:00:03.0 eth0: queue_1_rx_csum_unchecked: 0
[  +0.004074] ena 0000:00:03.0 eth0: ena_admin_q_aborted_cmd: 0
[  +0.003769] ena 0000:00:03.0 eth0: ena_admin_q_submitted_cmd: 311
[  +0.003899] ena 0000:00:03.0 eth0: ena_admin_q_completed_cmd: 311
[  +0.004053] ena 0000:00:03.0 eth0: ena_admin_q_out_of_space: 0
[  +0.003688] ena 0000:00:03.0 eth0: ena_admin_q_no_completion: 0
[ +20.848000] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [runc:611505]
[  +0.004967] Modules linked in: binfmt_misc nf_tables vxlan ip6_udp_tunnel udp_tunnel xt_statistic ipt_REJECT nf_reject_ipv4 veth iptable_mangle xt_comment xt_mark xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter xt_conntrack br_netfilter bridge stp llc overlay iptable_nat xt_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 vfat fat intel_rapl_msr intel_rapl_common sb_edac cirrus snd_pcsp drm_kms_helper snd_pcm drm snd_timer intel_rapl_perf snd i2c_piix4 soundcore ip_tables xfs rfkill xen_blkfront crct10dif_pclmul crc32_pclmul nvme ghash_clmulni_intel serio_raw nvme_core ena ata_generic pata_acpi btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c crc32c_intel zstd_decompress fuse
[  +0.043487] CPU: 2 PID: 611505 Comm: runc Tainted: G        W    L    5.5.10-200.fc31.x86_64 #1
[  +0.006328] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[  +0.004443] RIP: 0010:__raw_callee_save___pv_queued_spin_unlock+0xc/0x12
[  +0.005270] Code: 56 57 41 50 41 51 41 52 41 53 e8 4f 05 00 00 41 5b 41 5a 41 59 41 58 5f 5e 5a 59 c3 66 90 52 b8 01 00 00 00 31 d2 f0 0f b0 17 <3c> 01 75 02 5a c3 56 0f b6 f0 e8 c5 ff ff ff 5e 5a c3 66 90 0f 1f
[  +0.011775] RSP: 0018:ffffbef181a77c70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
[  +0.004394] RAX: 0000000000000001 RBX: ffffa09ee03abfc0 RCX: ffffa09700000db8
[  +0.004210] RDX: 0000000000000000 RSI: ffffdf631f2d2800 RDI: ffffdf6318778928
[  +0.004339] RBP: 80000007cb4a0025 R08: 0000000000000000 R09: 0000000000820000
[  +0.004550] R10: 00000000005b86c7 R11: ffffffffa965bc48 R12: 0000000000000004
[  +0.004573] R13: 000055a99bfb7000 R14: ffffdf631f2d2800 R15: ffffbef181a77de0
[  +0.004505] FS:  0000000000000000(0000) GS:ffffa09f01280000(0000) knlGS:0000000000000000
[  +0.004951] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.003672] CR2: 000000c000260000 CR3: 000000034f60a003 CR4: 00000000001606e0
[  +0.004454] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.004541] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.004505] Call Trace:
[  +0.001874]  unmap_page_range+0x9af/0xed0
[  +0.002801]  unmap_vmas+0x7a/0xf0
[  +0.002384]  exit_mmap+0xad/0x1a0
[  +0.002416]  mmput+0x61/0x140
[  +0.002251]  do_exit+0x30d/0xb80
[  +0.002371]  do_group_exit+0x3a/0xa0
[  +0.002492]  __x64_sys_exit_group+0x14/0x20
[  +0.002863]  do_syscall_64+0x5b/0x1c0
[  +0.002564]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.003401] RIP: 0033:0x55a99bc78b2b
[  +0.002818] Code: Bad RIP value.
[  +0.002377] RSP: 002b:000000c000189f58 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  +0.004872] RAX: ffffffffffffffda RBX: 000000c000189968 RCX: 000055a99bc78b2b
[  +0.004573] RDX: 000000c000189f90 RSI: 0000000000000240 RDI: 0000000000000000
[  +0.004746] RBP: 000000c000189f58 R08: 0000000000000009 R09: ffffffffffffffff
[  +0.004380] R10: 00007f69462c2aa0 R11: 0000000000000246 R12: 00000000000000ff
[  +0.004311] R13: 00ffffffffffffff R14: 000055a99bfd2892 R15: 0000000000000038
[Nov 6 00:32] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [runc:611505]
[  +0.005295] Modules linked in: binfmt_misc nf_tables vxlan ip6_udp_tunnel udp_tunnel xt_statistic ipt_REJECT nf_reject_ipv4 veth iptable_mangle xt_comment xt_mark xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter xt_conntrack br_netfilter bridge stp llc overlay iptable_nat xt_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 vfat fat intel_rapl_msr intel_rapl_common sb_edac cirrus snd_pcsp drm_kms_helper snd_pcm drm snd_timer intel_rapl_perf snd i2c_piix4 soundcore ip_tables xfs rfkill xen_blkfront crct10dif_pclmul crc32_pclmul nvme ghash_clmulni_intel serio_raw nvme_core ena ata_generic pata_acpi btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c crc32c_intel zstd_decompress fuse
[  +0.043559] CPU: 2 PID: 611505 Comm: runc Tainted: G        W    L    5.5.10-200.fc31.x86_64 #1
[  +0.005694] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[  +0.004188] RIP: 0010:__raw_callee_save___pv_queued_spin_unlock+0xc/0x12
[  +0.004384] Code: 56 57 41 50 41 51 41 52 41 53 e8 4f 05 00 00 41 5b 41 5a 41 59 41 58 5f 5e 5a 59 c3 66 90 52 b8 01 00 00 00 31 d2 f0 0f b0 17 <3c> 01 75 02 5a c3 56 0f b6 f0 e8 c5 ff ff ff 5e 5a c3 66 90 0f 1f
[  +0.011269] RSP: 0018:ffffbef181a77c70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
[  +0.004525] RAX: 0000000000000001 RBX: ffffa09ee03abfc0 RCX: ffffa09700000db8
[  +0.004807] RDX: 0000000000000000 RSI: ffffdf631f2d2800 RDI: ffffdf6318778928
[  +0.004828] RBP: 80000007cb4a0025 R08: 0000000000000000 R09: 0000000000820000
[  +0.004666] R10: 00000000005b86c7 R11: ffffffffa965bc48 R12: 0000000000000004
[  +0.004716] R13: 000055a99bfb7000 R14: ffffdf631f2d2800 R15: ffffbef181a77de0
[  +0.004648] FS:  0000000000000000(0000) GS:ffffa09f01280000(0000) knlGS:0000000000000000
[  +0.005314] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.003833] CR2: 000055a99bc78b01 CR3: 000000034f60a003 CR4: 00000000001606e0
[  +0.004494] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.004468] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.004424] Call Trace:
[  +0.001879]  unmap_page_range+0x9af/0xed0
[  +0.002702]  unmap_vmas+0x7a/0xf0
[  +0.002288]  exit_mmap+0xad/0x1a0
[  +0.002330]  mmput+0x61/0x140
[  +0.002171]  do_exit+0x30d/0xb80
[  +0.002308]  do_group_exit+0x3a/0xa0
[  +0.002451]  __x64_sys_exit_group+0x14/0x20
[  +0.002804]  do_syscall_64+0x5b/0x1c0
[  +0.002538]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.003437] RIP: 0033:0x55a99bc78b2b
[  +0.002415] Code: Bad RIP value.
[  +0.002271] RSP: 002b:000000c000189f58 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  +0.004593] RAX: ffffffffffffffda RBX: 000000c000189968 RCX: 000055a99bc78b2b
[  +0.004409] RDX: 000000c000189f90 RSI: 0000000000000240 RDI: 0000000000000000
[  +0.004430] RBP: 000000c000189f58 R08: 0000000000000009 R09: ffffffffffffffff
[  +0.004373] R10: 00007f69462c2aa0 R11: 0000000000000246 R12: 00000000000000ff
[  +0.004448] R13: 00ffffffffffffff R14: 000055a99bfd2892 R15: 0000000000000038
[  +2.030502] ena 0000:00:03.0 eth0: free uncompleted tx skb qid 0 idx 0x21e
[  +0.005300] ena 0000:00:03.0 eth0: free uncompleted tx skb qid 1 idx 0x14f
[  +0.016827] ena: ena device version: 0.10
[  +0.003123] ena: ena controller version: 0.0.1 implementation version 1
[  +0.057904] ena 0000:00:03.0: Device reset completed successfully, Driver info: Elastic Network Adapter (ENA) v2.1.0K

etcd errors:

Nov 05 19:00:26 ip-172-27-189-80 podman[1774]: 2020-11-05 19:00:26.094701 E | rafthttp: failed to read 293a74e392af7f38 on stream MsgApp v2 (read tcp 172.27.189.80:53478->172.27.191.45:2380: i/o timeout)
Nov 05 19:01:32 ip-172-27-189-80 podman[1774]: 2020-11-05 19:01:32.098078 E | rafthttp: failed to read 2e26cab88dcc7664 on stream Message (read tcp 172.27.189.80:44522->172.27.186.181:2380: i/o timeout)
Nov 05 19:01:34 ip-172-27-189-80 podman[1774]: 2020-11-05 19:01:34.425868 E | rafthttp: failed to read 293a74e392af7f38 on stream Message (read tcp 172.27.189.80:56660->172.27.191.45:2380: i/o timeout)
Nov 05 19:27:31 ip-172-27-189-80 podman[1774]: 2020-11-05 19:27:31.917440 E | rafthttp: failed to read 293a74e392af7f38 on stream Message (read tcp 172.27.189.80:57326->172.27.191.45:2380: i/o timeout)
Nov 05 19:27:37 ip-172-27-189-80 podman[1774]: 2020-11-05 19:27:37.708048 E | rafthttp: failed to read 293a74e392af7f38 on stream MsgApp v2 (read tcp 172.27.189.80:57328->172.27.191.45:2380: i/o timeout)
Nov 05 19:51:27 ip-172-27-189-80 podman[1774]: 2020-11-05 19:51:27.450964 E | rafthttp: failed to read 2e26cab88dcc7664 on stream Message (read tcp 172.27.189.80:48362->172.27.186.181:2380: i/o timeout)
Nov 05 19:54:04 ip-172-27-189-80 podman[1774]: 2020-11-05 19:54:04.323375 E | rafthttp: failed to read 293a74e392af7f38 on stream MsgApp v2 (read tcp 172.27.189.80:38662->172.27.191.45:2380: i/o timeout)
Nov 05 19:54:04 ip-172-27-189-80 podman[1774]: 2020-11-05 19:54:04.466809 E | rafthttp: failed to read 2e26cab88dcc7664 on stream MsgApp v2 (read tcp 172.27.189.80:40816->172.27.186.181:2380: i/o timeout)
Nov 05 19:55:05 ip-172-27-189-80 podman[1774]: 2020-11-05 19:55:05.988079 E | rafthttp: failed to read 293a74e392af7f38 on stream MsgApp v2 (read tcp 172.27.189.80:51122->172.27.191.45:2380: i/o timeout)
Nov 05 19:59:49 ip-172-27-189-80 podman[1774]: 2020-11-05 19:59:49.474327 E | rafthttp: failed to read 2e26cab88dcc7664 on stream MsgApp v2 (read tcp 172.27.189.80:42168->172.27.186.181:2380: i/o timeout)
Nov 05 20:12:04 ip-172-27-189-80 podman[1774]: 2020-11-05 20:12:04.288924 E | rafthttp: failed to read 293a74e392af7f38 on stream MsgApp v2 (read tcp 172.27.189.80:51972->172.27.191.45:2380: i/o timeout)
Nov 05 20:12:04 ip-172-27-189-80 podman[1774]: 2020-11-05 20:12:04.490513 E | rafthttp: failed to read 2e26cab88dcc7664 on stream MsgApp v2 (read tcp 172.27.189.80:45022->172.27.186.181:2380: i/o timeout)
Nov 05 20:12:57 ip-172-27-189-80 podman[1774]: 2020-11-05 20:12:57.924410 E | rafthttp: failed to write 293a74e392af7f38 on stream Message (write tcp 172.27.189.80:2380->172.27.191.45:38182: write: connection reset by peer)
Nov 05 20:13:51 ip-172-27-189-80 podman[1774]: 2020-11-05 20:13:51.139039 E | rafthttp: failed to read 293a74e392af7f38 on stream Message (read tcp 172.27.189.80:59796->172.27.191.45:2380: i/o timeout)
Nov 05 20:13:51 ip-172-27-189-80 podman[1774]: 2020-11-05 20:13:51.159536 E | rafthttp: failed to read 2e26cab88dcc7664 on stream MsgApp v2 (read tcp 172.27.189.80:50836->172.27.186.181:2380: i/o timeout)
Nov 05 20:16:24 ip-172-27-189-80 podman[1774]: 2020-11-05 20:16:24.496357 E | rafthttp: failed to read 2e26cab88dcc7664 on stream Message (read tcp 172.27.189.80:51608->172.27.186.181:2380: i/o timeout)
Nov 05 20:16:24 ip-172-27-189-80 podman[1774]: 2020-11-05 20:16:24.640868 E | rafthttp: failed to read 293a74e392af7f38 on stream MsgApp v2 (read tcp 172.27.189.80:60568->172.27.191.45:2380: i/o timeout)
Nov 05 20:17:20 ip-172-27-189-80 podman[1774]: 2020-11-05 20:17:20.670347 E | rafthttp: failed to dial 293a74e392af7f38 on stream Message (dial tcp: i/o timeout)
Nov 05 20:17:20 ip-172-27-189-80 podman[1774]: 2020-11-05 20:17:20.671373 E | rafthttp: failed to dial 2e26cab88dcc7664 on stream Message (dial tcp: i/o timeout)
Nov 05 20:17:20 ip-172-27-189-80 podman[1774]: 2020-11-05 20:17:20.752767 E | rafthttp: failed to dial 293a74e392af7f38 on stream MsgApp v2 (dial tcp: i/o timeout)
Nov 05 20:17:22 ip-172-27-189-80 podman[1774]: 2020-11-05 20:17:22.670739 E | rafthttp: failed to dial 2e26cab88dcc7664 on stream Message (dial tcp: lookup ip-172-27-186-181.us-west-2.compute.internal on 172.19.92.12:53: read udp 172.27.189.80:53385->172.19.92.12:53: i/o timeout)
Nov 05 20:17:27 ip-172-27-189-80 podman[1774]: 2020-11-05 20:17:27.805085 E | rafthttp: failed to read 293a74e392af7f38 on stream Message (read tcp 172.27.189.80:62102->172.27.191.45:2380: i/o timeout)
Nov 05 20:21:41 ip-172-27-189-80 podman[1774]: 2020-11-05 20:21:41.169964 E | rafthttp: failed to read 2e26cab88dcc7664 on stream MsgApp v2 (read tcp 172.27.189.80:53132->172.27.186.181:2380: i/o timeout)
Nov 05 20:22:33 ip-172-27-189-80 podman[1774]: 2020-11-05 20:22:33.180875 E | rafthttp: failed to dial 2e26cab88dcc7664 on stream MsgApp v2 (dial tcp 172.27.186.181:2380: i/o timeout)
Nov 05 20:51:47 ip-172-27-189-80 podman[1774]: 2020-11-05 20:51:47.876027 E | rafthttp: failed to read 2e26cab88dcc7664 on stream MsgApp v2 (read tcp 172.27.189.80:55990->172.27.186.181:2380: i/o timeout)
Nov 05 20:51:48 ip-172-27-189-80 podman[1774]: 2020-11-05 20:51:48.093153 E | rafthttp: failed to read 293a74e392af7f38 on stream Message (read tcp 172.27.189.80:62326->172.27.191.45:2380: i/o timeout)
Nov 05 20:52:37 ip-172-27-189-80 podman[1774]: 2020-11-05 20:52:37.948682 E | rafthttp: failed to write 293a74e392af7f38 on stream Message (write tcp 172.27.189.80:2380->172.27.191.45:49938: write: connection reset by peer)

etcd "timed out" warnings:


Nov 05 20:52:32 ip-172-27-189-80 podman[1774]: 2020-11-05 20:52:32.542741 W | etcdserver: read-only range request "key:\"/registry/traefik.containo.us/middlewares/\" range_end:\"/registry/traefik.containo.us/middlewares0\" count_only:true " with result "error:etcdserver: request timed out" took too long (7.706909848s) to execute
Nov 05 20:52:32 ip-172-27-189-80 podman[1774]: 2020-11-05 20:52:32.542816 W | etcdserver: read-only range request "key:\"/registry/validatingwebhookconfigurations/\" range_end:\"/registry/validatingwebhookconfigurations0\" count_only:true " with result "error:etcdserver: request timed out" took too long (12.570813908s) to execute
Nov 05 20:52:32 ip-172-27-189-80 podman[1774]: 2020-11-05 20:52:32.542837 W | etcdserver: read-only range request "key:\"/registry/jobs/\" range_end:\"/registry/jobs0\" count_only:true " with result "error:etcdserver: request timed out" took too long (8.034930894s) to execute
Nov 05 20:52:32 ip-172-27-189-80 podman[1774]: 2020-11-05 20:52:32.542886 W | etcdserver: read-only range request "key:\"/registry/configmaps/\" range_end:\"/registry/configmaps0\" count_only:true " with result "error:etcdserver: request timed out" took too long (13.842682645s) to execute
Nov 05 20:52:32 ip-172-27-189-80 podman[1774]: 2020-11-05 20:52:32.544054 W | etcdserver: read-only range request "key:\"/registry/clusterroles/\" range_end:\"/registry/clusterroles0\" count_only:true " with result "error:etcdserver: request timed out" took too long (9.317314568s) to execute
Nov 05 20:52:39 ip-172-27-189-80 podman[1774]: 2020-11-05 20:52:39.542591 W | etcdserver: timed out waiting for read index response (local node might have slow network)```

lucab commented 3 years ago

@wcurry thanks for the very accurate report and for the bisection!

I'd agree the console logging seems unrelated, while IOPS-throttling seems more interesting (but possibly not exactly the same thing).

Here the most suspicious thing to me seems to be this:

[ +0.005115] ena 0000:00:03.0 eth0: The number of lost tx completions is above the threshold (156 > 128). Reset the device

Overall it looks like the kernel is having troubles keeping up with the load, and it seems to be something somehow specific to AWS or related to ENA. I don't think there is anything FCOS-specific at play. We unfortunately don't have the knowledge to triage and fix this here, so it would be better to bring it up to AWS kernel developers.

/cc @davdunc mmerkes

wcurry commented 3 years ago

I created this issue at amzn-drivers: https://github.com/amzn/amzn-drivers/issues/147

Of note, the ena version is the same between those two kernels. Here are the notes I provided in that issue:

Good

OS: Fedora Coreos 31.20200310.3.0 Kernel: 5.5.8-200.fc31.x86_64 ena version: 2.1.0K

Bad

OS: Fedora Coreos 31.20200323.2.0 Kernel: 5.5.10-200.fc31.x86_64 ena version: 2.1.0K

Good - FCOS 31.20200323.2.0 / 5.5.10-200.fc31.x86_64

$ ssh -i ~/.ssh/k8s-dev-us-west-2 core@172.27.187.214
Fedora CoreOS 31.20200310.3.0
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/c/server/coreos/

[core@ip-172-27-187-214 ~]$ uname -a
Linux ip-172-27-187-214 5.5.8-200.fc31.x86_64 #1 SMP Thu Mar 5 21:28:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[core@ip-172-27-187-214 ~]$ modinfo ena
filename:       /lib/modules/5.5.8-200.fc31.x86_64/kernel/drivers/net/ethernet/amazon/ena/ena.ko.xz
version:        2.1.0K
license:        GPL
description:    Elastic Network Adapter (ENA)
author:         Amazon.com, Inc. or its affiliates
srcversion:     DAAE6CFC0FC2113B5776480
alias:          pci:v00001D0Fd0000EC21sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd0000EC20sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd00001EC2sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd00000EC2sv*sd*bc*sc*i*
depends:
retpoline:      Y
intree:         Y
name:           ena
vermagic:       5.5.8-200.fc31.x86_64 SMP mod_unload
sig_id:         PKCS#7
signer:         Fedora kernel signing key
sig_key:        0C:5D:ED:30:0B:3B:E3:23:0B:AD:A5:10:3E:7E:29:76:0E:6B:3A:1E
sig_hashalgo:   sha256
signature:      2A:A6:13:DB:14:78:41:12:F7:75:4D:6C:E5:B3:4E:45:6A:C0:3F:B9:
                6B:CA:73:16:A4:87:2B:42:67:D6:A5:4A:5D:1C:0F:0D:53:EF:C6:69:
                29:35:EB:AA:AC:C0:36:7F:DB:28:F7:25:1B:8E:31:A1:55:9D:78:EA:
                84:3A:61:9C:1C:58:74:AA:8B:BB:8B:AE:28:FC:9F:4D:68:CF:FA:CC:
                25:38:C6:15:F0:55:0E:7A:D1:31:CB:F9:73:C7:D4:32:C2:90:8A:10:
                31:43:BF:A1:08:12:C5:AA:96:8F:CE:F6:D0:9A:96:BA:60:18:A7:1F:
                10:1B:B2:BE:80:78:08:B0:07:14:99:E3:BD:6C:A7:D6:3E:57:45:BF:
                A4:48:6E:D4:9D:06:AE:51:C2:1C:3C:54:B0:36:8A:1D:2C:F6:0F:18:
                59:23:D9:BB:91:16:A7:EE:57:E9:7E:DB:22:0D:5D:62:25:E9:EF:97:
                F4:B7:86:DC:DE:B6:52:7C:AF:6A:CF:43:EA:A0:F7:70:D7:C5:97:8F:
                DC:7E:55:AB:F9:55:66:B8:9F:2D:C4:16:16:FE:F5:88:18:26:0E:A5:
                17:6D:64:CF:63:1A:B5:53:43:58:5D:11:19:76:4F:3D:B7:00:54:75:
                C5:45:7A:56:C7:AF:39:CF:E5:21:D0:43:58:53:20:58:09:0B:B9:AE:
                94:BE:90:51:37:DE:FF:24:74:CE:48:AB:3D:68:FB:BF:D6:5B:24:14:
                88:D9:DD:52:F3:3A:EE:6A:AA:21:77:76:C8:15:6C:50:BB:C5:21:E5:
                B5:41:C8:DA:61:61:0D:C2:48:5B:43:79:72:1D:29:94:CA:47:25:1B:
                59:AE:4D:5E:8D:5B:2C:FF:94:88:FC:34:6C:95:A8:53:8D:68:23:02:
                1D:04:A5:00:57:4F:BD:00:E4:6D:1E:1E:3A:2C:7F:43:A9:2A:3B:87:
                2A:D0:17:A8:67:74:13:A1:DA:E6:E6:8D:AA:A5:BB:4E:32:8A:67:35:
                BB:26:5C:39:9A:D9:F5:61:79:E7:E4:AE:4E:09:33:F3:F9:EE:8C:09:
                75:A6:74:1E:41:4E:82:98:A8:AA:04:99:AA:90:4D:DD:CD:CA:D8:95:
                67:2C:29:55:E8:C9:EA:23:A5:E4:EC:83:04:08:4A:CA:A0:84:1B:A1:
                4A:96:7B:3F:BD:36:2D:70:FB:A5:43:96:C3:24:69:41:A8:8E:FC:99:
                65:5F:7E:2B:A1:3D:D3:A0:77:86:F2:77:BC:69:F4:21:C7:3E:D7:89:
                C5:1A:7A:F0:D2:78:93:EE:BD:A5:F5:3F:3E:66:0F:EB:08:70:55:19:
                3D:1F:24:14:36:CD:51:E1:E5:FB:F4:22
parm:           debug:Debug level (0=none,...,16=all) (int)
[core@ip-172-27-187-214 ~]$ rpm -q --whatprovides /lib/modules/5.5.8-200.fc31.x86_64/kernel/drivers/net/ethernet/amazon/ena/ena.ko.xz
kernel-core-5.5.8-200.fc31.x86_64

Bad - FCOS 31.20200323.2.0 / 5.5.10-200.fc31.x86_64

$ ssh -i ~/.ssh/k8s-dev-us-west-2 core@172.27.187.162
Fedora CoreOS 31.20200323.2.0
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/c/server/coreos/

Last login: Mon Nov  9 20:09:18 2020 from 10.228.210.199
[core@ip-172-27-187-162 ~]$ uname -a
Linux ip-172-27-187-162 5.5.10-200.fc31.x86_64 #1 SMP Wed Mar 18 14:21:38 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[core@ip-172-27-187-162 ~]$ modinfo ena
filename:       /lib/modules/5.5.10-200.fc31.x86_64/kernel/drivers/net/ethernet/amazon/ena/ena.ko.xz
version:        2.1.0K
license:        GPL
description:    Elastic Network Adapter (ENA)
author:         Amazon.com, Inc. or its affiliates
srcversion:     DAAE6CFC0FC2113B5776480
alias:          pci:v00001D0Fd0000EC21sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd0000EC20sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd00001EC2sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd00000EC2sv*sd*bc*sc*i*
depends:
retpoline:      Y
intree:         Y
name:           ena
vermagic:       5.5.10-200.fc31.x86_64 SMP mod_unload
sig_id:         PKCS#7
signer:         Fedora kernel signing key
sig_key:        67:90:9D:B2:92:99:F6:87:CC:07:EF:39:B6:7A:EC:9D:E7:E2:A2:60
sig_hashalgo:   sha256
signature:      7D:97:AB:FB:9C:FD:7B:70:E9:C9:3F:39:3B:9A:3A:B7:42:77:41:15:
                60:7B:1D:BD:B6:08:62:DA:64:B6:5E:F7:46:1A:2F:6D:8B:5E:80:2A:
                8F:88:5B:05:1F:AF:2C:B3:53:52:E0:8D:CB:BB:2C:D3:8E:E1:D1:DC:
                90:3C:27:CD:44:9E:7A:4B:14:1E:A9:D8:CA:72:7D:BB:F3:2B:59:85:
                B2:BB:48:83:75:45:24:28:B1:8F:EC:AA:79:E4:B9:CA:92:2F:09:4E:
                55:2D:28:11:EC:88:80:DC:D3:95:2E:BF:0F:67:59:76:5E:83:05:08:
                2E:CF:B2:FE:3E:C3:7A:3B:15:0F:67:73:14:C1:92:AF:4F:40:F1:51:
                2C:9D:D1:45:2E:F4:BC:59:50:51:B9:BC:AC:02:27:E6:2E:6F:E8:DB:
                48:EF:A8:AA:B8:28:8C:1D:B5:42:A0:73:4F:41:CC:1E:26:6F:21:93:
                50:2A:CF:B6:65:5F:35:29:3D:39:7B:6B:BC:62:0B:6D:2A:7E:7B:65:
                C4:E2:D4:CA:1D:6B:68:B7:B1:CE:94:08:60:37:D2:ED:0B:F2:FC:D1:
                BD:91:CA:30:67:39:1A:E0:64:97:BA:5A:FE:FE:4C:E3:8B:FD:56:52:
                DE:5D:A3:B8:A0:40:D7:46:07:70:4C:B7:8C:CD:CE:5C:F7:52:C2:5F:
                5F:AF:4E:FB:55:17:CF:89:C0:AA:49:38:A7:66:B2:53:74:96:7A:42:
                65:85:7F:18:95:B4:A1:87:31:88:30:57:4C:E8:C9:9D:55:12:87:07:
                35:72:BC:FD:85:C9:F4:85:B6:0A:96:F9:73:BA:F0:22:8A:EA:7B:CF:
                FB:92:B2:BA:82:98:F3:27:83:B3:D4:9F:D2:39:3C:37:90:99:A2:BD:
                43:41:A7:C7:03:76:86:EC:A6:8D:16:F9:25:14:E7:97:34:EC:E5:EE:
                00:E4:19:2A:B8:23:AD:7B:00:54:79:96:BC:00:F5:47:B2:7C:AC:CF:
                6D:26:64:FD:B3:01:15:98:DF:09:B4:F0:09:ED:87:FA:E1:90:0F:98:
                E5:F8:BE:EF:12:32:ED:AC:57:8C:CD:8F:AF:E7:AD:0A:3D:01:8F:EE:
                1D:4C:D1:62:38:59:F4:FF:B1:D3:B7:B7:1F:97:F3:A8:28:0C:A3:3B:
                CC:A5:E7:E6:FD:85:9F:7A:E5:0B:D0:E5:16:4B:D5:72:66:95:8F:7C:
                C1:B4:BA:A7:0C:01:25:39:03:B4:76:18:C6:0B:D1:B8:1B:F5:45:FA:
                5E:B9:78:3F:24:D5:BE:E6:91:59:87:FC:04:4C:3F:BB:57:A3:4B:4C:
                45:89:D2:A2:62:61:5D:A6:D2:95:DF:2A
parm:           debug:Debug level (0=none,...,16=all) (int)

[core@ip-172-27-187-162 ~]$ rpm -q --whatprovides /lib/modules/5.5.10-200.fc31.x86_64/kernel/drivers/net/ethernet/amazon/ena/ena.ko.xz
kernel-core-5.5.10-200.fc31.x86_64

jlebon commented 3 years ago

To confirm, do you also see this issue in f32 versions? (And might be worth checking f33 as well).

Edit: Ahh right I see you did find this because it was present in f32. Would you be able to test f33 as well? It's possible that it was fixed in the latest kernel there. If so, then it might be easier to just wait until f33 hits testing and stable.

wcurry commented 3 years ago

I happened to test 33.20201101.1.0 and saw the issue there.

jlebon commented 3 years ago

I scanned the logs for 5.5.9 and 5.5.10 and nothing obvious jumped out.

wcurry commented 3 years ago

I found the issue. We were enabling SMT on first-boot by running a service with the following command:

  - name: enable-smt-firstboot.service
    enabled: true
    contents: |
      [Unit]
      Description=Enable SMT on first boot on Intel CPUs to disable MDS mitigation
      DefaultDependencies=no
      Before=sysinit.target shutdown.target
      Conflicts=shutdown.target
      ConditionFirstBoot=true
      [Service]
      Type=oneshot
      ExecStart=/bin/bash -c 'active="$(cat /sys/devices/system/cpu/smt/active)" && if [[ "$active" != 1 ]] && grep -q "vendor_id.*GenuineIntel" /proc/cpuinfo; then echo "Enabling SMT." && echo on > /sys/devices/system/cpu/smt/control; fi'
      [Install]
      WantedBy=sysinit.target

Later reboots took advantage of another unit that appended "mitigations=auto" to kargs.

As of 31.20200323.2.0, this apparently stopped working.

When adding only "--reboot" to our kargs unit and removing the above /sys/devices... unit, our etcd cluster would not survive simultaneous immediate reboots.

I have added the following to each of the systemd service units (excluding the kargs unit) to delay their start until second-boot:

ConditionKernelCommandLine=!ignition.firstboot

wcurry commented 3 years ago

To clarify the last comment, our etcd/kube-system hosts didn't recover from a simultaneous reboot due to the use of bootkube and lack of pod-checkpointing.

I'm closing this issue as we've got it working.

dustymabe commented 3 years ago

@wcurry - I'm glad you were able to figure out how to get unblocked. Thanks for updating this issue.

coreos / fedora-coreos-tracker