wcurry closed this issue 3 years ago
@wcurry thanks for the very accurate report and for the bisection!
I'd agree the console logging seems unrelated, while IOPS-throttling seems more interesting (but possibly not exactly the same thing).
Here the most suspicious thing to me seems to be this:
[ +0.005115] ena 0000:00:03.0 eth0: The number of lost tx completions is above the threshold (156 > 128). Reset the device
Overall it looks like the kernel is having trouble keeping up with the load, and the problem seems specific to AWS or related to ENA. I don't think there is anything FCOS-specific at play. Unfortunately we don't have the knowledge to triage and fix this here, so it would be better to bring it up with the AWS kernel developers.
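To track how often the reset quoted above happens, the counter values can be pulled out of saved dmesg output. This is only a minimal sketch: the pattern is copied from the single log line in this thread, so other ENA driver versions may word the message differently, and `parse_lost_tx` is a hypothetical helper name.

```python
import re

# Wording taken from the dmesg line quoted in this thread (assumption: other
# driver versions may phrase it differently).
LOST_TX_RE = re.compile(
    r"lost tx completions is above the threshold \((\d+) > (\d+)\)"
)

def parse_lost_tx(line):
    """Return (lost, threshold) for an ENA lost-tx reset message, else None."""
    match = LOST_TX_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

sample = (
    "[ +0.005115] ena 0000:00:03.0 eth0: The number of lost tx completions "
    "is above the threshold (156 > 128). Reset the device"
)
print(parse_lost_tx(sample))  # prints (156, 128)
```

Feeding each line of `dmesg` output through the helper and counting the non-`None` results gives a rough reset count per boot.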
/cc @davdunc mmerkes
I created this issue at amzn-drivers: https://github.com/amzn/amzn-drivers/issues/147
Of note, the ena version is the same between those two kernels. Here are the notes I provided in that issue:
OS: Fedora Coreos 31.20200310.3.0 Kernel: 5.5.8-200.fc31.x86_64 ena version: 2.1.0K
OS: Fedora Coreos 31.20200323.2.0 Kernel: 5.5.10-200.fc31.x86_64 ena version: 2.1.0K
$ ssh -i ~/.ssh/k8s-dev-us-west-2 core@172.27.187.214
Fedora CoreOS 31.20200310.3.0
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/c/server/coreos/
[core@ip-172-27-187-214 ~]$ uname -a
Linux ip-172-27-187-214 5.5.8-200.fc31.x86_64 #1 SMP Thu Mar 5 21:28:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[core@ip-172-27-187-214 ~]$ modinfo ena
filename: /lib/modules/5.5.8-200.fc31.x86_64/kernel/drivers/net/ethernet/amazon/ena/ena.ko.xz
version: 2.1.0K
license: GPL
description: Elastic Network Adapter (ENA)
author: Amazon.com, Inc. or its affiliates
srcversion: DAAE6CFC0FC2113B5776480
alias: pci:v00001D0Fd0000EC21sv*sd*bc*sc*i*
alias: pci:v00001D0Fd0000EC20sv*sd*bc*sc*i*
alias: pci:v00001D0Fd00001EC2sv*sd*bc*sc*i*
alias: pci:v00001D0Fd00000EC2sv*sd*bc*sc*i*
depends:
retpoline: Y
intree: Y
name: ena
vermagic: 5.5.8-200.fc31.x86_64 SMP mod_unload
sig_id: PKCS#7
signer: Fedora kernel signing key
sig_key: 0C:5D:ED:30:0B:3B:E3:23:0B:AD:A5:10:3E:7E:29:76:0E:6B:3A:1E
sig_hashalgo: sha256
signature: 2A:A6:13:DB:14:78:41:12:F7:75:4D:6C:E5:B3:4E:45:6A:C0:3F:B9:
6B:CA:73:16:A4:87:2B:42:67:D6:A5:4A:5D:1C:0F:0D:53:EF:C6:69:
29:35:EB:AA:AC:C0:36:7F:DB:28:F7:25:1B:8E:31:A1:55:9D:78:EA:
84:3A:61:9C:1C:58:74:AA:8B:BB:8B:AE:28:FC:9F:4D:68:CF:FA:CC:
25:38:C6:15:F0:55:0E:7A:D1:31:CB:F9:73:C7:D4:32:C2:90:8A:10:
31:43:BF:A1:08:12:C5:AA:96:8F:CE:F6:D0:9A:96:BA:60:18:A7:1F:
10:1B:B2:BE:80:78:08:B0:07:14:99:E3:BD:6C:A7:D6:3E:57:45:BF:
A4:48:6E:D4:9D:06:AE:51:C2:1C:3C:54:B0:36:8A:1D:2C:F6:0F:18:
59:23:D9:BB:91:16:A7:EE:57:E9:7E:DB:22:0D:5D:62:25:E9:EF:97:
F4:B7:86:DC:DE:B6:52:7C:AF:6A:CF:43:EA:A0:F7:70:D7:C5:97:8F:
DC:7E:55:AB:F9:55:66:B8:9F:2D:C4:16:16:FE:F5:88:18:26:0E:A5:
17:6D:64:CF:63:1A:B5:53:43:58:5D:11:19:76:4F:3D:B7:00:54:75:
C5:45:7A:56:C7:AF:39:CF:E5:21:D0:43:58:53:20:58:09:0B:B9:AE:
94:BE:90:51:37:DE:FF:24:74:CE:48:AB:3D:68:FB:BF:D6:5B:24:14:
88:D9:DD:52:F3:3A:EE:6A:AA:21:77:76:C8:15:6C:50:BB:C5:21:E5:
B5:41:C8:DA:61:61:0D:C2:48:5B:43:79:72:1D:29:94:CA:47:25:1B:
59:AE:4D:5E:8D:5B:2C:FF:94:88:FC:34:6C:95:A8:53:8D:68:23:02:
1D:04:A5:00:57:4F:BD:00:E4:6D:1E:1E:3A:2C:7F:43:A9:2A:3B:87:
2A:D0:17:A8:67:74:13:A1:DA:E6:E6:8D:AA:A5:BB:4E:32:8A:67:35:
BB:26:5C:39:9A:D9:F5:61:79:E7:E4:AE:4E:09:33:F3:F9:EE:8C:09:
75:A6:74:1E:41:4E:82:98:A8:AA:04:99:AA:90:4D:DD:CD:CA:D8:95:
67:2C:29:55:E8:C9:EA:23:A5:E4:EC:83:04:08:4A:CA:A0:84:1B:A1:
4A:96:7B:3F:BD:36:2D:70:FB:A5:43:96:C3:24:69:41:A8:8E:FC:99:
65:5F:7E:2B:A1:3D:D3:A0:77:86:F2:77:BC:69:F4:21:C7:3E:D7:89:
C5:1A:7A:F0:D2:78:93:EE:BD:A5:F5:3F:3E:66:0F:EB:08:70:55:19:
3D:1F:24:14:36:CD:51:E1:E5:FB:F4:22
parm: debug:Debug level (0=none,...,16=all) (int)
[core@ip-172-27-187-214 ~]$ rpm -q --whatprovides /lib/modules/5.5.8-200.fc31.x86_64/kernel/drivers/net/ethernet/amazon/ena/ena.ko.xz
kernel-core-5.5.8-200.fc31.x86_64
$ ssh -i ~/.ssh/k8s-dev-us-west-2 core@172.27.187.162
Fedora CoreOS 31.20200323.2.0
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/c/server/coreos/
Last login: Mon Nov 9 20:09:18 2020 from 10.228.210.199
[core@ip-172-27-187-162 ~]$ uname -a
Linux ip-172-27-187-162 5.5.10-200.fc31.x86_64 #1 SMP Wed Mar 18 14:21:38 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[core@ip-172-27-187-162 ~]$ modinfo ena
filename: /lib/modules/5.5.10-200.fc31.x86_64/kernel/drivers/net/ethernet/amazon/ena/ena.ko.xz
version: 2.1.0K
license: GPL
description: Elastic Network Adapter (ENA)
author: Amazon.com, Inc. or its affiliates
srcversion: DAAE6CFC0FC2113B5776480
alias: pci:v00001D0Fd0000EC21sv*sd*bc*sc*i*
alias: pci:v00001D0Fd0000EC20sv*sd*bc*sc*i*
alias: pci:v00001D0Fd00001EC2sv*sd*bc*sc*i*
alias: pci:v00001D0Fd00000EC2sv*sd*bc*sc*i*
depends:
retpoline: Y
intree: Y
name: ena
vermagic: 5.5.10-200.fc31.x86_64 SMP mod_unload
sig_id: PKCS#7
signer: Fedora kernel signing key
sig_key: 67:90:9D:B2:92:99:F6:87:CC:07:EF:39:B6:7A:EC:9D:E7:E2:A2:60
sig_hashalgo: sha256
signature: 7D:97:AB:FB:9C:FD:7B:70:E9:C9:3F:39:3B:9A:3A:B7:42:77:41:15:
60:7B:1D:BD:B6:08:62:DA:64:B6:5E:F7:46:1A:2F:6D:8B:5E:80:2A:
8F:88:5B:05:1F:AF:2C:B3:53:52:E0:8D:CB:BB:2C:D3:8E:E1:D1:DC:
90:3C:27:CD:44:9E:7A:4B:14:1E:A9:D8:CA:72:7D:BB:F3:2B:59:85:
B2:BB:48:83:75:45:24:28:B1:8F:EC:AA:79:E4:B9:CA:92:2F:09:4E:
55:2D:28:11:EC:88:80:DC:D3:95:2E:BF:0F:67:59:76:5E:83:05:08:
2E:CF:B2:FE:3E:C3:7A:3B:15:0F:67:73:14:C1:92:AF:4F:40:F1:51:
2C:9D:D1:45:2E:F4:BC:59:50:51:B9:BC:AC:02:27:E6:2E:6F:E8:DB:
48:EF:A8:AA:B8:28:8C:1D:B5:42:A0:73:4F:41:CC:1E:26:6F:21:93:
50:2A:CF:B6:65:5F:35:29:3D:39:7B:6B:BC:62:0B:6D:2A:7E:7B:65:
C4:E2:D4:CA:1D:6B:68:B7:B1:CE:94:08:60:37:D2:ED:0B:F2:FC:D1:
BD:91:CA:30:67:39:1A:E0:64:97:BA:5A:FE:FE:4C:E3:8B:FD:56:52:
DE:5D:A3:B8:A0:40:D7:46:07:70:4C:B7:8C:CD:CE:5C:F7:52:C2:5F:
5F:AF:4E:FB:55:17:CF:89:C0:AA:49:38:A7:66:B2:53:74:96:7A:42:
65:85:7F:18:95:B4:A1:87:31:88:30:57:4C:E8:C9:9D:55:12:87:07:
35:72:BC:FD:85:C9:F4:85:B6:0A:96:F9:73:BA:F0:22:8A:EA:7B:CF:
FB:92:B2:BA:82:98:F3:27:83:B3:D4:9F:D2:39:3C:37:90:99:A2:BD:
43:41:A7:C7:03:76:86:EC:A6:8D:16:F9:25:14:E7:97:34:EC:E5:EE:
00:E4:19:2A:B8:23:AD:7B:00:54:79:96:BC:00:F5:47:B2:7C:AC:CF:
6D:26:64:FD:B3:01:15:98:DF:09:B4:F0:09:ED:87:FA:E1:90:0F:98:
E5:F8:BE:EF:12:32:ED:AC:57:8C:CD:8F:AF:E7:AD:0A:3D:01:8F:EE:
1D:4C:D1:62:38:59:F4:FF:B1:D3:B7:B7:1F:97:F3:A8:28:0C:A3:3B:
CC:A5:E7:E6:FD:85:9F:7A:E5:0B:D0:E5:16:4B:D5:72:66:95:8F:7C:
C1:B4:BA:A7:0C:01:25:39:03:B4:76:18:C6:0B:D1:B8:1B:F5:45:FA:
5E:B9:78:3F:24:D5:BE:E6:91:59:87:FC:04:4C:3F:BB:57:A3:4B:4C:
45:89:D2:A2:62:61:5D:A6:D2:95:DF:2A
parm: debug:Debug level (0=none,...,16=all) (int)
[core@ip-172-27-187-162 ~]$ rpm -q --whatprovides /lib/modules/5.5.10-200.fc31.x86_64/kernel/drivers/net/ethernet/amazon/ena/ena.ko.xz
kernel-core-5.5.10-200.fc31.x86_64
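The comparison above (same ena version and srcversion on both kernels) can also be checked mechanically rather than by eye. The following is a rough sketch that assumes `modinfo` prints `key: value` lines as shown in the sessions above. `modinfo_fields` is a hypothetical helper, and the two sample strings are abbreviated copies of the output in this thread.

```python
def modinfo_fields(text, wanted=("version", "srcversion")):
    """Pick the first value of each wanted key from `modinfo` text output."""
    found = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Signature continuation lines have hex "keys" and are skipped here.
        if sep and key in wanted and key not in found:
            found[key] = value.strip()
    return found

old_kernel = """\
filename: /lib/modules/5.5.8-200.fc31.x86_64/kernel/drivers/net/ethernet/amazon/ena/ena.ko.xz
version: 2.1.0K
srcversion: DAAE6CFC0FC2113B5776480
vermagic: 5.5.8-200.fc31.x86_64 SMP mod_unload
"""
new_kernel = """\
filename: /lib/modules/5.5.10-200.fc31.x86_64/kernel/drivers/net/ethernet/amazon/ena/ena.ko.xz
version: 2.1.0K
srcversion: DAAE6CFC0FC2113B5776480
vermagic: 5.5.10-200.fc31.x86_64 SMP mod_unload
"""

# True means the driver source is identical even though vermagic differs.
print(modinfo_fields(old_kernel) == modinfo_fields(new_kernel))  # prints True
```

Only `version` and `srcversion` are compared because `filename`, `vermagic`, and the signature fields necessarily change with every kernel build.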
To confirm, do you also see this issue in f32 versions? (And might be worth checking f33 as well).
Edit: Ahh right I see you did find this because it was present in f32. Would you be able to test f33 as well? It's possible that it was fixed in the latest kernel there. If so, then it might be easier to just wait until f33 hits testing and stable.
I happened to test 33.20201101.1.0 and saw the issue there.
I scanned the logs for 5.5.9 and 5.5.10 and nothing obvious jumped out.
I found the issue. We were enabling SMT on first boot with the following service unit:
- name: enable-smt-firstboot.service
  enabled: true
  contents: |
    [Unit]
    Description=Enable SMT on first boot on Intel CPUs to disable MDS mitigation
    DefaultDependencies=no
    Before=sysinit.target shutdown.target
    Conflicts=shutdown.target
    ConditionFirstBoot=true

    [Service]
    Type=oneshot
    ExecStart=/bin/bash -c 'active="$(cat /sys/devices/system/cpu/smt/active)" && if [[ "$active" != 1 ]] && grep -q "vendor_id.*GenuineIntel" /proc/cpuinfo; then echo "Enabling SMT." && echo on > /sys/devices/system/cpu/smt/control; fi'

    [Install]
    WantedBy=sysinit.target
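The one-line ExecStart in the unit above is hard to read, so here is the same decision written out as a small function. This is an illustrative sketch of the check only, not what the unit actually ran. `should_enable_smt` is a hypothetical name, and the inputs stand in for the contents of /sys/devices/system/cpu/smt/active and /proc/cpuinfo.

```python
import re

def should_enable_smt(smt_active, cpuinfo):
    """Mirror the unit's shell test: SMT is not active and the CPU is Intel."""
    not_active = smt_active.strip() != "1"
    # Same pattern the unit passes to grep; "." does not cross line breaks.
    is_intel = re.search(r"vendor_id.*GenuineIntel", cpuinfo) is not None
    return not_active and is_intel

# Hypothetical file contents for illustration.
intel = "processor\t: 0\nvendor_id\t: GenuineIntel\n"
amd = "processor\t: 0\nvendor_id\t: AuthenticAMD\n"

print(should_enable_smt("0\n", intel))  # prints True
print(should_enable_smt("1\n", intel))  # prints False
print(should_enable_smt("0\n", amd))    # prints False
```

When the function returns true, the unit writes `on` to /sys/devices/system/cpu/smt/control. That write is left out here because it needs root and changes system state.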
Later reboots took advantage of another unit that appended "mitigations=auto" to kargs.
As of 31.20200323.2.0, this apparently stopped working.
When adding only "--reboot" to our kargs unit and removing the above /sys/devices... unit, our etcd cluster would not survive simultaneous immediate reboots.
I have added the following to each of the systemd service units (excluding the kargs unit) to delay their start until the second boot:
ConditionKernelCommandLine=!ignition.firstboot
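For anyone unsure what that condition evaluates, here is a simplified sketch of the check. As I understand the systemd documentation, a single-word argument matches when the word appears on the kernel command line either by itself or as the left-hand side of an assignment, and the leading `!` negates the result. The helper names and sample command lines are hypothetical, and quoting rules are ignored.

```python
def has_karg(cmdline, word):
    """True if `word` appears as a bare argument or as the key of `word=value`."""
    return any(
        arg == word or arg.startswith(word + "=")
        for arg in cmdline.split()
    )

def runs_after_first_boot(cmdline):
    # Rough equivalent of ConditionKernelCommandLine=!ignition.firstboot
    return not has_karg(cmdline, "ignition.firstboot")

# Hypothetical /proc/cmdline contents for illustration.
first_boot = "BOOT_IMAGE=/vmlinuz ignition.firstboot mitigations=auto"
later_boot = "BOOT_IMAGE=/vmlinuz mitigations=auto"

print(runs_after_first_boot(first_boot))  # prints False
print(runs_after_first_boot(later_boot))  # prints True
```

With this condition the units are skipped while Ignition is still provisioning the host and only start once the machine has rebooted with its final kargs.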
To clarify the last comment, our etcd/kube-system hosts didn't recover from a simultaneous reboot due to the use of bootkube and lack of pod-checkpointing.
I'm closing this issue as we've got it working.
@wcurry - I'm glad you were able to figure out how to get unblocked. Thanks for updating this issue.
Describe the bug
Network is flaky after upgrade from 31.20200310.3.0 to 31.20200323.2.0.
Reproduction steps
Steps to reproduce the behavior:
Expected behavior
CPU should not lock up. Network should deliver all packets. Network interface should not reset.
Actual behavior
System details
Ignition config
Additional information
While performing an upgrade from 31.20200310.3.0 to the latest FCOS 32, I tracked this issue back to 31.20200323.2.0. 31.20200310.3.0 (the next oldest AMI available) does not exhibit the issue.
I found this issue (https://github.com/amzn/amzn-drivers/issues/84) that suggested console logging could be to blame. We had SELinux in permissive mode and it was spamming the log. I never observed the "too much work for irq..." error. I disabled SELinux anyway to clean up dmesg, and the problem persisted in a new cluster.
I found this issue (https://github.com/awslabs/amazon-eks-ami/issues/454) that suggests IOPS may be to blame. We have an NVMe disk and had 3 gp2 volumes attached. None of the volumes had used up their burst budget, but the root volume had come close. I changed all of these gp2 volumes to io1 with 3000 IOPS. The problem still exists with these settings.
dmesg errors:
etcd errors:
etcd "timed out" warnings: