Closed vrutkovs closed 2 years ago
/cc @LorbusChris
Looks like both aws and gcp had issues getting hostnames
There's not really a lot of detail in here about what the actual problem is. Does the problem present itself on initial install (no upgrade)? What messages in the logs are you focusing on? Can the problem be reproduced with vanilla FCOS?
Does the problem present itself on initial install
Fresh install works fine.
What messages in the logs are you focusing on?
(1 of 2) A start job is running for…lhost hostname (1min 6s / 5min 2s)
...
(2 of 2) A start job is running for…rn (SSH Keys) (1min 6s / no limit)
which shows that its a network failure
Can the problem be reproduced with vanilla FCOS?
Most likely not, didn't try that
Yes it looks like the node-valid-hostname.service
fails first, and then subsequently kubelet.service
fails as well (on both AWS and GCP). Could this be due to the recent changes in how NM/Afterburn sort out the setting of the hostname or did this only affect the GCP case? @cgwalters might know more about this.
From the GCP output.log linked in the description:
g \u001b[0;1;39mCoreO[ 240.887519] coreos-teardown-initramfs[890]: info: taking down network device: ens4
S Tear Down Init[ 240.897382] NetworkManager[560]: <info> [1647714345.6650] policy: set-hostname: current hostname was changed outside NetworkManager: 'localhost'
ramfs\u001b[0m...
[ 240.911105] coreos-teardown-initramfs[902]: RTNETLINK answers: Operation not supported
[ 240.920229] NetworkManager[560]: <info> [1647714345.6651] policy: set-hostname: set hostname to 'ci-ln-d4jm7mb-72292-vd7b9-master-0.c.XXXXXXXXXXXXXXXXXXXXXX.internal' (from DHCPv4)
[\u001b[0;32m OK \u001b[[ 240.936744] coreos-teardown-initramfs[890]: info: flushing all routing
0m] Stopped \u001b[0;[ 240.944890] coreos-teardown-initramfs[890]: info: no initramfs hostname information to propagate
1;39mIgnition (d[ 240.955207] coreos-teardown-initramfs[890]: info: no networking config is defined in the real root
isks)\u001b[0m.
[ 240.965777] NetworkManager[560]: <warn> [1647714345.6651] policy: set-hostname: couldn't set the system hostname to 'ci-ln-d4jm7mb-72292-vd7b9-master-0.c.XXXXXXXXXXXXXXXXXXXXXX.internal': (22) Invalid argument
[ 240.988267] coreos-teardown-initramfs[890]: info: skipping propagation of default networking configs
[\u001b[0;32m OK \u001b[[ 240.997891] NetworkManager[560]: <info> [1647714345.6754] policy: set-hostname: current hostname was changed outside NetworkManager: 'localhost'
0m] Stopped \u001b[0;[ 241.012549] NetworkManager[560]: <info> [1647714345.6754] policy: set-hostname: set hostname to 'ci-ln-d4jm7mb-72292-vd7b9-master-0.c.XXXXXXXXXXXXXXXXXXXXXX.internal' (from DHCPv4)
1;39mIgnition (k[ 241.030181] NetworkManager[560]: <warn> [1647714345.6754] policy: set-hostname: couldn't set the system hostname to 'ci-ln-d4jm7mb-72292-vd7b9-master-0.c.XXXXXXXXXXXXXXXXXXXXXX.internal': (22) Invalid argument
args)\u001b[0m.
Since (IIUC) we can't really associate this with FCOS version numbers does anybody want to provide a rpm-ostree db diff
between the two affected versions?
Does this happen if you do a fresh install (not upgrade) to okd:4.10.0-0.okd-2022-03-19-144548
? I'm trying to figure out what is special about the upgrade that's triggering this.
Sounds like this only happens on upgrades from the previous 4.10 release.
There are a lot of AVC denials in the linked logs, too. So, possibly SELinux is involved?
[ 9695.310318] audit: type=1400 audit(1647724444.357:3183): avc: denied { ioctl } for pid=410212 comm=\"iptables\" path=\"/sys/fs/cgroup\" dev=\"cgroup2\" ino=1 scontext=system_u:system_r:iptables_t:s0 tcontext=system_u:object_r:cgroup_t:s0 tclass=dir permissive=0\r\n
Can someone check the nodes before AND after upgrade and see if the SELinux policy has been modified by running the following command:
sudo ostree admin config-diff | grep selinux/targeted/policy
I haven't been able to get into an upgraded node, it seems network never comes up there. Will investigate further tomorrow. In the meantime, some logs from OKD machine-os 4.10.0-0.okd-2022-03-07-131213 (before upgrade). afterburn-sshkeys.service has failed there as well. I wonder what it's supposed to do, isn't ignition in charge of writing the keys?
sh-5.1# sudo ostree admin config-diff | grep selinux/targeted/policy
sh-5.1# rpm-ostree db diff
ostree diff commit from: rollback deployment (1372d0e8162cac5257828ecbb6ac1136c4229405781a763bde03fb3f16dc17d9)
ostree diff commit to: booted deployment (c50b405770c03fc9671df06998b12cd4b98f8a80db20e25b59aabad7a8745bb5)
Upgraded:
afterburn 5.1.0-1.fc35 -> 5.2.0-4.fc35
afterburn-dracut 5.1.0-1.fc35 -> 5.2.0-4.fc35
audit-libs 3.0.6-1.fc35 -> 3.0.7-1.fc35
btrfs-progs 5.15.1-1.fc35 -> 5.16.1-1.fc35
chrony 4.1-3.fc35 -> 4.2-1.fc35
conmon 2:2.0.30-2.fc35 -> 2:2.1.0-2.fc35
container-selinux 2:2.170.0-2.fc35 -> 2:2.173.1-1.fc35
containernetworking-plugins 1.0.1-1.fc35 -> 1.0.1-3.fc35
crun 1.4-1.fc35 -> 1.4.2-1.fc35
cups-libs 1:2.3.3op2-12.fc35 -> 1:2.3.3op2-13.fc35
dnsmasq 2.86-3.fc35 -> 2.86-4.fc35
ethtool 2:5.15-1.fc35 -> 2:5.16-1.fc35
expat 2.4.1-2.fc35 -> 2.4.3-1.fc35
flatpak-session-helper 1.12.2-1.fc35 -> 1.12.4-1.fc35
glib2 2.70.2-1.fc35 -> 2.70.3-2.fc35
glibc 2.34-11.fc35 -> 2.34-25.fc35
glibc-common 2.34-11.fc35 -> 2.34-25.fc35
glibc-minimal-langpack 2.34-11.fc35 -> 2.34-25.fc35
ignition 2.13.0-1.fc35 -> 2.13.0-3.fc35
kernel 5.15.10-200.fc35 -> 5.16.7-200.fc35
kernel-core 5.15.10-200.fc35 -> 5.16.7-200.fc35
kernel-modules 5.15.10-200.fc35 -> 5.16.7-200.fc35
libblkid 2.37.2-1.fc35 -> 2.37.3-1.fc35
libfdisk 2.37.2-1.fc35 -> 2.37.3-1.fc35
libibverbs 38.0-1.fc35 -> 38.1-2.fc35
libipa_hbac 2.6.1-1.fc35 -> 2.6.3-1.fc35
libmount 2.37.2-1.fc35 -> 2.37.3-1.fc35
libsmartcols 2.37.2-1.fc35 -> 2.37.3-1.fc35
libsmbclient 2:4.15.3-0.fc35 -> 2:4.15.5-0.fc35
libsss_certmap 2.6.1-1.fc35 -> 2.6.3-1.fc35
libsss_idmap 2.6.1-1.fc35 -> 2.6.3-1.fc35
libsss_nss_idmap 2.6.1-1.fc35 -> 2.6.3-1.fc35
libsss_sudo 2.6.1-1.fc35 -> 2.6.3-1.fc35
libuuid 2.37.2-1.fc35 -> 2.37.3-1.fc35
libwbclient 2:4.15.3-0.fc35 -> 2:4.15.5-0.fc35
libxcrypt 4.4.27-1.fc35 -> 4.4.28-1.fc35
libzstd 1.5.1-6.fc35 -> 1.5.2-1.fc35
lua-libs 5.4.3-2.fc35 -> 5.4.4-1.fc35
polkit 0.120-1.fc35 -> 0.120-1.fc35.1
polkit-libs 0.120-1.fc35 -> 0.120-1.fc35.1
rpm 4.17.0-1.fc35 -> 4.17.0-4.fc35
rpm-libs 4.17.0-1.fc35 -> 4.17.0-4.fc35
rpm-ostree 2022.1-2.fc35 -> 2022.2-1.fc35
rpm-ostree-libs 2022.1-2.fc35 -> 2022.2-1.fc35
rpm-plugin-selinux 4.17.0-1.fc35 -> 4.17.0-4.fc35
runc 2:1.0.2-2.fc35 -> 2:1.1.0-1.fc35
samba-client-libs 2:4.15.3-0.fc35 -> 2:4.15.5-0.fc35
samba-common 2:4.15.3-0.fc35 -> 2:4.15.5-0.fc35
samba-common-libs 2:4.15.3-0.fc35 -> 2:4.15.5-0.fc35
selinux-policy 35.8-1.fc35 -> 35.13-1.fc35
selinux-policy-targeted 35.8-1.fc35 -> 35.13-1.fc35
shadow-utils 2:4.9-8.fc35 -> 2:4.9-9.fc35
shadow-utils-subid 2:4.9-8.fc35 -> 2:4.9-9.fc35
sssd-ad 2.6.1-1.fc35 -> 2.6.3-1.fc35
sssd-client 2.6.1-1.fc35 -> 2.6.3-1.fc35
sssd-common 2.6.1-1.fc35 -> 2.6.3-1.fc35
sssd-common-pac 2.6.1-1.fc35 -> 2.6.3-1.fc35
sssd-ipa 2.6.1-1.fc35 -> 2.6.3-1.fc35
sssd-krb5 2.6.1-1.fc35 -> 2.6.3-1.fc35
sssd-krb5-common 2.6.1-1.fc35 -> 2.6.3-1.fc35
sssd-ldap 2.6.1-1.fc35 -> 2.6.3-1.fc35
systemd 249.7-2.fc35 -> 249.9-1.fc35
systemd-container 249.7-2.fc35 -> 249.9-1.fc35
systemd-libs 249.7-2.fc35 -> 249.9-1.fc35
systemd-pam 249.7-2.fc35 -> 249.9-1.fc35
systemd-resolved 249.7-2.fc35 -> 249.9-1.fc35
systemd-udev 249.7-2.fc35 -> 249.9-1.fc35
util-linux 2.37.2-1.fc35 -> 2.37.3-1.fc35
util-linux-core 2.37.2-1.fc35 -> 2.37.3-1.fc35
vim-data 2:8.2.4068-1.fc35 -> 2:8.2.4232-1.fc35
vim-minimal 2:8.2.4068-1.fc35 -> 2:8.2.4232-1.fc35
Removed:
containerd-1.5.8-1.fc35.x86_64
moby-engine-20.10.11-1.fc35.x86_64
pigz-2.5-2.fc35.x86_64
zincati-0.0.24-1.fc35.x86_64
Added:
NetworkManager-ovs-1:1.32.12-2.fc35.x86_64
cri-o-1.23.0-2.1.fc35.x86_64
cri-tools-1.21.0-.fc34.2.1.x86_64
dpdk-2:20.11-2.fc35.x86_64
glusterfs-9.5-1.fc35.x86_64
glusterfs-client-xlators-9.5-1.fc35.x86_64
glusterfs-fuse-9.5-1.fc35.x86_64
hwdata-0.356-1.fc35.noarch
libdrm-2.4.109-1.fc35.x86_64
libgfrpc0-9.5-1.fc35.x86_64
libgfxdr0-9.5-1.fc35.x86_64
libglusterfs0-9.5-1.fc35.x86_64
libmspack-0.10.1-0.6.alpha.fc35.x86_64
libpciaccess-0.16-5.fc35.x86_64
libtool-ltdl-2.4.6-42.fc35.x86_64
liburing-2.0-2.fc35.x86_64
libxslt-1.1.34-6.fc35.x86_64
mpdecimal-2.5.1-2.fc35.x86_64
open-vm-tools-11.3.0-2.fc35.x86_64
openshift-clients-0.0.1-1.el8.x86_64
openshift-hyperkube-1.23.3-1.0.759c22b.x86_64
openvswitch-2.15.0-7.fc35.x86_64
pciutils-3.7.0-4.fc35.x86_64
pciutils-libs-3.7.0-4.fc35.x86_64
python-pip-wheel-21.2.3-4.fc35.noarch
python-setuptools-wheel-57.4.0-1.fc35.noarch
python3-libs-3.10.2-1.fc35.x86_64
qemu-guest-agent-2:6.1.0-10.fc35.x86_64
sssd-nfs-idmap-2.6.3-1.fc35.x86_64
unbound-libs-1.13.2-1.fc35.x86_64
xmlsec1-1.2.29-4.fc35.x86_64
xmlsec1-openssl-1.2.29-4.fc35.x86_64
sh-5.1# journalctl -xeu afterburn-sshkeys@core.service
░░ A start job for unit afterburn-sshkeys@core.service has begun execution.
░░
░░ The job identifier is 268.
Mar 23 18:15:23 localhost afterburn[834]: Mar 23 18:15:23.738 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/sshKeys: Attempt #1
Mar 23 18:15:23 localhost afterburn[834]: Mar 23 18:15:23.744 INFO Failed to fetch: error sending request for url (http://169.254.169.254/computeMetadata/v1/instance/attr>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.744 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/sshKeys>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.746 INFO Fetch failed with 404: resource not found
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.747 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/ssh-key>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.748 INFO Fetch failed with 404: resource not found
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.748 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/block-p>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.749 INFO Fetch failed with 404: resource not found
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.749 INFO Fetching http://169.254.169.254/computeMetadata/v1/project/attributes/sshKeys:>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.750 INFO Fetch successful
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit afterburn-sshkeys@core.service has begun execution.
░░
░░ The job identifier is 268.
Mar 23 18:15:23 localhost afterburn[834]: Mar 23 18:15:23.738 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/sshKeys: Attempt #1
Mar 23 18:15:23 localhost afterburn[834]: Mar 23 18:15:23.744 INFO Failed to fetch: error sending request for url (http://169.254.169.254/computeMetadata/v1/instance/attr>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.744 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/sshKeys>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.746 INFO Fetch failed with 404: resource not found
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.747 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/ssh-key>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.748 INFO Fetch failed with 404: resource not found
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.748 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/block-p>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.749 INFO Fetch failed with 404: resource not found
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.749 INFO Fetching http://169.254.169.254/computeMetadata/v1/project/attributes/sshKeys:>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.750 INFO Fetch successful
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.750 INFO Fetching http://169.254.169.254/computeMetadata/v1/project/attributes/ssh-keys>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.751 INFO Fetch successful
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Error: failed to run
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Caused by:
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: 0: writing ssh keys
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: 1: invalid base64: Invalid last symbol 86, offset 366.
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 systemd[1]: afterburn-sshkeys@core.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ An ExecStart= process belonging to unit afterburn-sshkeys@core.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 systemd[1]: afterburn-sshkeys@core.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ The unit afterburn-sshkeys@core.service has entered the 'failed' state with result 'exit-code'.
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 systemd[1]: Failed to start Afterburn (SSH Keys).
░░ Subject: A start job for unit afterburn-sshkeys@core.service has failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit afterburn-sshkeys@core.service has finished with a failure.
░░
░░ The job identifier is 268 and the job result is failed.
afterburn-sshkeys.service has failed there as well. I wonder what it's supposed to do, isn't ignition in charge of writing the keys?
Ignition writes keys from the Ignition config. afterburn-sshkeys@.service
writes keys specified via the cloud's SSH key mechanism.
Invalid permissions for /etc/openvswitch
were breaking networking
Describe the bug OKD is running on top of FCOS via custom manifest https://github.com/openshift/.
Reproduction steps Steps to reproduce the behavior:
quay.io/openshift/okd:4.10.0-0.okd-2022-03-07-131213
toquay.io/openshift/okd:4.10.0-0.okd-2022-03-19-144548
Example jobs:Expected behavior Nodes upgrade and boot as expected
Actual behavior Nodes get stuck on boot due to network problem
System details
Example node output log for GCP test output.log