coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/
264 stars 60 forks source link

OKD 4.10 stable-to-stable gets stuck on network issue #1136

Closed vrutkovs closed 2 years ago

vrutkovs commented 2 years ago

Describe the bug OKD is running on top of FCOS via custom manifest https://github.com/openshift/.

Reproduction steps Steps to reproduce the behavior:

  1. Upgrade from quay.io/openshift/okd:4.10.0-0.okd-2022-03-07-131213 to quay.io/openshift/okd:4.10.0-0.okd-2022-03-19-144548 Example jobs:

Expected behavior Nodes upgrade and boot as expected

Actual behavior Nodes get stuck on boot due to network problem

System details

Example node output log for GCP test output.log

vrutkovs commented 2 years ago

/cc @LorbusChris

fortinj66 commented 2 years ago

Looks like both aws and gcp had issues getting hostnames

dustymabe commented 2 years ago

There's not really a lot of detail in here about what the actual problem is. Does the problem present itself on initial install (no upgrade)? What messages in the logs are you focusing on? Can the problem be reproduced with vanilla FCOS?

vrutkovs commented 2 years ago

Does the problem present itself on initial install

Fresh install works fine.

What messages in the logs are you focusing on?

(1 of 2) A start job is running for…lhost hostname (1min 6s / 5min 2s)
...
(2 of 2) A start job is running for…rn (SSH Keys) (1min 6s / no limit)

which shows that its a network failure

Can the problem be reproduced with vanilla FCOS?

Most likely not, didn't try that

LorbusChris commented 2 years ago

Yes it looks like the node-valid-hostname.service fails first, and then subsequently kubelet.service fails as well (on both AWS and GCP). Could this be due to the recent changes in how NM/Afterburn sort out the setting of the hostname or did this only affect the GCP case? @cgwalters might know more about this.

LorbusChris commented 2 years ago

From the GCP output.log linked in the description:

g \u001b[0;1;39mCoreO[  240.887519] coreos-teardown-initramfs[890]: info: taking down network device: ens4
S Tear Down Init[  240.897382] NetworkManager[560]: <info>  [1647714345.6650] policy: set-hostname: current hostname was changed outside NetworkManager: 'localhost'
ramfs\u001b[0m...
[  240.911105] coreos-teardown-initramfs[902]: RTNETLINK answers: Operation not supported
[  240.920229] NetworkManager[560]: <info>  [1647714345.6651] policy: set-hostname: set hostname to 'ci-ln-d4jm7mb-72292-vd7b9-master-0.c.XXXXXXXXXXXXXXXXXXXXXX.internal' (from DHCPv4)
[\u001b[0;32m  OK  \u001b[[  240.936744] coreos-teardown-initramfs[890]: info: flushing all routing
0m] Stopped \u001b[0;[  240.944890] coreos-teardown-initramfs[890]: info: no initramfs hostname information to propagate
1;39mIgnition (d[  240.955207] coreos-teardown-initramfs[890]: info: no networking config is defined in the real root
isks)\u001b[0m.
[  240.965777] NetworkManager[560]: <warn>  [1647714345.6651] policy: set-hostname: couldn't set the system hostname to 'ci-ln-d4jm7mb-72292-vd7b9-master-0.c.XXXXXXXXXXXXXXXXXXXXXX.internal': (22) Invalid argument
[  240.988267] coreos-teardown-initramfs[890]: info: skipping propagation of default networking configs
[\u001b[0;32m  OK  \u001b[[  240.997891] NetworkManager[560]: <info>  [1647714345.6754] policy: set-hostname: current hostname was changed outside NetworkManager: 'localhost'
0m] Stopped \u001b[0;[  241.012549] NetworkManager[560]: <info>  [1647714345.6754] policy: set-hostname: set hostname to 'ci-ln-d4jm7mb-72292-vd7b9-master-0.c.XXXXXXXXXXXXXXXXXXXXXX.internal' (from DHCPv4)
1;39mIgnition (k[  241.030181] NetworkManager[560]: <warn>  [1647714345.6754] policy: set-hostname: couldn't set the system hostname to 'ci-ln-d4jm7mb-72292-vd7b9-master-0.c.XXXXXXXXXXXXXXXXXXXXXX.internal': (22) Invalid argument
args)\u001b[0m.
dustymabe commented 2 years ago

Since (IIUC) we can't really associate this with FCOS version numbers does anybody want to provide a rpm-ostree db diff between the two affected versions?

Does this happen if you do a fresh install (not upgrade) to okd:4.10.0-0.okd-2022-03-19-144548? I'm trying to figure out what is special about the upgrade that's triggering this.

LorbusChris commented 2 years ago

Sounds like this only happens on upgrades from the previous 4.10 release.

There are a lot of AVC denials in the linked logs, too. So, possibly SELinux is involved?

[ 9695.310318] audit: type=1400 audit(1647724444.357:3183): avc:  denied  { ioctl } for  pid=410212 comm=\"iptables\" path=\"/sys/fs/cgroup\" dev=\"cgroup2\" ino=1 scontext=system_u:system_r:iptables_t:s0 tcontext=system_u:object_r:cgroup_t:s0 tclass=dir permissive=0\r\n
dustymabe commented 2 years ago

Can someone check the nodes before AND after upgrade and see if the SELinux policy has been modified by running the following command:

sudo ostree admin config-diff | grep selinux/targeted/policy
LorbusChris commented 2 years ago

I haven't been able to get into an upgraded node, it seems network never comes up there. Will investigate further tomorrow. In the meantime, some logs from OKD machine-os 4.10.0-0.okd-2022-03-07-131213 (before upgrade). afterburn-sshkeys.service has failed there as well. I wonder what it's supposed to do, isn't ignition in charge of writing the keys?

sh-5.1# sudo ostree admin config-diff | grep selinux/targeted/policy
sh-5.1# rpm-ostree db diff
ostree diff commit from: rollback deployment (1372d0e8162cac5257828ecbb6ac1136c4229405781a763bde03fb3f16dc17d9)
ostree diff commit to:   booted deployment (c50b405770c03fc9671df06998b12cd4b98f8a80db20e25b59aabad7a8745bb5)
Upgraded:
  afterburn 5.1.0-1.fc35 -> 5.2.0-4.fc35
  afterburn-dracut 5.1.0-1.fc35 -> 5.2.0-4.fc35
  audit-libs 3.0.6-1.fc35 -> 3.0.7-1.fc35
  btrfs-progs 5.15.1-1.fc35 -> 5.16.1-1.fc35
  chrony 4.1-3.fc35 -> 4.2-1.fc35
  conmon 2:2.0.30-2.fc35 -> 2:2.1.0-2.fc35
  container-selinux 2:2.170.0-2.fc35 -> 2:2.173.1-1.fc35
  containernetworking-plugins 1.0.1-1.fc35 -> 1.0.1-3.fc35
  crun 1.4-1.fc35 -> 1.4.2-1.fc35
  cups-libs 1:2.3.3op2-12.fc35 -> 1:2.3.3op2-13.fc35
  dnsmasq 2.86-3.fc35 -> 2.86-4.fc35
  ethtool 2:5.15-1.fc35 -> 2:5.16-1.fc35
  expat 2.4.1-2.fc35 -> 2.4.3-1.fc35
  flatpak-session-helper 1.12.2-1.fc35 -> 1.12.4-1.fc35
  glib2 2.70.2-1.fc35 -> 2.70.3-2.fc35
  glibc 2.34-11.fc35 -> 2.34-25.fc35
  glibc-common 2.34-11.fc35 -> 2.34-25.fc35
  glibc-minimal-langpack 2.34-11.fc35 -> 2.34-25.fc35
  ignition 2.13.0-1.fc35 -> 2.13.0-3.fc35
  kernel 5.15.10-200.fc35 -> 5.16.7-200.fc35
  kernel-core 5.15.10-200.fc35 -> 5.16.7-200.fc35
  kernel-modules 5.15.10-200.fc35 -> 5.16.7-200.fc35
  libblkid 2.37.2-1.fc35 -> 2.37.3-1.fc35
  libfdisk 2.37.2-1.fc35 -> 2.37.3-1.fc35
  libibverbs 38.0-1.fc35 -> 38.1-2.fc35
  libipa_hbac 2.6.1-1.fc35 -> 2.6.3-1.fc35
  libmount 2.37.2-1.fc35 -> 2.37.3-1.fc35
  libsmartcols 2.37.2-1.fc35 -> 2.37.3-1.fc35
  libsmbclient 2:4.15.3-0.fc35 -> 2:4.15.5-0.fc35
  libsss_certmap 2.6.1-1.fc35 -> 2.6.3-1.fc35
  libsss_idmap 2.6.1-1.fc35 -> 2.6.3-1.fc35
  libsss_nss_idmap 2.6.1-1.fc35 -> 2.6.3-1.fc35
  libsss_sudo 2.6.1-1.fc35 -> 2.6.3-1.fc35
  libuuid 2.37.2-1.fc35 -> 2.37.3-1.fc35
  libwbclient 2:4.15.3-0.fc35 -> 2:4.15.5-0.fc35
  libxcrypt 4.4.27-1.fc35 -> 4.4.28-1.fc35
  libzstd 1.5.1-6.fc35 -> 1.5.2-1.fc35
  lua-libs 5.4.3-2.fc35 -> 5.4.4-1.fc35
  polkit 0.120-1.fc35 -> 0.120-1.fc35.1
  polkit-libs 0.120-1.fc35 -> 0.120-1.fc35.1
  rpm 4.17.0-1.fc35 -> 4.17.0-4.fc35
  rpm-libs 4.17.0-1.fc35 -> 4.17.0-4.fc35
  rpm-ostree 2022.1-2.fc35 -> 2022.2-1.fc35
  rpm-ostree-libs 2022.1-2.fc35 -> 2022.2-1.fc35
  rpm-plugin-selinux 4.17.0-1.fc35 -> 4.17.0-4.fc35
  runc 2:1.0.2-2.fc35 -> 2:1.1.0-1.fc35
  samba-client-libs 2:4.15.3-0.fc35 -> 2:4.15.5-0.fc35
  samba-common 2:4.15.3-0.fc35 -> 2:4.15.5-0.fc35
  samba-common-libs 2:4.15.3-0.fc35 -> 2:4.15.5-0.fc35
  selinux-policy 35.8-1.fc35 -> 35.13-1.fc35
  selinux-policy-targeted 35.8-1.fc35 -> 35.13-1.fc35
  shadow-utils 2:4.9-8.fc35 -> 2:4.9-9.fc35
  shadow-utils-subid 2:4.9-8.fc35 -> 2:4.9-9.fc35
  sssd-ad 2.6.1-1.fc35 -> 2.6.3-1.fc35
  sssd-client 2.6.1-1.fc35 -> 2.6.3-1.fc35
  sssd-common 2.6.1-1.fc35 -> 2.6.3-1.fc35
  sssd-common-pac 2.6.1-1.fc35 -> 2.6.3-1.fc35
  sssd-ipa 2.6.1-1.fc35 -> 2.6.3-1.fc35
  sssd-krb5 2.6.1-1.fc35 -> 2.6.3-1.fc35
  sssd-krb5-common 2.6.1-1.fc35 -> 2.6.3-1.fc35
  sssd-ldap 2.6.1-1.fc35 -> 2.6.3-1.fc35
  systemd 249.7-2.fc35 -> 249.9-1.fc35
  systemd-container 249.7-2.fc35 -> 249.9-1.fc35
  systemd-libs 249.7-2.fc35 -> 249.9-1.fc35
  systemd-pam 249.7-2.fc35 -> 249.9-1.fc35
  systemd-resolved 249.7-2.fc35 -> 249.9-1.fc35
  systemd-udev 249.7-2.fc35 -> 249.9-1.fc35
  util-linux 2.37.2-1.fc35 -> 2.37.3-1.fc35
  util-linux-core 2.37.2-1.fc35 -> 2.37.3-1.fc35
  vim-data 2:8.2.4068-1.fc35 -> 2:8.2.4232-1.fc35
  vim-minimal 2:8.2.4068-1.fc35 -> 2:8.2.4232-1.fc35
Removed:
  containerd-1.5.8-1.fc35.x86_64
  moby-engine-20.10.11-1.fc35.x86_64
  pigz-2.5-2.fc35.x86_64
  zincati-0.0.24-1.fc35.x86_64
Added:
  NetworkManager-ovs-1:1.32.12-2.fc35.x86_64
  cri-o-1.23.0-2.1.fc35.x86_64
  cri-tools-1.21.0-.fc34.2.1.x86_64
  dpdk-2:20.11-2.fc35.x86_64
  glusterfs-9.5-1.fc35.x86_64
  glusterfs-client-xlators-9.5-1.fc35.x86_64
  glusterfs-fuse-9.5-1.fc35.x86_64
  hwdata-0.356-1.fc35.noarch
  libdrm-2.4.109-1.fc35.x86_64
  libgfrpc0-9.5-1.fc35.x86_64
  libgfxdr0-9.5-1.fc35.x86_64
  libglusterfs0-9.5-1.fc35.x86_64
  libmspack-0.10.1-0.6.alpha.fc35.x86_64
  libpciaccess-0.16-5.fc35.x86_64
  libtool-ltdl-2.4.6-42.fc35.x86_64
  liburing-2.0-2.fc35.x86_64
  libxslt-1.1.34-6.fc35.x86_64
  mpdecimal-2.5.1-2.fc35.x86_64
  open-vm-tools-11.3.0-2.fc35.x86_64
  openshift-clients-0.0.1-1.el8.x86_64
  openshift-hyperkube-1.23.3-1.0.759c22b.x86_64
  openvswitch-2.15.0-7.fc35.x86_64
  pciutils-3.7.0-4.fc35.x86_64
  pciutils-libs-3.7.0-4.fc35.x86_64
  python-pip-wheel-21.2.3-4.fc35.noarch
  python-setuptools-wheel-57.4.0-1.fc35.noarch
  python3-libs-3.10.2-1.fc35.x86_64
  qemu-guest-agent-2:6.1.0-10.fc35.x86_64
  sssd-nfs-idmap-2.6.3-1.fc35.x86_64
  unbound-libs-1.13.2-1.fc35.x86_64
  xmlsec1-1.2.29-4.fc35.x86_64
  xmlsec1-openssl-1.2.29-4.fc35.x86_64
sh-5.1# journalctl -xeu afterburn-sshkeys@core.service
░░ A start job for unit afterburn-sshkeys@core.service has begun execution.
░░ 
░░ The job identifier is 268.
Mar 23 18:15:23 localhost afterburn[834]: Mar 23 18:15:23.738 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/sshKeys: Attempt #1
Mar 23 18:15:23 localhost afterburn[834]: Mar 23 18:15:23.744 INFO Failed to fetch: error sending request for url (http://169.254.169.254/computeMetadata/v1/instance/attr>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.744 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/sshKeys>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.746 INFO Fetch failed with 404: resource not found
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.747 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/ssh-key>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.748 INFO Fetch failed with 404: resource not found
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.748 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/block-p>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.749 INFO Fetch failed with 404: resource not found
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.749 INFO Fetching http://169.254.169.254/computeMetadata/v1/project/attributes/sshKeys:>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.750 INFO Fetch successful
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ A start job for unit afterburn-sshkeys@core.service has begun execution.
░░ 
░░ The job identifier is 268.
Mar 23 18:15:23 localhost afterburn[834]: Mar 23 18:15:23.738 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/sshKeys: Attempt #1
Mar 23 18:15:23 localhost afterburn[834]: Mar 23 18:15:23.744 INFO Failed to fetch: error sending request for url (http://169.254.169.254/computeMetadata/v1/instance/attr>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.744 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/sshKeys>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.746 INFO Fetch failed with 404: resource not found
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.747 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/ssh-key>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.748 INFO Fetch failed with 404: resource not found
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.748 INFO Fetching http://169.254.169.254/computeMetadata/v1/instance/attributes/block-p>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.749 INFO Fetch failed with 404: resource not found
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.749 INFO Fetching http://169.254.169.254/computeMetadata/v1/project/attributes/sshKeys:>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.750 INFO Fetch successful
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.750 INFO Fetching http://169.254.169.254/computeMetadata/v1/project/attributes/ssh-keys>
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Mar 23 18:15:24.751 INFO Fetch successful
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Error: failed to run
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]: Caused by:
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]:     0: writing ssh keys
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 afterburn[834]:     1: invalid base64: Invalid last symbol 86, offset 366.
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 systemd[1]: afterburn-sshkeys@core.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ An ExecStart= process belonging to unit afterburn-sshkeys@core.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 systemd[1]: afterburn-sshkeys@core.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ The unit afterburn-sshkeys@core.service has entered the 'failed' state with result 'exit-code'.
Mar 23 18:15:24 ci-ln-2tv45wb-72292-n9j2j-master-1 systemd[1]: Failed to start Afterburn (SSH Keys).
░░ Subject: A start job for unit afterburn-sshkeys@core.service has failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ A start job for unit afterburn-sshkeys@core.service has finished with a failure.
░░ 
░░ The job identifier is 268 and the job result is failed.
bgilbert commented 2 years ago

afterburn-sshkeys.service has failed there as well. I wonder what it's supposed to do, isn't ignition in charge of writing the keys?

Ignition writes keys from the Ignition config. afterburn-sshkeys@.service writes keys specified via the cloud's SSH key mechanism.

vrutkovs commented 2 years ago

Invalid permissions for /etc/openvswitch were breaking networking