coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/
146 stars 30 forks source link

Nodes get stuck while rebooting #2337

Open lucab opened 6 years ago

lucab commented 6 years ago

Reproduction Steps

This is behavior has been observed under CI workloads:

  1. A Kubernetes master node is spawned and provisioned via ignition
  2. Tectonic CI continuosly poll via SSH to gather node status
  3. Docker and kubelet are provisioned via tectonic-torcx
  4. A reboot (via systemd DBUS API) is triggered by tectonic-torcx to pick the appropriate docker runtime version
  5. Node never completes the reboot. A manual hard-reboot is required.

Environment

Stable (1576.5.0) on both AWS and Azure. Tectonic CI seemingly also triggered this on Packet, but it has not been logged so far.

Logs

Both Azure and AWS recorded similar logs, where most of the services are stopped and the last log entry is from a failed systemd user session. I'm not sure whethere the system deadlocks there, or that just happens to be the last line going through the console.

Excerpt:

[  OK  ] Stopped target Verify torcx succeeded.
[  OK  ] Closed OpenSSH Server Socket.
[  OK  ] Stopped Network Service.
[  OK  ] Closed Network Service Netlink Socket.
[  OK  ] Started Activation of LVM2 logical volumes.
         Starting Clean up broken links in /etc/ssl/certs...
[  OK  ] Started Clean up broken links in /etc/ssl/certs.
         Starting Update CA bundle at /etc/ssl/certs/ca-certificates.crt...
[  OK  ] Started Update CA bundle at /etc/ssl/certs/ca-certificates.crt.
[  320.669257] kauditd_printk_skb: 41 callbacks suppressed
[  320.669259] audit: type=1130 audit(1517243031.533:151): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=update-ca-certificates comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success'
         [  320.790646] audit: type=1131 audit(1517243031.533:152): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=update-ca-certificates comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success'
Starting User Manager for UID 500...
[FAILED] Failed to start User Manager for UID 500.
See 'systemctl status user@500.service' for details.
[  320.943612] audit: type=1130 audit(1517243031.806:153): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=user@500 comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=failed'
AWS reboot logs ``` This is ip-10-0-42-31 (Linux x86_64 4.14.11-coreos) 13:43:40 SSH host key: SHA256:Jv1qX3bBPmt4/ALZew/6UCXHhfoEcO27u5wOIgjibeg (ECDSA) SSH host key: SHA256:7sNcqdzlBN3OtXWTL+cbQyx/IbMP0djS7BBB9oOysS8 (RSA) SSH host key: SHA256:HIH0sbN9euy6EMYBPzLbGPhxBuu4FQh7sKcz2rY5kHc (ED25519) SSH host key: SHA256:jmE++Pa0p6fCCFNxSsFjCBscqA4FOuV9DF8BlRf1yss (DSA) eth0: 10.0.42.31 fe80::803:daff:fe5b:1328 ip-10-0-42-31 login: [ OK ] Listening on udev Control Socket. Starting Create missing system files... [ OK ] Reached target Final Step. Stopping D-Bus System Message Bus... Stopping Docker Application Container Engine... [ OK ] Removed slice system-systemd\x2dfsck.slice. [ OK ] Stopped target Login Prompts. [ OK ] Removed slice system-disk\x2duuid.slice. Stopping Login Service... Stopping NFS status monitor for NFSv2/3 locking.... Stopping Update Engine... Unmounting /run/docker/netns/8c8ceb75d4f6... [ OK ] Stopped target Load user-provided cloud configs. [ OK ] Stopped target Load system-provided cloud configs. [ OK ] Closed Process Core Dump Socket. Stopping Authorization Manager... Unmounting /var/lib/docker/plugins... Stopping Getty on tty1... [ OK ] Stopped target Timers. [ OK ] Stopped Daily Log Rotation. [ OK ] Stopped Periodic Garbage Collection for rkt. [ OK ] Stopped Daily Cleanup of Tempor[ 100.981256] kauditd_printk_skb: 28 callbacks suppressed [ 100.981258] audit: type=1305 audit(1516715082.794:103): auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 op=remove_rule key=(null) list=5 res=1 [ OK ] Stopped Getty on tty1. [ 101.005202] audit: type=1130 audit(1516715082.818:104): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=getty@tty1 comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 101.016697] docker0: port 1(veth8eed415) entered disabled state [ 101.019562] audit: type=1131 audit(1516715082.829:105): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=getty@tty1 comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success' [ OK ] Stopped Update Engine. [ 101.035191] audit: type=1130 audit(1516715082.847:106): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=update-engine comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=failed' [ 101.038332] veth9086aa2: renamed from eth0 [ OK ] Stopped D-Bus System Message Bus. [ 101.052305] audit: type=1130 audit(1516715082.864:107): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=dbus comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 101.065190] audit: type=1131 audit(1516715082.864:108): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=dbus comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success' [ OK ] Stopped Authorization Manager. [ 101.081468] audit: type=1130 audit(1516715082.894:109): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=polkit comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 101.091858] audit: type=1131 audit(1516715082.894:110): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=polkit comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success' [ OK ] Stopped Serial Getty on ttyS0. [ 101.108667] audit: type=1130 audit(1516715082.921:111): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=serial-getty@ttyS0 comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 101.113872] docker0: port 1(veth8eed415) entered disabled state [ 101.123479] audit: type=1131 audit(1516715082.921:112): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=serial-getty@ttyS0 comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success' [[ 101.135000] device veth8eed415 left promiscuous mode [ 101.138097] docker0: port 1(veth8eed415) entered disabled state  OK ] Stopped Login Service. [ OK ] Started Create missing system files. [ OK ] Stopped Load Security Auditing Rules. [FAILED] Failed unmounting /run/torcx/unpack. [FAILED] Failed unmounting /run/docker/netns/8c8ceb75d4f6. [FAILED] Failed unmounting /var/lib/docker/plugins. [FAILED] Failed unmounting /var/lib/docker/c…b732f96441f4d9d499ad586473f59/shm. [FAILED] Failed unmounting /var/lib/docker/o…b6806a1812776c86aa22ecce57/merged. [FAILED] Failed unmounting /run/user/500. [FAILED] Failed unmounting /run/docker/netns/default. Unmounting /var/lib/docker/overlay2... [ OK ] Removed slice system-serial\x2dgetty.slice. [ OK ] Removed slice system-getty.slice. Stopping Permit User Sessions... [ OK ] Unmounted /var/lib/docker/overlay2. [ OK ] Stopped Permit User Sessions. [ OK ] Stopped target Remote File Systems. [ OK ] Stopped Bootstrap Kubernetes Node Components. [ OK ] Stopped Download Tectonic Assets. [ OK ] Stopped Docker Application Container Engine. Starting Clean docker datadir for torcx changes... Stopping Containerd Container Daemon... [ OK ] Stopped target Network is Online. [ OK ] Stopped Wait for Network to be Configured. Stopping Network Name Resolution... [ OK ] Stopped Containerd Container Daemon. [ OK ] Stopped Network Name Resolution. [ OK ] Stopped target Network. [ OK ] Stopped target Basic System. [ OK ] Stopped target Paths. [ OK ] Stopped Trigger for rm-assets.service. [ OK ] Stopped Forward Password Requests to Wall Directory Watch. [ OK ] Stopped Trigger for tectonic.service. [ OK ] Stopped Watch for update engine configuration changes. [ OK ] Stopped Watch for a cloud-config at /var/lib/coreos-install/user_data. [ OK ] Stopped Trigger for bootkube.service. [ OK ] Stopped target Sockets. [ OK ] Closed rkt metadata service socket. [ OK ] Closed OpenSSH Server Socket. Starting Activation of LVM2 logical volumes... [ OK ] Closed D-Bus System Message Bus Socket. [ OK ] Closed /dev/initctl Compatibility Named Pipe. [ OK ] Closed Docker Socket for the API. [ OK ] Stopped target Verify torcx succeeded. [ OK ] Stopped target Slices. [ OK ] Created slice User Slice of core. Stopping Network Service... [ OK ] Stopped Network Service. [ OK ] Closed Network Service Netlink Socket. [ OK ] Started Activation of LVM2 logical volumes. Starting Activation of LVM2 logical volumes... [ OK ] Started Activation of LVM2 logical volumes. Starting Clean up broken links in /etc/ssl/certs... [ OK ] Started Clean up broken links in /etc/ssl/certs. Starting Update CA bundle at /etc/ssl/certs/ca-certificates.crt... [ OK ] Started Clean docker datadir for torcx changes. [ OK ] Started Update CA bundle at /etc/ssl/certs/ca-certificates.crt. Starting User Manager for UID 500... [FAILED] Failed to start User Manager for UID 500. See 'systemctl status user@500.service' for details. ```
Azure reboot logs ``` [ OK ] Listening on udev Control Socket. [ OK ] Listening on udev Kernel Socket. Starting Create missing system files... Starting Activation of LVM2 logical volumes... Unmounting /var/lib/docker/containe…e8677778a4f01e9203f10124518/shm... Stopping Load Security Auditing Rules... Unmounting /mnt/resource... [ OK ] Reached target Final Step. Unmounting /var/lib/docker/plugins... Stopping OpenSSH per-connection server daemon... Unmounting /run/torcx/unpack... [ OK ] Closed Process Core Dump Socket. [ OK ] Removed slice system-systemd\x2dfsck.slice. Unmounting /var/lib/docker/overlay2…233df85afa0c71938c20edd0/merged... [ OK ] Stopped target Timers. [ OK ] Stopped Daily Cleanup of Temporary Directories. [ OK [ 314.912411] kauditd_printk_skb: 3 callbacks suppressed [ 314.912412] audit: type=1305 audit(1517243025.609:102): auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 op=remove_rule key=(null) list=5 res=1 Stopping Serial Getty on ttyS0... Stopping Getty on tty1... 2018/01/29 16:23:45.878801 INFO Agent WALinuxAgent-2.2.4 forwarding signal 15 to WALinuxAgent-2.2.20 Stopping Microsoft Azure Agent... Stopping D-Bus System Message Bus... [ OK ] Stopped Daily Log Rotation.[ 315.104038] docker0: port 1(veth0f19e96) entered disabled state [ OK ] [ 315.140982] veth29f1a92: renamed from eth0 Stopped target Load user-provided cloud configs. [ OK ] Stopped target Load system-provided cloud configs. [ OK ] Stopped target Containers. Stopping Docker Application Container Engine... [ OK ] Stopped Login Service. [ 315.272411] audit: type=1130 audit(1517243026.135:103): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=systemd-logind comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success' [ OK ] Stopped Update Engine. [ 315.382403] audit: type=1131 audit(1517243026.135:104): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=systemd-logind comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 315.474514] audit: type=1130 audit(1517243026.244:105): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=update-engine comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=failed' [ 315.499311] docker0: port 1(veth0f19e96) entered disabled state [ 315.562977] device veth0f19e96 left promiscuous mode [ 315.562984] docker0: port 1(veth0f19e96) entered disabled state [ 315.563012] audit: type=1700 audit(1517243026.362:106): dev=veth0f19e96 prom=0 old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295 [ OK ] Stopped D-Bus System Message Bus. [ 315.705128] audit: type=1300 audit(1517243026.362:106): arch=c000003e syscall=44 success=yes exit=32 a0=b a1=c4208098a0 a2=20 a3=0 items=0 ppid=1 pid=3322 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="dockerd" exe="/run/torcx/unpack/docker/bin/dockerd" subj=system_u:system_r:kernel_t:s0 key=(null) [ 315.845266] audit: type=1327 audit(1517243026.362:106): proctitle=2F72756E2F746F7263782F62696E2F646F636B657264002D2D686F73743D66643A2F2F002D2D636F6E7461696E6572643D2F7661722F72756E2F646F636B65722F6C6962636F6E7461696E6572642F646F636B65722D636F6E7461696E6572642E736F636B002D2D73656C696E75782D656E61626C65643D74727565002D2D6C [ 315.974437] audit: type=1130 audit(1517243026.568:107): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=dbus comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 316.065643] audit: type=1131 audit(1517243026.568:108): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=dbus comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success' [ OK ] Stopped Serial Getty on ttyS0. [ 316.200221] audit: type=1130 audit(1517243027.061:109): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=serial-getty@ttyS0 comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success' [ OK ] Stopped Getty on tty1. [ OK ] Stopped NFS status monitor for NFSv2/3 locking.. [ OK ] Stopped Microsoft Azure Agent. [ OK ] Stopped OpenSSH per-connection server daemon. [ OK ] Stopped OpenSSH per-connection server daemon. [ OK ] Started Create missing system files. [ OK ] Started Activation of LVM2 logical volumes. [ OK ] Unmounted /var/lib/docker/container…99e8677778a4f01e9203f10124518/shm. [ OK ] Stopped Load Security Auditing Rules. [ OK ] Unmounted /mnt/resource. [ OK ] Unmounted /var/lib/docker/plugins. [FAILED] Failed unmounting /run/torcx/unpack. [ OK ] Unmounted /var/lib/docker/overlay2/…2b233df85afa0c71938c20edd0/merged. [ OK ] Unmounted /run/docker/netns/6d7ec029b4ab. [ OK ] Unmounted /run/user/500. [ OK ] Unmounted /var/lib/docker/overlay2. [ OK ] Stopped Docker Application Container Engine. [ OK ] Stopped Bootstrap Kubernetes Node Components. Stopping Containerd Container Daemon... Starting Clean docker datadir for torcx changes... [ OK ] Removed slice system-sshd.slice. [ OK ] Stopped target Network is Online. [ OK ] Stopped Wait for Network to be Configured. [ OK ] Stopped Generate sshd host keys. [ OK ] Stopped target RPC Port Mapper. Stopping RPC Bind... [ OK ] Stopped target Host and Network Name Lookups. Stopping Network Name Resolution... [ OK ] Removed slice system-getty.slice. Stopping Permit User Sessions... [ OK ] Removed slice system-serial\x2dgetty.slice. [ OK ] Stopped Containerd Container Daemon. [ OK ] Stopped Network Name Resolution. [ OK ] Stopped RPC Bind. [ OK ] Started Clean docker datadir for torcx changes. [ OK ] Stopped Permit User Sessions. [ OK ] Stopped target Remote File Systems. [ OK ] Stopped target Network. [ OK ] Stopped Turn off TX checksum offloading. Stopping Network Service... [ OK ] Stopped target Basic System. [ OK ] Stopped target Paths. [ OK ] Stopped Forward Password Requests to Wall Directory Watch. [ OK ] Stopped Watch for a cloud-config at /var/lib/coreos-install/user_data. [ OK ] Stopped Watch for update engine configuration changes. Starting Activation of LVM2 logical volumes... [ OK ] Stopped Trigger for tectonic.service. [ OK ] Stopped Trigger for bootkube.service. [ OK ] Stopped target Slices. [ OK ] Created slice User Slice of core. [ OK ] Started Session 17 of user core. [ OK ] Stopped target Sockets. [ OK ] Closed rkt metadata service socket. [ OK ] Closed D-Bus System Message Bus Socket. [ OK ] Closed /dev/initctl Compatibility Named Pipe. [ OK ] Closed Docker Socket for the API. [ OK ] Stopped target Verify torcx succeeded. [ OK ] Closed OpenSSH Server Socket. [ OK ] Stopped Network Service. [ OK ] Closed Network Service Netlink Socket. [ OK ] Started Activation of LVM2 logical volumes. Starting Clean up broken links in /etc/ssl/certs... [ OK ] Started Clean up broken links in /etc/ssl/certs. Starting Update CA bundle at /etc/ssl/certs/ca-certificates.crt... [ OK ] Started Update CA bundle at /etc/ssl/certs/ca-certificates.crt. [ 320.669257] kauditd_printk_skb: 41 callbacks suppressed [ 320.669259] audit: type=1130 audit(1517243031.533:151): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=update-ca-certificates comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 320.790646] audit: type=1131 audit(1517243031.533:152): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=update-ca-certificates comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success' Starting User Manager for UID 500... [FAILED] Failed to start User Manager for UID 500. See 'systemctl status user@500.service' for details. [ 320.943612] audit: type=1130 audit(1517243031.806:153): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=user@500 comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=failed' ```
yvsssantosh commented 6 years ago

Hi, I'm working on bare-metal with virtual VM's and even I have the same problem

screen shot 2018-03-27 at 2 56 20 am