Azure / aks-engine

AKS Engine: legacy tool for Kubernetes on Azure (see status)
https://github.com/Azure/aks-engine
MIT License
1.03k stars 522 forks source link

kubelet doesn't start after reboot #1092

Closed jackfrancis closed 5 years ago

jackfrancis commented 5 years ago

Observed in a cluster built from master:

azureuser@k8s-agentpool1-29898149-vmss000000:~$ ps auxfww
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         2  0.0  0.0      0     0 ?        S    16:23   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        I    16:23   0:00  \_ [kworker/0:0]
root         4  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [kworker/0:0H]
root         5  0.0  0.0      0     0 ?        I    16:23   0:00  \_ [kworker/u4:0]
root         6  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [mm_percpu_wq]
root         7  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [ksoftirqd/0]
root         8  0.0  0.0      0     0 ?        I    16:23   0:00  \_ [rcu_sched]
root         9  0.0  0.0      0     0 ?        I    16:23   0:00  \_ [rcu_bh]
root        10  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [migration/0]
root        11  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [watchdog/0]
root        12  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [cpuhp/0]
root        13  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [cpuhp/1]
root        14  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [watchdog/1]
root        15  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [migration/1]
root        16  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [ksoftirqd/1]
root        17  0.0  0.0      0     0 ?        I    16:23   0:00  \_ [kworker/1:0]
root        18  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [kworker/1:0H]
root        19  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [kdevtmpfs]
root        20  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [netns]
root        21  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [rcu_tasks_kthre]
root        22  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [kauditd]
root        25  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [khungtaskd]
root        26  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [oom_reaper]
root        27  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [writeback]
root        28  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [kcompactd0]
root        29  0.0  0.0      0     0 ?        SN   16:23   0:00  \_ [ksmd]
root        30  0.0  0.0      0     0 ?        SN   16:23   0:00  \_ [khugepaged]
root        31  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [crypto]
root        32  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [kintegrityd]
root        33  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [kblockd]
root        34  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [ata_sff]
root        35  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [md]
root        36  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [edac-poller]
root        37  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [ib-comp-wq]
root        38  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [ib_mcast]
root        39  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [ib_nl_sa_wq]
root        40  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [hv_vmbus_con]
root        41  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [hv_pri_chan]
root        42  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [hv_sub_chan]
root        43  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [devfreq_wq]
root        44  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [watchdogd]
root        45  0.0  0.0      0     0 ?        I    16:23   0:00  \_ [kworker/u4:1]
root        47  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [kswapd0]
root        48  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [kworker/u5:0]
root        49  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [ecryptfs-kthrea]
root        91  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [kthrotld]
root        92  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [nfit]
root        93  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [scsi_eh_0]
root        94  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [scsi_tmf_0]
root        95  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [storvsc_error_w]
root        96  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [scsi_eh_1]
root        97  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [scsi_tmf_1]
root        98  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [storvsc_error_w]
root        99  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [scsi_eh_2]
root       100  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [scsi_tmf_2]
root       101  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [storvsc_error_w]
root       102  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [scsi_eh_3]
root       103  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [scsi_tmf_3]
root       104  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [storvsc_error_w]
root       106  0.0  0.0      0     0 ?        I    16:23   0:00  \_ [kworker/u4:3]
root       108  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [kworker/0:1H]
root       110  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [kworker/1:1H]
root       111  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [nvme-wq]
root       112  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [scsi_eh_4]
root       113  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [scsi_tmf_4]
root       114  0.0  0.0      0     0 ?        S    16:23   0:00  \_ [scsi_eh_5]
root       115  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [scsi_tmf_5]
root       116  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [mlx4]
root       117  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [rdma_cm]
root       118  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [mlx4_ib]
root       119  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [mlx4_ib_mcg]
root       120  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [ipoib_flush]
root       124  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [ipv6_addrconf]
root       133  0.0  0.0      0     0 ?        I<   16:23   0:00  \_ [kstrp]
root       191  0.0  0.0      0     0 ?        I    16:23   0:00  \_ [kworker/1:2]
root       305  0.0  0.0      0     0 ?        I<   16:24   0:00  \_ [raid5wq]
root       358  0.0  0.0      0     0 ?        S    16:24   0:00  \_ [jbd2/sda1-8]
root       359  0.0  0.0      0     0 ?        I<   16:24   0:00  \_ [ext4-rsv-conver]
root       440  0.0  0.0      0     0 ?        I<   16:24   0:00  \_ [rpciod]
root       441  0.0  0.0      0     0 ?        I<   16:24   0:00  \_ [xprtiod]
root       445  0.0  0.0      0     0 ?        I<   16:24   0:00  \_ [iscsi_eh]
root       502  0.0  0.0      0     0 ?        S    16:24   0:00  \_ [hv_balloon]
root       558  0.0  0.0      0     0 ?        I<   16:24   0:00  \_ [mlx4_health]
root       565  0.0  0.0      0     0 ?        I    16:24   0:00  \_ [kworker/0:3]
root       592  0.0  0.0      0     0 ?        I<   16:24   0:00  \_ [mlx4_en]
root      1083  0.0  0.0      0     0 ?        S    16:24   0:00  \_ [jbd2/sdb1-8]
root      1084  0.0  0.0      0     0 ?        I<   16:24   0:00  \_ [ext4-rsv-conver]
root      1226  0.0  0.0      0     0 ?        I    16:24   0:00  \_ [kworker/1:4]
root      1351  0.0  0.0      0     0 ?        I    16:24   0:00  \_ [kworker/0:4]
root      2575  0.0  0.0      0     0 ?        I    16:30   0:00  \_ [kworker/u4:2]
root         1  0.8  0.0  38112  6064 ?        Ss   16:23   0:03 /sbin/init
root       444  0.1  0.2  52076 18084 ?        Ss   16:24   0:00 /lib/systemd/systemd-journald
root       452  0.0  0.0  94772  1692 ?        Ss   16:24   0:00 /sbin/lvmetad -f
root       465  0.0  0.0  11360  1872 ?        Ss   16:24   0:00 /usr/lib/linux-tools/4.15.0-1041-azure/hv_kvp_daemon -n
root       486  0.0  0.0  42440  3888 ?        Ss   16:24   0:00 /lib/systemd/systemd-udevd
systemd+   621  0.0  0.0 100324  2476 ?        Ssl  16:24   0:00 /lib/systemd/systemd-timesyncd
root       954  0.0  0.0  16124   860 ?        Ss   16:24   0:00 /sbin/dhclient -1 -v -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases eth0
root      1090  0.0  0.0   5220   148 ?        Ss   16:24   0:00 /sbin/iscsid
root      1091  0.0  0.0   5720  3504 ?        S<Ls 16:24   0:00 /sbin/iscsid
root      1096  0.0  0.0   4356   636 ?        Ss   16:24   0:00 /usr/lib/linux-tools/4.15.0-1041-azure/hv_vss_daemon -n
root      1100  0.0  0.0  20096  1172 ?        Ss   16:24   0:00 /lib/systemd/systemd-logind
root      1106  0.0  0.0  27728  2920 ?        Ss   16:24   0:00 /usr/sbin/cron -f
root      1107  0.2  0.8 648844 61800 ?        Ssl  16:24   0:00 /usr/bin/dockerd -H fd:// --storage-driver=overlay2 --bip=172.17.0.1/16
root      1324  0.1  0.4 413748 29548 ?        Ssl  16:24   0:00  \_ containerd --config /var/run/docker/containerd/containerd.toml --log-level info
root      1108  0.0  0.0 274488  6300 ?        Ssl  16:24   0:00 /usr/lib/accountsservice/accounts-daemon
syslog    1109  0.0  0.0 260628  3212 ?        Ssl  16:24   0:00 /usr/sbin/rsyslogd -n
message+  1112  0.0  0.0  42896  3784 ?        Ss   16:24   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root      1128  0.0  0.2  70156 20468 ?        Ss   16:24   0:00 /usr/bin/python3 -u /usr/sbin/waagent -daemon
root      1354  0.8  0.3 223480 25496 ?        Sl   16:24   0:03  \_ python3 -u bin/WALinuxAgent-2.2.38-py2.7.egg -run-exthandlers
root      1129  0.0  0.0  95368  1384 ?        Ssl  16:24   0:00 /usr/bin/lxcfs /var/lib/lxcfs/
root      1130  0.0  0.0   4396  1264 ?        Ss   16:24   0:00 /usr/sbin/acpid
unscd     1131  0.0  0.0  14964  1604 ?        Ss   16:24   0:00 /usr/sbin/nscd -d
daemon    1132  0.0  0.0  26044  2180 ?        Ss   16:24   0:00 /usr/sbin/atd -f
root      1178  0.0  0.0 268632  5712 ?        Ssl  16:24   0:00 /usr/lib/policykit-1/polkitd --no-debug
root      1179  0.0  0.0   4924   112 ?        Ss   16:24   0:00 /sbin/mdadm --monitor --pid-file /run/mdadm/monitor.pid --daemonise --scan --syslog
root      1203  0.0  0.0  59212  5076 ?        Ss   16:24   0:00 /usr/sbin/sshd -D
root      2639  0.0  0.0  92796  6888 ?        Ss   16:31   0:00  \_ sshd: azureuser [priv]
azureus+  2677  0.0  0.0  92796  3456 ?        S    16:31   0:00      \_ sshd: azureuser@pts/0
azureus+  2678  1.0  0.0  13028  4832 pts/0    Ss   16:31   0:00          \_ -bash
azureus+  2698  0.0  0.0  27772  2992 pts/0    R+   16:31   0:00              \_ ps auxfww
root      1268  0.0  0.0  19472   228 ?        Ss   16:24   0:00 /usr/sbin/irqbalance --pid=/var/run/irqbalance.pid
root      1278  0.0  0.0   6208   680 tty1     Ss+  16:24   0:00 /sbin/agetty --noclear tty1 linux
root      1282  0.0  0.0   6024   816 ttyS0    Ss+  16:24   0:00 /sbin/agetty --keep-baud 115200 38400 9600 ttyS0 vt220
azureus+  2641  0.0  0.0  36832  4420 ?        Ss   16:31   0:00 /lib/systemd/systemd --user
azureus+  2642  0.0  0.0  61564  2280 ?        S    16:31   0:00  \_ (sd-pam)
root@k8s-agentpool1-29898149-vmss000000:/home/azureuser# systemctl start kubelet
Warning: kubelet.service changed on disk. Run 'systemctl daemon-reload' to reload units.
root@k8s-agentpool1-29898149-vmss000000:/home/azureuser# systemctl daemon-reload
root@k8s-agentpool1-29898149-vmss000000:/home/azureuser# systemctl start kubelet
jackfrancis commented 5 years ago

kubelet is systemctl-enabled, as expected:

azureuser@k8s-agentpool1-29898149-vmss000000:~$ systemctl list-unit-files | grep kubelet
kubelet-monitor.service                    static  
kubelet.service                            enabled 
kubelet-monitor.timer                      disabled
jackfrancis commented 5 years ago

This may be a regression due to #1073:

systemctl status kubelet
● kubelet.service - Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Active: failed (Result: start-limit-hit) since Thu 2019-04-18 18:42:20 UTC; 14min ago
  Process: 1437 ExecStartPre=/sbin/sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1 (code=exited, status=255)
  Process: 1434 ExecStartPre=/bin/mount --make-shared /var/lib/kubelet (code=exited, status=0/SUCCESS)
  Process: 1427 ExecStartPre=/bin/bash -c if [ $(mount | grep "/var/lib/kubelet" | wc -l) -le 0 ] ; then /bin/mount --bind /var/lib/kubelet /var/lib/kubelet ; fi (code=exited, statu
  Process: 1425 ExecStartPre=/bin/mkdir -p /var/lib/cni (code=exited, status=0/SUCCESS)
  Process: 1422 ExecStartPre=/bin/mkdir -p /var/lib/kubelet (code=exited, status=0/SUCCESS)
  Process: 1418 ExecStartPre=/bin/bash /opt/azure/containers/kubelet.sh (code=exited, status=0/SUCCESS)

Apr 18 18:42:19 k8s-agentpool1-30831010-vmss000000 systemd[1]: kubelet.service: Unit entered failed state.
Apr 18 18:42:19 k8s-agentpool1-30831010-vmss000000 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Apr 18 18:42:20 k8s-agentpool1-30831010-vmss000000 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Apr 18 18:42:20 k8s-agentpool1-30831010-vmss000000 systemd[1]: Stopped Kubelet.
Apr 18 18:42:20 k8s-agentpool1-30831010-vmss000000 systemd[1]: kubelet.service: Start request repeated too quickly.
Apr 18 18:42:20 k8s-agentpool1-30831010-vmss000000 systemd[1]: Failed to start Kubelet.
Apr 18 18:42:20 k8s-agentpool1-30831010-vmss000000 systemd[1]: kubelet.service: Unit entered failed state.
Apr 18 18:42:20 k8s-agentpool1-30831010-vmss000000 systemd[1]: kubelet.service: Failed with result 'start-limit-hit'.
jackfrancis commented 5 years ago

Confirmed, will revert #1073.