lseelenbinder opened this issue 6 years ago
Thanks for the report! A couple of initial questions to get out of the way: what versions of Kubernetes, Container Linux, and CLUO are you running when this issue happens? Is the node updating when the reboot is triggered, or was it somehow manually triggered? Is it vanilla Kubernetes, or something like Tectonic? Also, can you post the DaemonSet definition for the agent you are using on this cluster?
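If it helps, the full definition can be pulled straight from the cluster with something like this (assuming the stock namespace and object name from the CLUO manifests):

kubectl -n reboot-coordinator get daemonset container-linux-update-agent -o yaml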
It looks like the agent got to the point of sending a reboot request to systemd over dbus. That dbus method call is non-blocking, so the agent just goes to sleep for 7 days waiting for the reboot to occur. It seems like the dbus call is somehow failing or hanging (the response from the call is not checked).
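If you still have a worker stuck in this state, one quick check would be to make the equivalent logind call by hand on the host and see whether the machine actually reboots. Roughly something like the following (this goes straight to logind over D-Bus and bypasses the agent; treat it as a sketch rather than exactly what the agent runs):

# Run as root on an affected worker; asks logind to reboot the machine.
# If nothing happens here either, the problem is in logind/dbus rather than the agent.
busctl call org.freedesktop.login1 /org/freedesktop/login1 \
  org.freedesktop.login1.Manager Reboot b false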
Since it is happening on all the worker nodes, it might be something with the worker configuration. Maybe you can also post the Container Linux Config you used to provision the node, along with the version of ct you used (just the raw ignition file would be fine too).
Kubernetes: 1.9.3, vanilla setup using Typhoon
Container Linux: upgrading from 1632.3.0 -> 1688.5.3
CLUO: v0.6.0

DaemonSet:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"name":"container-linux-update-agent","namespace":"reboot-coordinator"},"spec":{"selector":{"matchLabels":{"app":"container-linux-update-agent"}},"template":{"metadata":{"labels":{"app":"container-linux-update-agent"}},"spec":{"containers":[{"command":["/bin/update-agent"],"env":[{"name":"UPDATE_AGENT_NODE","valueFrom":{"fieldRef":{"fieldPath":"spec.nodeName"}}},{"name":"POD_NAMESPACE","valueFrom":{"fieldRef":{"fieldPath":"metadata.namespace"}}}],"image":"quay.io/coreos/container-linux-update-operator:v0.6.0","name":"update-agent","volumeMounts":[{"mountPath":"/var/run/dbus","name":"var-run-dbus"},{"mountPath":"/etc/coreos","name":"etc-coreos"},{"mountPath":"/usr/share/coreos","name":"usr-share-coreos"},{"mountPath":"/etc/os-release","name":"etc-os-release"}]}],"tolerations":[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}],"volumes":[{"hostPath":{"path":"/var/run/dbus"},"name":"var-run-dbus"},{"hostPath":{"path":"/etc/coreos"},"name":"etc-coreos"},{"hostPath":{"path":"/usr/share/coreos"},"name":"usr-share-coreos"},{"hostPath":{"path":"/etc/os-release"},"name":"etc-os-release"}]}},"updateStrategy":{"rollingUpdate":{"maxUnavailable":1},"type":"RollingUpdate"}}}
  creationTimestamp: 2018-03-07T18:41:12Z
  generation: 1
  labels:
    app: container-linux-update-agent
  name: container-linux-update-agent
  namespace: reboot-coordinator
  resourceVersion: "11671422"
  selfLink: /apis/extensions/v1beta1/namespaces/reboot-coordinator/daemonsets/container-linux-update-agent
  uid: 1bef0523-2237-11e8-ad1d-8a6be7c01678
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: container-linux-update-agent
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: container-linux-update-agent
    spec:
      containers:
      - command:
        - /bin/update-agent
        env:
        - name: UPDATE_AGENT_NODE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: quay.io/coreos/container-linux-update-operator:v0.6.0
        imagePullPolicy: IfNotPresent
        name: update-agent
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/run/dbus
          name: var-run-dbus
        - mountPath: /etc/coreos
          name: etc-coreos
        - mountPath: /usr/share/coreos
          name: usr-share-coreos
        - mountPath: /etc/os-release
          name: etc-os-release
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      volumes:
      - hostPath:
          path: /var/run/dbus
          type: ""
        name: var-run-dbus
      - hostPath:
          path: /etc/coreos
          type: ""
        name: etc-coreos
      - hostPath:
          path: /usr/share/coreos
          type: ""
        name: usr-share-coreos
      - hostPath:
          path: /etc/os-release
          type: ""
        name: etc-os-release
  templateGeneration: 1
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
CT:
{'ignition': {'config': {}, 'timeouts': {}, 'version': '2.1.0'},
'networkd': {},
'passwd': {},
'storage': {'files': [{'contents': {'source': 'data:,KUBELET_IMAGE_URL%3Ddocker%3A%2F%2Fgcr.io%2Fgoogle_containers%2Fhyperkube%0AKUBELET_IMAGE_TAG%3Dv1.9.3%0A',
'verification': {}},
'filesystem': 'root',
'group': {},
'mode': 420,
'path': '/etc/kubernetes/kubelet.env',
'user': {}},
{'contents': {'source': 'data:,fs.inotify.max_user_watches%3D16184%0A',
'verification': {}},
'filesystem': 'root',
'group': {},
'mode': 420,
'path': '/etc/sysctl.d/max-user-watches.conf',
'user': {}},
{'contents': {'source': 'data:,%23!%2Fbin%2Fbash%0Aset%20-e%0Aexec%20%2Fusr%2Fbin%2Frkt%20run%20%5C%0A%20%20--trust-keys-from-https%20%5C%0A%20%20--volume%20config%2Ckind%3Dhost%2Csource%3D%2Fetc%2Fkubernetes%20%5C%0A%20%20--mount%20volume%3Dconfig%2Ctarget%3D%2Fetc%2Fkubernetes%20%5C%0A%20%20--insecure-options%3Dimage%20%5C%0A%20%20docker%3A%2F%2Fgcr.io%2Fgoogle_containers%2Fhyperkube%3Av1.9.3%20%5C%0A%20%20--net%3Dhost%20%5C%0A%20%20--dns%3Dhost%20%5C%0A%20%20--exec%3D%2Fkubectl%20--%20--kubeconfig%3D%2Fetc%2Fkubernetes%2Fkubeconfig%20delete%20node%20%24(hostname)%0A',
'verification': {}},
'filesystem': 'root',
'group': {},
'mode': 484,
'path': '/etc/kubernetes/delete-node',
'user': {}}]},
'systemd': {'units': [{'enable': True, 'name': 'docker.service'},
{'mask': True, 'name': 'locksmithd.service'},
{'contents': '[Unit]\n'
'Description=Watch for kubeconfig\n'
'[Path]\n'
'PathExists=/etc/kubernetes/kubeconfig\n'
'[Install]\n'
'WantedBy=multi-user.target\n',
'enable': True,
'name': 'kubelet.path'},
{'contents': '[Unit]\n'
'Description=Wait for DNS entries\n'
'Wants=systemd-resolved.service\n'
'Before=kubelet.service\n'
'[Service]\n'
'Type=oneshot\n'
'RemainAfterExit=true\n'
"ExecStart=/bin/sh -c 'while ! "
"/usr/bin/grep '^[^#[:space:]]' "
'/etc/resolv.conf > /dev/null; do sleep 1; '
"done'\n"
'[Install]\n'
'RequiredBy=kubelet.service\n',
'enable': True,
'name': 'wait-for-dns.service'},
{'contents': '[Unit]\n'
'Description=Kubelet via Hyperkube\n'
'Requires=coreos-metadata.service\n'
'After=coreos-metadata.service\n'
'Wants=rpc-statd.service\n'
'[Service]\n'
'EnvironmentFile=/etc/kubernetes/kubelet.env\n'
'EnvironmentFile=/run/metadata/coreos\n'
'Environment="RKT_RUN_ARGS=--uuid-file-save=/var/cache/kubelet-pod.uuid '
'\\\n'
' '
'--volume=resolv,kind=host,source=/etc/resolv.conf '
'\\\n'
' --mount '
'volume=resolv,target=/etc/resolv.conf \\\n'
' --volume '
'var-lib-cni,kind=host,source=/var/lib/cni '
'\\\n'
' --mount '
'volume=var-lib-cni,target=/var/lib/cni '
'\\\n'
' --volume '
'opt-cni-bin,kind=host,source=/opt/cni/bin '
'\\\n'
' --mount '
'volume=opt-cni-bin,target=/opt/cni/bin '
'\\\n'
' --volume '
'var-log,kind=host,source=/var/log \\\n'
' --mount volume=var-log,target=/var/log '
'\\\n'
' --insecure-options=image"\n'
'ExecStartPre=/bin/mkdir -p /opt/cni/bin\n'
'ExecStartPre=/bin/mkdir -p '
'/etc/kubernetes/manifests\n'
'ExecStartPre=/bin/mkdir -p '
'/etc/kubernetes/cni/net.d\n'
'ExecStartPre=/bin/mkdir -p '
'/etc/kubernetes/checkpoint-secrets\n'
'ExecStartPre=/bin/mkdir -p '
'/etc/kubernetes/inactive-manifests\n'
'ExecStartPre=/bin/mkdir -p /var/lib/cni\n'
'ExecStartPre=/bin/mkdir -p '
'/var/lib/kubelet/volumeplugins\n'
'ExecStartPre=/usr/bin/bash -c "grep '
"'certificate-authority-data' "
"/etc/kubernetes/kubeconfig | awk '{print "
"$2}' | base64 -d > "
'/etc/kubernetes/ca.crt"\n'
'ExecStartPre=-/usr/bin/rkt rm '
'--uuid-file=/var/cache/kubelet-pod.uuid\n'
'ExecStart=/usr/lib/coreos/kubelet-wrapper '
'\\\n'
' --allow-privileged \\\n'
' --anonymous-auth=false \\\n'
' --client-ca-file=/etc/kubernetes/ca.crt '
'\\\n'
' --cluster_dns=10.3.0.10 \\\n'
' --cluster_domain=cluster.local \\\n'
' '
'--cni-conf-dir=/etc/kubernetes/cni/net.d '
'\\\n'
' --exit-on-lock-contention \\\n'
' '
'--hostname-override=${COREOS_DIGITALOCEAN_HOSTNAME}'
'\\\n'
' --kubeconfig=/etc/kubernetes/kubeconfig '
'\\\n'
' --lock-file=/var/run/lock/kubelet.lock '
'\\\n'
' --network-plugin=cni \\\n'
' '
'--node-labels=node-role.kubernetes.io/node '
'\\\n'
' '
'--pod-manifest-path=/etc/kubernetes/manifests '
'\\\n'
' '
'--volume-plugin-dir=/var/lib/kubelet/volumeplugins\n'
'ExecStop=-/usr/bin/rkt stop '
'--uuid-file=/var/cache/kubelet-pod.uuid\n'
'Restart=always\n'
'RestartSec=5\n'
'[Install]\n'
'WantedBy=multi-user.target\n',
'name': 'kubelet.service'},
{'contents': '[Unit]\n'
'Description=Waiting to delete Kubernetes '
'node on shutdown\n'
'[Service]\n'
'Type=oneshot\n'
'RemainAfterExit=true\n'
'ExecStart=/bin/true\n'
'ExecStop=/etc/kubernetes/delete-node\n'
'[Install]\n'
'WantedBy=multi-user.target\n',
'enable': True,
'name': 'delete-node.service'}]}}
The reboot is triggered by the automated system, but the node never actually reboots. A forced, manual reboot allows the update to continue as expected.
The idea that the dbus call is failing sounds logical to me.
I'm sorry it took me so long to respond; I dropped the ball on this one. What platform are you using? Is it a cloud provider or bare metal?
Looking at the ignition config you provided, it looks like typhoon is setting up a delete-node.service that is supposed to delete the node on shutdown. Digging into the typhoon CLCs, it looks like it only creates that unit for worker nodes, which would explain why you are not seeing the hang on your controllers. I'm not sure exactly how it works, but if it does run on shutdown, it is possible that for some reason it's hanging and preventing the machine from shutting down. Do you still have any machines that are stuck in this state, or did you manually reboot them all? It might be interesting to see if the unit is generating any logs or failing in some way.
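On a node that is currently wedged (or right after one comes back up), something along these lines should show whether that unit is the holdup; this is just standard systemd tooling, nothing CLUO-specific:

# Current state and any recorded failures for the unit
systemctl status delete-node.service
# Everything the unit has logged
journalctl -u delete-node.service --no-pager
# On a machine that is stuck mid-shutdown, this lists the queued jobs it is waiting on
systemctl list-jobs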
It may be worth filing an issue on the typhoon repo about this. I don't know if the delete-node.service unit is unique to typhoon or if it is a more widely used approach, but it feels like the most likely culprit at this point.
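For reference, the /etc/kubernetes/delete-node file in the ignition config above URL-decodes to roughly the following. It runs kubectl inside an rkt container against the apiserver at shutdown time, so if networking or DNS is already being torn down when the unit's ExecStop= fires, it seems plausible that it would simply hang:

#!/bin/bash
set -e
exec /usr/bin/rkt run \
  --trust-keys-from-https \
  --volume config,kind=host,source=/etc/kubernetes \
  --mount volume=config,target=/etc/kubernetes \
  --insecure-options=image \
  docker://gcr.io/google_containers/hyperkube:v1.9.3 \
  --net=host \
  --dns=host \
  --exec=/kubectl -- --kubeconfig=/etc/kubernetes/kubeconfig delete node $(hostname)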
I have CLUO running on my K8s cluster with CoreOS on both controller and worker nodes.
The reboots triggered on the controllers successfully complete, but the reboots on worker nodes hang indefinitely. For example:
Once this completes, the node is cordoned and should reboot, but the reboot itself never occurs.
Where should I check first to help debug this?
Thanks!