I also confirm the issue. So far, switching to the beta channel helps.
This should have been resolved last Thursday when we rolled the Container Linux Stable channel back to 1632.3.0. What you were seeing was k8s-node-bootstrap.service forcing an update to the latest Stable release, 1688.4.0 (which explains the first reboot). Unfortunately, that release suffered from reboot loops. Frustratingly, a vanilla deployment of Container Linux wouldn't have hit this, because update_engine (the client that actually updates the OS) doesn't force updates by default and would have held at 1632.3.0 until we resumed updates on the Stable channel.
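If you want to double-check what a node is tracking, here is a quick sketch using the standard Container Linux paths and tools (verify against your release):

```sh
# Which channel the machine follows (GROUP=stable|beta|alpha).
cat /etc/coreos/update.conf

# What update_engine is currently doing (idle, checking, downloading, ...).
update_engine_client -status
```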
Now that the Stable channel has been rolled back, you should be good to deploy again. Thanks for your patience.
Thanks for the insight. I opened this bug report solely for tracking purposes on your end; I was already able to resolve my problem by manually pinning a Container Linux version. Feel free to close this issue if it's no longer relevant.
Okay. Thank you again for the detailed report.
@crawford and I discussed this more offline. Correcting https://github.com/coreos/tectonic-installer/issues/3143#issuecomment-378312705 for the record:
From the log of the first boot, it appears that it was also booting 1688.4.0, which means the problem with 1688.4.0 was at least a little intermittent. Deployments from http://stable.release.core-os.net/amd64-usr/current/ would indeed have deployed 1688.4.0 until Thursday evening.
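As a side note, what a channel currently serves can be checked against the release server directly; a sketch (the version.txt layout is the standard one on release.core-os.net):

```sh
# Show which Container Linux version the Stable channel currently points at.
curl -s http://stable.release.core-os.net/amd64-usr/current/version.txt
# After the rollback this should report COREOS_VERSION=1632.3.0.
```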
The log of the second boot doesn't demonstrate the boot loop. This message:

```
error: no such device: 00000000-0000-0000-0000-000000000001
```
is routine and occurs on every boot other than the first. 4.14.19-coreos is indeed the 1632.3.0 kernel, so this log just shows the machine rebooting into 1632.3.0. Automatic updates from 1632.3.0 to 1688.4.0 were only enabled for about half an hour on Tuesday, so it's unlikely that you were seeing a problem after updating from 1632.3.0; it's probable that your machine wasn't updating at all.
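For anyone checking which release a node actually booted into, a sketch using standard files and commands (the version values are the ones from this thread):

```sh
# Kernel 4.14.19-coreos corresponds to the 1632.3.0 release.
uname -r

# /etc/os-release carries the exact Container Linux version string.
grep VERSION /etc/os-release
```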
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
Tectonic version (release or commit hash):
Terraform version (`terraform version`):
Platform (aws|azure|openstack|metal|vmware):
What happened?
The worker and master instances get stuck in a boot loop. The etcd instances work fine.
Below is a system log obtained through the AWS console:
What you expected to happen?
The cluster should just boot normally.
How to reproduce it (as minimally and precisely as possible)?
As of the time of writing, I can reproduce this issue by setting `tectonic_container_linux_version` either to `latest` or to `1688.4.0`. If I manually set the version to `1632.3.0`, I'm able to create the cluster just fine (see the sketch below).
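For reference, the pin looks roughly like this in `terraform.tfvars` (the variable names come from this installer; the channel line is an assumption about the default setup):

```hcl
# terraform.tfvars -- pin nodes to the known-good Container Linux release
# instead of "latest", so the bootstrap doesn't pull 1688.4.0.
tectonic_container_linux_channel = "stable"   # assumed default channel
tectonic_container_linux_version = "1632.3.0" # known-good release
```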
Anything else we need to know?
I'm launching the cluster using Terraform and I'm using m5.large instances. I could reproduce the issue on m4.large instances as well.
The EBS volumes are of type `gp2`.

From looking through the syslog, it seems that the machines are able to come up after being created. However, something on the worker and master machines then triggers a reboot, and the instances fail to come up afterwards.
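For completeness, the same log can be pulled without the console UI, roughly like this (standard AWS CLI; the instance ID is a placeholder):

```sh
# Fetch the system/console log of a looping worker or master node.
aws ec2 get-console-output \
  --instance-id i-0123456789abcdef0 \
  --output text
```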
Edit:

As a follow-up, the 1632.3.0 instances are also not able to update themselves to 1688.4.0. They fail with the same error and then fall back to the previous release:

I realize that this is probably more of a Container Linux issue/bug, but I decided to file it here because, as of right now, this tool doesn't work for launching a new cluster with the default settings. Any running cluster will just gracefully fail the update and continue working.