coreos / tectonic-forum


Container Linux Updates #150

Open bobhenkel opened 7 years ago

bobhenkel commented 7 years ago

Tectonic Version

1.6.4-tectonic.1

Environment

AWS

Expected Behavior

Container Linux should auto-update when subscribed to a channel when an update is added to the channel.

The Tectonic web console should also not tell me everything is up to date when there are a handful of updates that have not been applied. My cluster was stood up on May 30th and there have been three updates on the stable channel, yet all but 2 of my masters are still on 1353.8.0.

Actual Behavior

The Tectonic Console is claiming all my nodes are updated, yet most of them are still on 1353.8.0 and I'm subscribed to the stable channel. 3 masters (1 on 1353.8.0 and 2 on 1409.2.0); 6 workers (all 6 on 1353.8.0).

Here's a screenshot for one of the workers. It looks like it's checking for updates yet coming back with nothing. [screenshot: screencapture at wed jun 21 01 37 09 cdt 2017]

Other Information

I was able to ssh onto nodes, run update_engine_client -check_for_update, and force them to check for updates and update themselves. However, as you well know, this is not the way one should update Container Linux.
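The manual workaround described above can be scripted across nodes. A minimal sketch, assuming SSH access as the core user and that the node addresses below are placeholders for your own (`update_engine_client` and its `-status` / `-check_for_update` flags ship with Container Linux):

```shell
# Placeholder node addresses; substitute your actual masters/workers.
for node in 10.0.0.10 10.0.0.11 10.0.0.12; do
  # Report the current update_engine state on the node.
  ssh "core@${node}" 'update_engine_client -status'
  # Force an immediate update check (bypasses the rollout throttle).
  ssh "core@${node}" 'update_engine_client -check_for_update'
done
```

This is only a workaround; as discussed below, cluster-coordinated updates are the intended mechanism.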

When I ssh into these nodes I see the message Update Strategy: No Reboots. Not sure if that means anything or not. The 2 masters that updated both have this too, so I'm kind of confused. 1 of the masters also showed this: Update Strategy: No Reboots, Failed Units: 1 (bootkube.service)

Maybe that relates to this https://github.com/coreos/tectonic-installer/issues/797

Feature Request

I'd also like to see a web interface to run update_engine_client -check_for_update on select nodes or on all nodes, in addition to the auto-update abilities that I was hoping for.

bobhenkel commented 7 years ago

Might be useful if others see this issue...

My teammate also has a cluster up, and his auto-upgraded all nodes, both masters and workers, to the latest. One difference which may have something to do with this (or not): my cluster was upgraded from Tectonic 1.6.2-tectonic.1 to 1.6.4-tectonic.1, whereas his was stood up on 1.6.4-tectonic.1 from the start.

crawford commented 7 years ago

Since no one has responded, I'll provide an explanation of what you are seeing. The Container Linux team throttles OS updates so that in the event there is a bug, it doesn't take out everyone at once. Your machines are hitting the throttle (just like every other machine around the globe), but Tectonic isn't smart enough yet to explain what is going on. update_engine_client -update "fixes" the issue because we've configured manual updates to bypass the throttle. Additionally, the "No Reboots" message can be safely ignored. That mechanism is watching for Locksmith which doesn't run in Tectonic clusters (it's been replaced by the Container Linux update operator).

robszumski commented 7 years ago

There are a few different issues at play here, and I'll explain each one. You are correct that this experience is not ideal and we need to enhance it.

MOTD generation

This is a straight-up bug. Tracking that issue here: https://github.com/coreos/container-linux-update-operator/issues/88

Container Linux Update Operator architecture

The CLUO orchestrates updates around the cluster via the Kubernetes API, more specifically annotations and labels on the nodes. The centralized CLUO decides which nodes should update from the pool of nodes that are ready to update. This info comes from each node itself and the update-agent on it, which communicates it up to the Kubernetes API.
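That per-node state is visible directly on the Node objects. A sketch of inspecting it, assuming the `container-linux-update.v1.coreos.com` annotation prefix used by CLUO (the exact annotation names may differ by CLUO version):

```shell
# Show each node's CLUO-related annotations (e.g. reboot-needed, status).
kubectl get nodes -o json | jq '
  .items[]
  | { name: .metadata.name,
      cluo: (.metadata.annotations
             | with_entries(select(.key
                 | startswith("container-linux-update.v1.coreos.com")))) }'
```

Comparing these annotations across nodes shows which ones the update-agent has marked as ready to update versus already updated.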

Global Update Service

Updates are rolled out with rate limits and other quality-assurance measures to keep them safe and consistent. This means that when nodes ask for the latest version, a subset are told about an update and the rest aren't told until a later check.

Tectonic Console

When the Console says a node is up to date, that info is sourced from the node's view of the world, since there are a variety of reasons a node may decide not to update locally. Local issues like network connectivity can impact communicating with the update server.

We are going to explore updating the Console to check both against the node's local view of "am I up to date?" and the global view of the latest releases.

bobhenkel commented 7 years ago

That helps explain a lot and makes sense. To avoid any throttling, can folks run their own Container Linux update server? I'm guessing that once security issues are found, orgs will want full control over how soon (or late) the upgrade is applied.

crawford commented 7 years ago

@bobhenkel I believe that is possible today (though I'm fuzzy on the details). The longer term plan is to actually run an instance of Core Update in the cluster itself, so there is no throttling. That will also fix the issue of knowing which machines are out of date.

Emmenemoi commented 7 years ago

A consequence of this: the update operator then gets stuck in an "update failed" status, and no documentation can be found for clearing the status and retrying after a manual node OS update.
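There doesn't appear to be an officially documented recovery procedure for this. One plausible, untested approach, assuming CLUO tracks per-node state in the `container-linux-update.v1.coreos.com` annotations and that the agent pod label and namespace below match your deployment (both are assumptions), is to reset the node's status annotation and restart its update-agent pod so it re-checks and re-reports:

```shell
# NODE is a placeholder for the affected node's name.
NODE=worker-1

# Inspect the node's current annotations to find the CLUO state.
kubectl get node "$NODE" -o jsonpath='{.metadata.annotations}'

# Remove the failed-status annotation so the agent can retry
# (trailing "-" deletes the annotation; the name is assumed).
kubectl annotate node "$NODE" container-linux-update.v1.coreos.com/status-

# Restart the update-agent pod on that node (label/namespace assumed).
kubectl -n tectonic-system delete pod \
  -l app=container-linux-update-agent \
  --field-selector spec.nodeName="$NODE"
```

Treat this as a starting point for experimentation on a non-critical node, not a supported procedure.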