coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/

Build out torcx for general use in advanced deployment scenarios #2215

Open bgilbert opened 6 years ago

bgilbert commented 6 years ago

Problem

/etc/coreos/docker-1.12 supports exactly two versions of Docker: 1.12 and “current” (the most recent Docker CE Stable at the time a Container Linux version was branched for alpha). This is accomplished by shipping two Docker torcx images in Container Linux, increasing the (compressed) size of the OS and update payload. This mechanism does not support other versions of Docker (such as 17.03, which has been validated for use with Kubernetes 1.8), and does not support torcx images not shipped in the OS.

/etc/coreos/docker-1.12 is intended to be a temporary interface to the torcx infrastructure, and will be dropped from the Container Linux alpha channel on June 6, 2018. Meanwhile, we need to build out the long-term torcx UX.

Terminology

Context

Container Linux is a compact, unified OS focused on providing the core infrastructure needed to run containers. Most additional software needed on a Container Linux system can, and should, run in a container. torcx is not intended to change that model.

In some environments, software components must be added to a Container Linux system which cannot run in a container, or an alternate version of such a component must be used. An obvious example is the container runtime itself. torcx is intended to provide a compact mechanism for managing such addons. This is expected to be a relatively advanced feature of Container Linux, and most users should never need to interact with torcx.

This bug is a proposal for the interface that torcx will present to users. Nothing here is finalized, and discussion is encouraged. The proposal does not specify the addons or addon versions which will be provided by CoreOS, and we expect that CoreOS will supply only a very small number of torcx images in addition to those shipped with Container Linux itself.

Requirements

  1. torcx should support addons for software that cannot usefully run in a container, and for which either version selection may be desirable (e.g. Docker) or inclusion in the OS by default is inappropriate (e.g. third-party kernel modules).
  2. For addons included in the OS, Container Linux will provide a vendor profile selecting a default version of each, so the default experience won’t change. These defaults will track the latest stable version of each addon, subject to the usual distro caveats about blocker bugs and timing of release cycles.
  3. Users may need to select an alternate version of one or more addons for compatibility with their environment (e.g. a particular version of Kubernetes) or to adhere to a support matrix. They may also need to install additional addons. Users should be able to specify these requirements without listing addons that should follow the vendor profile.
  4. Non-default torcx images will not be shipped in the OS, but will be automatically downloaded and installed during OS updates. Systems not connected to the public Internet (and receiving Container Linux updates from a private CoreUpdate instance) will need a way to obtain these images. tectonic-torcx already has an externally-hosted manifest and knows how to download torcx images on its own; we should generalize this infrastructure where possible. Image download should be integrated into the update process, via some combination of torcx, update_engine, coreos-postinst, or similar. Remote images must be verified via either a locally-configured hash or a signature.
  5. CoreOS will not ship a particular addon version indefinitely. If a user has selected Docker 1.12, and Container Linux no longer supports Docker 1.12, the OS should take a well-defined action. There are no good options: we could fail to update to newer OS versions, update but fail to boot, switch back to the default image, or boot successfully but with no image installed. In any event, CoreOS will need to provide advance notice of the change. (Users cannot, in general, continue using the previous torcx image, because a particular build of e.g. Docker 1.12.6 will not work on arbitrary Container Linux releases. Changes in the version of the Go toolchain, dependent libraries, etc., can require addons to be rebuilt. Notionally, a torcx image is targeted at exactly one release of Container Linux, though the image hosting mechanism may deduplicate them as an implementation detail.)
  6. Users must be able to supply their own images and signing keys.
  7. torcx should be easily managed by higher-level tooling. For example, Kubernetes 1.8 should be able to select Docker 17.03; users shouldn’t need to do it themselves.
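The locally-configured-hash option in requirement 4 could be checked roughly as follows. This is only a sketch: the function name `verify_image` and the `<algorithm>-<hexdigest>` string format are assumptions made for illustration, not something this proposal defines.

```python
import hashlib


def verify_image(path, expected, chunk_size=1 << 20):
    """Return True if the file at `path` matches `expected`.

    `expected` is assumed to look like "<algorithm>-<hexdigest>", for
    example "sha512-ab12...". That format is hypothetical.
    """
    algorithm, _, want = expected.partition("-")
    if not want:
        raise ValueError(f"malformed hash string: {expected!r}")
    digest = hashlib.new(algorithm)
    # Stream the file so that large images are never held in memory at once.
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == want.lower()
```

An image that fails this check would be discarded rather than installed. Signature verification, the other option in requirement 4, is not covered here.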

Proposal

Remotes

Add a concept called a “remote” and a corresponding JSON schema. A remote is a network image repository, represented by a short name (e.g. coreos or com.coreos.cl) and the following attributes:

Remotes are defined via individual files in a search path over the usual set of directories (/usr, /etc, /run). There should be a mechanism for overriding individual attributes of a remote, perhaps via drop-ins, to allow offline systems to use their own mirror for the CoreOS-provided remote. Some users will also configure their own remotes.
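One way to picture attribute-level overrides is to merge same-named definition files along the search path. The sketch below assumes a `<name>.json` file layout and a caller-supplied directory order. Both are hypothetical, as is the function name `load_remote`, and it uses whole-file layering in place of the drop-ins suggested above. It does not assume any particular attribute names.

```python
import json
from pathlib import Path


def load_remote(name, search_path):
    """Merge every definition of remote `name` found along `search_path`.

    Directories later in `search_path` override individual attributes set
    by earlier ones, so a mirror definition can replace one attribute and
    inherit the rest from the vendor definition.
    """
    merged = {}
    for directory in search_path:
        candidate = Path(directory) / f"{name}.json"
        if candidate.is_file():
            merged.update(json.loads(candidate.read_text()))
    if not merged:
        raise KeyError(f"remote {name!r} is not defined in {list(search_path)}")
    return merged
```

With this shape, an offline system would only need to ship a small file that overrides the location of the CoreOS-provided remote. Every other attribute would still come from the vendor definition.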

Question: should local torcx stores also be treated as a remote? Perhaps we could drop the distinction between them.

Manifest

This is essentially the tectonic-torcx manifest. It is downloaded from the network and lists the images available from a remote. The manifest is signed with a detached GPG signature.

Relative to the tectonic-torcx manifest, sourcePackage and defaultVersion can be dropped, and relative URLs should be permitted if they aren’t already.

Question: do path declarations make sense here? Remote repositories shouldn’t be able to reference local paths.

Profile

Users (or higher-level tooling working on their behalf) can create a custom profile specifying the image references that should be available on the system. These references override the corresponding image references in the vendor profile.

Extend image objects in torcx profiles to optionally specify the name of a remote. This would be handled during profile merging in the same way as other image attributes: the remote, if any, is taken from the last declaration for a given image. If a remote is specified, the image is fetched during the fetch phase.
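The merge rule can be sketched as "last declaration wins, as a whole". The dict keys `name`, `reference` and `remote` and the function name `merge_profiles` are assumptions for illustration. The real profile schema may differ.

```python
def merge_profiles(vendor_images, user_images):
    """Return the effective image list for a system.

    Each image is a dict with "name", "reference" and, optionally,
    "remote". For a given name the last declaration replaces earlier ones
    entirely, so the remote (if any) comes from that last declaration.
    """
    merged = {}
    for image in list(vendor_images) + list(user_images):
        merged[image["name"]] = dict(image)
    return list(merged.values())
```

For example, a vendor profile declaring `docker` and a user profile declaring `docker` at reference `17.03` from remote `coreos` would yield a single `docker` entry carrying the user's reference and remote. Images the user did not mention keep their vendor declaration.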

Fetching

Fetching a profile requires downloading the manifest and signature for any referenced remote, checking the signature, comparing image hashes to any images cached locally, and fetching any missing images. This can’t run as part of the torcx generator, since networking may not be up yet. Fetching should occur in the initramfs on first boot, after Ignition runs, and also from coreos-postinst after an update. That duplication is unfortunate, but it allows the system to defer rebooting after an update until all of the pieces are available.
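The comparison step of the fetch phase might look like the sketch below. It assumes the manifests have already been downloaded and their signatures checked. The flattened manifest shape (a mapping from `(name, reference)` to `url` and `hash`) and the function name `plan_fetch` are simplifications for illustration, not the tectonic-torcx format.

```python
def plan_fetch(images, manifests, cached_hashes):
    """List the images that still have to be downloaded.

    images:        effective profile entries (dicts with "name",
                   "reference" and an optional "remote").
    manifests:     {remote name: {(name, reference): {"url", "hash"}}},
                   already signature-checked.
    cached_hashes: set of hashes of images already present locally.
    """
    missing = []
    for image in images:
        remote = image.get("remote")
        if remote is None:
            continue  # no remote on this image: nothing to fetch
        key = (image["name"], image["reference"])
        entry = manifests.get(remote, {}).get(key)
        if entry is None:
            raise LookupError(f"{key[0]}:{key[1]} is not listed by remote {remote!r}")
        if entry["hash"] not in cached_hashes:
            missing.append({"name": key[0], "reference": key[1], **entry})
    return missing
```

Because the function only plans and never downloads, the same logic could serve both callers: the initramfs on first boot and coreos-postinst after an update.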

Question: this design requires a manual fetch operation if the user changes the local profile. Should the initramfs attempt to detect this case and fetch automatically?

Deprecation

Once an image becomes unsupportable, remove it from the OS, or in the case of a CoreOS-maintained remote, stop adding new OS-version-specific images. A Container Linux system which uses that image will then fail future OS updates in the coreos-postinst phase. This approach seems to most directly conform to the user’s stated intent: their workload will continue running, even at the cost of future security updates. All other alternatives directly cause breakage one way or another. In conjunction, Container Linux should do a better job of reporting update failures so they will be less likely to go unnoticed.
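The failure mode described here amounts to a pre-flight check in the post-install phase. The sketch below is hypothetical throughout: `UpdateBlocked`, `check_release` and the `builds` mapping are names invented for illustration, and the real check would live in coreos-postinst or torcx itself.

```python
class UpdateBlocked(Exception):
    """Raised when a selected image has no build for the target OS release."""


def check_release(target_release, images, builds):
    """Refuse an update if any selected image is unavailable for it.

    builds maps an OS release identifier to the set of (name, reference)
    pairs that have an image built for that release.
    """
    available = builds.get(target_release, set())
    unavailable = [
        f"{image['name']}:{image['reference']}"
        for image in images
        if (image["name"], image["reference"]) not in available
    ]
    if unavailable:
        raise UpdateBlocked(
            f"no images for release {target_release}: {', '.join(unavailable)}"
        )
```

Raising here leaves the running system untouched, which matches the stated intent that the workload keeps running. The exception message is also the natural place to hook in better reporting of update failures.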

Tooling

lucab commented 6 years ago

Related question by @roffe at https://github.com/coreos/torcx/issues/94: will the torcx userland binary be part of the OS? Right now we just provide it as a Docker image, which may pose a chicken-and-egg problem.

bgilbert commented 6 years ago

I expect we'll ship the userspace binary in the OS, yes.

bgilbert commented 6 years ago

The manifest should be signed with an inline signature, not detached. This makes the parsing more complex, but fixes signature validation failures in the presence of mirrors/caches/CDNs which may not sync both files at the same time.

Iodun commented 6 years ago

Is this project still on track? :) According to the CoreOS Blog, torcx is planned to be fully implemented by May 23, 2018.

nstielau commented 6 years ago

@Iodun This is on track at this point.

lucab commented 6 years ago

A quick status update on this, as we have reached the dates that we were initially targeting. The original proposal here has been fleshed out with additional details, but the implementation is still in progress.

The current status can be tracked as

BugRoger commented 6 years ago

According to the CoreOS Blog, the flip to Docker 18.x and removal of the Docker 1.12 workaround is planned for July 18th.

As far as I can tell, Kubernetes v1.10 is still not validated on Docker 18.x. We'd prefer to run against a validated version.

The removal of the 1.12 workaround without another mitigation forces us either to upgrade a large fleet of clusters to 1.8+ or to turn off OS upgrades in general. 🙁

Is that still going to happen?

lucab commented 6 years ago

@BugRoger it is going to happen, but the initial timeline has slipped as we are still reviewing/merging some of the components involved. This has also been discussed on the coreos-user mailing list. We are keeping the references in the comment above updated as we go, and we'll publish a timeline update once the groundwork is merged.