containerd / containerd

An open and reliable container runtime
https://containerd.io
Apache License 2.0
17.46k stars 3.45k forks source link

Support parsing signals based on container's platform #5931

Open katiewasnothere opened 3 years ago

katiewasnothere commented 3 years ago

What is the problem you're trying to solve Containers run that rely on containerd and containerd's packages may be running a platform different than the host. For example, linux containers on windows, which run using a lightweight linux virtual machine. When sending signals to a container, we should use signals for the container's platform, not the host's.

Describe the solution you'd like We parse signals based on the container's platform.

Additional context I previously made a PR to accomplish this here, but it fell off and eventually was closed. Recently, this PR was made to remove signal parsing in containerd in favor of using moby/sys platform parsing in here.

katiewasnothere commented 3 years ago

@thaJeztah and @kzys

katiewasnothere commented 3 years ago

I'm not really sure what makes sense to do here. I don't think it makes sense to implement something like this in moby/sys/signal but to implement it in containerd would mean that we'd need to hardcode signal values (since we couldn't use build constraints to use unix or windows official golang packages), which defeats the point of previous work https://github.com/containerd/containerd/pull/5693 @kzys.

kzys commented 3 years ago

The current signal parsing logic is designed to be compatible with Docker. Especially SIGRTMIN handling, it is compatible with neither the host or the container, if they use musl libc.

https://github.com/containerd/containerd/pull/5693#discussion_r665524873

However I think this should be a static value anyway since containerd isn't doing anything except forwarding it along. Having this value change depending on what the host userspace is linked against doesn't really make sense since it is the container process that has to process the signal.

It should probably also just be a copy of what Docker does since this is going to be the expectation from build->deploy

Regarding LCOW, how much does containerd know the fact that LCOW run Linux VMs? Should it be handled by the shim or an OCI runtime (runhcs)?

katiewasnothere commented 3 years ago

Afaik, it doesn't know that LCOW runs using a VM. However, it does know what platform the pod is running, which was part of my previous solution to parse based on the container's platform (see my closed linked PR above). Moving signal parsing to the shim or runtime would break with the current agreement between containerd and shims though, right?

(side note: LCOW containers using containerd use runc as the runtime now 🙂)

thaJeztah commented 3 years ago

This one is tricky indeed. I know we used the string representation in Docker, so that a user can pass --signal=SIGSOMETHING from the CLI (which could be running Windows, Mac, or Linux), and have the runtime do the translation.

LCOW adds another "mismatch" in that respect, because the host platform doesn't match the container's platform. I know we had some discussions about that with LCOW in the Docker Engine (i.e. does the VM represent the container, or the process inside the VM?).

I (personally) like having these translations happen as close to the actual container as possible, to keep some abstraction (I hope I'm able to phrase this in a meaningful way); basically (where possible) do not require "higher level" runtimes (containerd, docker engine) to be aware of platform specifics, and delegate that to (e.g.) runc (as it already will have to deal directly with the kernel). "higher level" runtimes can then generate a (somewhat) platform agnostic spec and assume that the right thing happens. Looks like the runtime spec also includes kill <signal>, so from that perspectice it should be possible, but the runc code is using unix.SignalNum(), so I doubt that will support the SIGRTMIN signals; https://github.com/opencontainers/runc/blob/5fb9b2a006725d468381de8d300bb9baecc726f7/kill.go#L65

func parseSignal(rawSignal string) (unix.Signal, error) {
    s, err := strconv.Atoi(rawSignal)
    if err == nil {
        return unix.Signal(s), nil
    }
    sig := strings.ToUpper(rawSignal)
    if !strings.HasPrefix(sig, "SIG") {
        sig = "SIG" + sig
    }
    signal := unix.SignalNum(sig)
    if signal == 0 {
        return -1, fmt.Errorf("unknown signal %q", rawSignal)
    }
    return signal, nil
}

I must admit I'm not sure where these signals are used directly (in containerd), for example the Process.Kill() interface looks to define the signal as a syscall.Signal (which would complicate this if the container is a different platform as the host); https://github.com/containerd/containerd/blob/a0ae24b98494c2d855e7c8c5b899f363a3d2b181/process.go#L42

katiewasnothere commented 3 years ago

I agree, that seems like a reasonable mindset @thaJeztah.

In particular, CRI attempts to parse signals before calling Process.Kill, see here. Snippet below:

                stopSignal := "SIGTERM"
        if container.StopSignal != "" {
            stopSignal = container.StopSignal
        } else {
            // The image may have been deleted, and the `StopSignal` field is
            // just introduced to handle that.
            // However, for containers created before the `StopSignal` field is
            // introduced, still try to get the stop signal from the image config.
            // If the image has been deleted, logging an error and using the
            // default SIGTERM is still better than returning error and leaving
            // the container unstoppable. (See issue #990)
            // TODO(random-liu): Remove this logic when containerd 1.2 is deprecated.
            image, err := c.imageStore.Get(container.ImageRef)
            if err != nil {
                if err != store.ErrNotExist {
                    return errors.Wrapf(err, "failed to get image %q", container.ImageRef)
                }
                log.G(ctx).Warningf("Image %q not found, stop container with signal %q", container.ImageRef, stopSignal)
            } else {
                if image.ImageSpec.Config.StopSignal != "" {
                    stopSignal = image.ImageSpec.Config.StopSignal
                }
            }
        }
        sig, err := signal.ParseSignal(stopSignal)
        if err != nil {
            return errors.Wrapf(err, "failed to parse stop signal %q", stopSignal)
        }

Which I'm concerned about since even if we don't parse the signal to syscall.Signal using ParseSignal we'd still need to potentially translate the signal into a uint value so we can pass it through task.kill

As for runc, we could update kill to use the moby/sys/signals package. For LCOW, runc runs in the hosting VM, so if we can get to this point for signal parsing (iow if we can get passed cri), then that would work for us.

thaJeztah commented 3 years ago

In particular, CRI attempts to parse signals before calling Process.Kill,

Yes, I'm wondering if it should be considered to make Process.Kill() accept a string instead of syscall.Signal (or drop the argument altogether and pass it as KillOpt)

katiewasnothere commented 3 years ago

Yes, I'm wondering if it should be considered to make Process.Kill() accept a string instead of syscall.Signal (or drop the argument altogether and pass it as KillOpt)

Is that something that the community would be okay with?

kzys commented 3 years ago

Since syscall package is tied to the host platform, migrating to string or int makes sense for supporting cross-platform containers such as LCOW. But this may need to wait containerd 1.6.x or containerd 2.0.x.

BTW StopSignal is a string in OCI Image Spec.

Regarding LCOW, is the below diagram accurate? If so, would the shim responsible for Windows -> Linux signal conversion?

Kubelet <--> CRI <--> containerd <--> LCOW shim <--> Linux VM --> runc
katiewasnothere commented 3 years ago

migrating to string or int makes sense for supporting cross-platform containers such as LCOW. But this may need to wait containerd 1.6.x or containerd 2.0.x.

That's fine. We use a fork of containerd today so as long as the change is in upstream main we can just cherry pick it as needed until it's in a release.

Regarding LCOW, is the below diagram accurate? If so, would the shim responsible for Windows -> Linux signal conversion? Kubelet <--> CRI <--> containerd <--> LCOW shim <--> Linux VM --> runc

@kzys Pretty much correct except that we use the same shim for LCOW and WCOW. Are you asking if today the shim is responsible for that conversion or suggesting to do that? The shim still runs on the host in the LCOW case but runc runs inside the linux VM.

katiewasnothere commented 3 years ago

Would it be acceptable to add a new field on the task API's KillRequest message named signal_name that takes a string representing the signal instead of trying to make a breaking change to the API?

thaJeztah commented 3 years ago

Since syscall package is tied to the host platform, migrating to string or int makes sense for supporting cross-platform containers such as LCOW. But this may need to wait containerd 1.6.x or containerd 2.0.x. ..

Would it be acceptable to add a new field on the task API's KillRequest message named signal_name that takes a string representing the signal instead of trying to make a breaking change to the API?

Arf. I wanted to reply to this.

Yes, I was thinking along those lines as well; I think with that approach it should be possible to add this in a backward compatible way;

@kzys @dmcgowan @cpuguy83 WDYT?

katiewasnothere commented 3 years ago

I'm going to go ahead and start working on that implementation then unless there are any objections from anyone

thaJeztah commented 3 years ago

Thanks! Works for me :+1:

I see I didn't mention it in my reponses, but for completeness / consideration:

I think SignalName should accept both "names" and "numbers", similar to how github.com/moby/sys/signal.ParseSignal handles them. This allows for well-known named signals to be specified, but (for flexibility) also accepts a numeric value (passed as string) for situations where some esoteric signal is needed. This might also help the migration to the new field (where the passed syscall.Signal can be passed both as syscall.Signal, and as a string containing its value, without the need to convert it back to a corresponding name).

If the above is implemented, perhaps we need a different name (KillSignal? StopSignal ? naming is hard!).

kzys commented 3 years ago

Let me summarize my thoughts. Here is what we have on LCOW

containerd <--(1)--> runhcs-shim <--(2)--> Linux VM <--(3)--> runc

Option 1: Pass string signals from 1 to 3

Pros

Cons

Option 2: Pass string signals in 1. runhcs-shim converts that to integers since it knows the underlying platform while containers are Linux

Pros

Cons

thaJeztah commented 3 years ago

runc is a reference implementation and there are other implementations which are written by other languages (crun in C, youki in Rust). They have to know moby/sys's signal parsing logic.

I think with the proposed change, they'd still be able to pass the signal as it is today (numeric value) and things continue to work.

runc and other implementations which runs on Linux don't have to deal with the cross-platform signal handling.

Afaics, runc is already dealing with conversion of signal names to numbers (it gets the signal passed as a string (command line arguments are always strings), and accepts both names and numbers. The only difference would be that using "moby/signal" extends the list of known signals to add the SIGRTMIN signals (alternatively, we should consider having those added to golang.org/x/sys)

kzys commented 3 years ago

The only difference would be that using "moby/signal" extends the list of known signals to add the SIGRTMIN signals (alternatively, we should consider having those added to golang.org/x/sys)

moby/signal is slightly "opinionated" than runc or golang.org/x/sys regarding SIGRTMIN handling. The actual value is depending on libc (musl and glibc have the different numeric value). Pushing that to golang.org/x/sys may be hard.

Because of that, I'm bit hesitant to have the logic in runc. runc is the reference implementation and other implementations may need to do what runc does to be compatible with, even the conversation logic is not covered by the OCI Runtime Spec. As a result, we may make the container world somewhat incompatible with musl-based Linux distros such as Alpine.

While I don't think it would cause a huge production pain, I'm unsure that this is the right solution.

cpuguy83 commented 3 years ago

That's the thing though. This is translating syscall numbers that the container will be handling. We are just sending them. The string->syscall number conversion needs to be standardized and cannot depend on go vs musl vs glibc vs whatever.

The container author knows best here, we just need to send a consistent value.

kzys commented 3 years ago

Thanks. I've updated the pros/cons above to add the following;

containerd doesn't have to deal with the conversion logic anymore. This would help other CRI implementations such as CRI-O.

Is this correct? It is bit different from what @thaJeztah said though.

I think with the proposed change, they'd still be able to pass the signal as it is today (numeric value) and things continue to work.

We are going to clarify the conversion logic in the OCI Runtime Spec and other OCI runtimes would follow the change. The only difference is about SIGRTMIN handling and most containers wouldn't be affected though.

katiewasnothere commented 3 years ago

I'm not sure I understand what you mean by this. How would option two allow us to standardize behavior through OCI?

We are going to clarify the conversion logic in the OCI Runtime Spec and other OCI runtimes would follow the change.

Clarify in what way?

kzys commented 3 years ago

Sorry, I put the statement in the wrong section. Option 1 allows/enforces us to standardize the behavior through OCI.

Right now, OCI Runtime Spec describes the behavior like below;

https://github.com/opencontainers/runtime-spec/blob/20a2d9782986ec2a7e0812ebe1515f73736c6a0c/runtime.md#kill

This operation MUST generate an error if it is not provided the container ID. Attempting to send a signal to a container that is neither created nor running MUST have no effect on the container and MUST generate an error. This operation MUST send the specified signal to the container process.

But Linux manpages describes real-time signals like below;

https://www.man7.org/linux/man-pages/man7/signal.7.html

       The Linux kernel supports a range of 33 different real-time
       signals, numbered 32 to 64.  However, the glibc POSIX threads
       implementation internally uses two (for NPTL) or three (for
       LinuxThreads) real-time signals (see pthreads(7)), and adjusts
       the value of SIGRTMIN suitably (to 34 or 35).  Because the range
       of available real-time signals varies according to the glibc
       threading implementation (and this variation can occur at run
       time according to the available kernel and glibc), and indeed the
       range of real-time signals varies across UNIX systems, programs
       should never refer to real-time signals using hard-coded numbers,
       but instead should always refer to real-time signals using the
       notation SIGRTMIN+n, and include suitable (run-time) checks that
       SIGRTMIN+n does not exceed SIGRTMAX.

Using moby/sys in runc means that we are going to do what the manpage doesn't recommend.

programs should never refer to real-time signals using hard-coded numbers.

So I think this should be documented and standardized for other OCI runtimes. It clarifies the handling of real-time signals in the spec.

katiewasnothere commented 3 years ago

Don't we already pass SIGRTMIN+n signals as integers to CRI?

kzys commented 3 years ago

Don't we already pass SIGRTMIN+n signals as integers to CRI?

I'm unsure.

containerd's CRI plugin is calling ParseSignal. 4 files (including one test) in containerd are calling the function so far.

https://github.com/containerd/containerd/search?q=ParseSignal

I think, we need to remove these if runc is responsible for the conversion.

cyphar commented 3 years ago

programs should never refer to real-time signals using hard-coded numbers.

The only issue is this recommendation was made by glibc but doesn't handle the case where a glibc process sends a signal to a musl process. So if we wanted to handle this perfectly in runc we would need to try to detect which libc a program is using, which is going to be as close to impossible as you can get when you consider static linking as well as symbol stripping (maybe we could check whether it has @GLIBC_ symbol version names?) and I'm not in the mood to start doing symbol dumps and analysing them from within runc just to figure out whether to use signal 37 or 38. What a mess...

I wonder how the LXC folks handle this... @brauner How do you handle the fact that SIGRTMIN is different based on the libc used by the process? Or is this something we've all collectively ignored because it's such a silly problem that nobody (myself included) thought to check whether this is an incompatibility that exists.

brauner commented 3 years ago

LXC uses real time signals only for systemd, i.e. when the container runs systemd and no specific signal has been specified we will send the relevant real-time signal to systemd. Since systemd doesn't build with musl we don't have issues with libc differences.

We do signal forwarding in e.g. LXD but it is restricted to well-defined signals such as SIGWINCH, SIGUSR{1,2}, SIGINT etc. not real-time signals.

brauner commented 3 years ago

LXC uses real time signals only for systemd, i.e. when the container runs systemd and no specific signal has been specified we will send the relevant real-time signal to systemd. Since systemd doesn't build with musl we don't have issues with libc differences.

Because systemd's shutdown sequence is based on real-time signals I should add.

brauner commented 3 years ago

Also note that SIGRT{MIN,MAX} aren't constants but function calls.

brauner commented 3 years ago

The problem exists there too though in so far as it is possible that glibc inside and outside of a container could technically alter SIGRTMIN and differ. So one thing I talked to Lennart about was expressing it in terms of SIGRTMAX- instead of SIGRTMIN+ because presumably SIGRTMAX is stable but it's not perfect either.