canonical / pebble

Take control of your internal daemons!
https://canonical-pebble.readthedocs-hosted.com/
GNU General Public License v3.0

Pebble shutting down on error return 0 exit code #324

Closed: deusebio closed this issue 9 months ago

deusebio commented 10 months ago

When a Pebble service that has on-failure: shutdown fails, and Pebble (and the container) shuts down as a result, Pebble's exit code is 0.

This makes it impossible to externally handle failures of processes managed by Pebble, or of the containers themselves (e.g. in k8s).

It would be nice if, when Pebble shuts down because a service process failed, that service's exit code were propagated to Pebble's exit code, so that the failure can be detected and handled externally.

To Reproduce

I created a minimal example, starting from the hello-world application found in the documentation:

name: hello
summary: Hello World
description: The most basic example of a ROCK.
version: "1.0"
license: Apache-2.0

base: ubuntu@22.04
platforms:
  amd64:  # Make sure this value matches your computer's architecture

services:
  # Failing service
  hello:
    override: replace
    command: /bin/bash -c "sleep 5; exit 1"
    startup: enabled
    on-failure: shutdown

parts:
  hello:
    plugin: nil
    stage-packages:
      - hello_bins

After building the image and importing it into Docker using skopeo, I ran the image:

❯ docker run hello
2023-11-04T10:40:01.205Z [pebble] Started daemon.
2023-11-04T10:40:01.207Z [pebble] POST /v1/services 1.617211ms 202
2023-11-04T10:40:01.207Z [pebble] Started default services with change 1.
2023-11-04T10:40:01.208Z [pebble] Service "hello" starting: /bin/bash -c "sleep 5; exit 1"
2023-11-04T10:40:06.216Z [pebble] Service "hello" stopped unexpectedly with code 1
2023-11-04T10:40:06.216Z [pebble] Service "hello" on-failure action is "shutdown", triggering server exit
2023-11-04T10:40:06.216Z [pebble] Server exiting!

❯ echo $?
0

benhoyt commented 10 months ago

Per yesterday's discussion, this is by design as the "shutdown" action is specified, so Pebble is doing what it was told and hence using the zero exit code. We probably wouldn't wire the exit code through directly, as there are (or may be) multiple services involved.

We'll definitely consider adding a way to have Pebble exit with a nonzero exit code in this case, but we'll need to have a (very short) spec and discuss first. My initial suggestion is to have a new "action", shutdown-nonzero (we can bikeshed on the naming).

In the meantime, to fix the customer issue, you can use the technique we looked at yesterday, where the ROCK entrypoint script does kill -9 1 (sends SIGKILL to Pebble) if the process being run exits with a nonzero exit code. @cjdcordeiro will post the code for that here for the record.

cjdcordeiro commented 10 months ago

Such a workaround (using kill) is proposed in https://github.com/canonical/charmed-spark-rock/pull/53/files

deusebio commented 9 months ago

Just a small update: sending SIGKILL to the process with PID 1 won't work, so the advised workaround is to send another signal (e.g. SIGHUP), as done here
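
For the record, a minimal sketch of what such a wrapper could look like. This is not the code from the linked PR; the function name and the SUPERVISOR_PID and FAILURE_SIGNAL variables are hypothetical and exist only so the sketch can be tried against something other than PID 1. It assumes Pebble runs as PID 1 in the container and that SIGHUP is an acceptable signal, as suggested above.

```shell
#!/bin/sh
# Hypothetical wrapper around a service command (names are illustrative,
# not Pebble options). If the command fails, signal the supervisor.
# Defaults assume Pebble is PID 1 and that SIGHUP is the chosen signal.

run_and_signal_on_failure() {
    supervisor_pid="${SUPERVISOR_PID:-1}"
    failure_signal="${FAILURE_SIGNAL:-HUP}"

    # Run the wrapped command and remember its exit code.
    "$@"
    status=$?

    if [ "$status" -ne 0 ]; then
        echo "command exited with code $status; sending SIG$failure_signal to PID $supervisor_pid" >&2
        kill -s "$failure_signal" "$supervisor_pid"
    fi

    # Hand the original exit code back to the caller.
    return "$status"
}
```

In the rock above, the service command would then be something like run_and_signal_on_failure /bin/bash -c "sleep 5; exit 1", called from the entrypoint script that defines the function.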

benhoyt commented 9 months ago

For reference, @deusebio created spec DA062 to brainstorm the solution for this. After much discussion, we've settled on updating on-failure: shutdown to return a nonzero code (specifically 10 meaning "service failed"), and adding options for on-failure: success-shutdown and on-success: failure-shutdown when you want the opposite of the default meaning.
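
With this change, an external caller can branch on Pebble's exit code. A rough sketch of how that might look on the caller's side follows; describe_pebble_exit is a hypothetical helper, and only the codes 0 and 10 are taken from this thread, so anything else is reported as unknown.

```shell
#!/bin/sh
# Hypothetical helper: turn Pebble's exit code into a short description.
# Only 0 and 10 are taken from the discussion above; other codes are
# passed through without interpretation.

describe_pebble_exit() {
    case "$1" in
        0)  echo "success" ;;
        10) echo "service failed" ;;
        *)  echo "unknown exit code: $1" ;;
    esac
}
```

For example, after docker run hello, calling describe_pebble_exit "$?" lets a script decide whether to restart the container or raise an alert.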

benhoyt commented 9 months ago

This is included in v1.6.0 which I just released: https://github.com/canonical/pebble/releases/tag/v1.6.0

Though note there seem to be some failures building the Snap for non-amd64 architectures @cjdcordeiro