canonical / pebble

Pebble is a lightweight Linux service manager with layered configuration and an HTTP API.
https://canonical-pebble.readthedocs-hosted.com/
GNU General Public License v3.0

Pebble service is not restarted on `on-check-failure` if the OS process was killed #149

Closed: marceloneppel closed this issue 1 year ago

marceloneppel commented 2 years ago

Issue

If the OS process behind a Pebble service is killed (for example with `pkill --signal SIGKILL -f process_name`) and that service has a health check that instructs Pebble to restart it on a check failure (the `on-check-failure` field from the Pebble layer spec), Pebble tries to restart the process but fails with a `no such process` error. The number of check failures keeps growing and Pebble doesn't try to, for example, simply start the service, even though there is no old OS process for that service to stop before starting it again.
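
For illustration, a layer of roughly this shape sets up the behaviour being described (the service name, check name, command, and URL are placeholders, not the actual charm configuration):

services:
  svc1:
    override: replace
    command: /usr/local/bin/my-service
    startup: enabled
    on-check-failure:
      liveness: restart   # restart svc1 whenever the "liveness" check fails

checks:
  liveness:
    override: replace
    period: 10s
    threshold: 3
    http:
      url: http://localhost:8081/health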

There are more details in an example repo that I used to test this behaviour. I added some more description and logs comparing what happens with Pebble and with another tool (supervisor, in this case) when it is used to start the service and the OS process is later killed.

Is there a way to solve this issue (making the service restart correctly using the Pebble checks) with other fields of the Pebble layer? I found a workaround that I implemented in my charm code, but I would like to confirm whether this is a bug or not.

benhoyt commented 2 years ago

Hi @marceloneppel, thanks for the report. This is a strange one. The health check shouldn't actually be needed here, as the default behaviour for on-failure is restart. So as soon as SIGKILL is sent to the process you should see something like this in the Pebble log:

2022-09-21T22:57:05.707Z [pebble] DEBUG Reaper received SIGCHLD.
2022-09-21T22:57:05.707Z [pebble] DEBUG Reaped PID 113973 which exited with code 137.
2022-09-21T22:57:05.707Z [pebble] DEBUG Service "svc1" exited with code 137.
2022-09-21T22:57:05.707Z [pebble] Service "patroni" stopped unexpectedly with code 137
2022-09-21T22:57:05.707Z [pebble] Service "patroni" on-failure action is "restart", waiting ~500ms before restart (backoff 1)

Note that the DEBUG lines will only be visible if you start Pebble with the PEBBLE_DEBUG=1 environment variable set, but the others should show up in any case. The fact that this isn't showing up in your logs indicates that maybe the /usr/local/bin/patroni script is doing something funky -- maybe it's forking or going into daemon mode and confusing Pebble's child reaper?

Can you try a couple of things for me? In particular, could you swap the service command for a trivial script, such as this simple HTTP server:

import http.server
import socketserver

PORT = 8081

Handler = http.server.SimpleHTTPRequestHandler

with socketserver.TCPServer(("", PORT), Handler) as httpd:
    print("serving at port", PORT)
    httpd.serve_forever()

And let me know whether you still see the same behaviour then.

If not, it's something Patroni-specific, and we can dig into your repro case with Patroni a bit deeper.

marceloneppel commented 2 years ago

Hi @benhoyt! Thanks for all this information.

I checked the things that you mentioned.

So, in summary, it is something specific to Patroni. I am going to gather some more information and maybe create a simpler script that behaves similarly to Patroni (as you suggested, perhaps running as a daemon or forking), so we can better understand what is happening in that case.

benhoyt commented 2 years ago

Were there any other logs after that last line? From following the code through: after DEBUG Reaped PID 18 ... I would have expected to see DEBUG Service "patroni" exited with code 137. and other logs.

marceloneppel commented 2 years ago

> Were there any other logs after that last line? From following the code through: after DEBUG Reaped PID 18 ... I would have expected to see DEBUG Service "patroni" exited with code 137. and other logs.

No, unfortunately there were no more logs. I waited for ten minutes and checked the logs again. The last line in the logs is DEBUG Reaped PID 18 which exited with code 137.

marceloneppel commented 2 years ago

@benhoyt, I managed to create a simple script that shows the same behaviour as Patroni: when it is killed, Pebble doesn't restart it.

It is a script that starts another process using subprocess:

import subprocess
import time

subprocess.Popen(["/usr/bin/python3", "/home/app/script2.py"])

# Just a loop to keep the program running.
while True:
    time.sleep(30)

The other script that is started (script2.py) can be anything (in my case I used the TCP server code that you shared with me).

After killing the first script's process, I get the following logs:

2022-09-22T22:49:39.020Z [pebble] DEBUG socket "/home/app/pebble/.pebble.socket" was not activated; listening
2022-09-22T22:49:39.020Z [pebble] DEBUG Reaper started, waiting for SIGCHLD.
2022-09-22T22:49:39.020Z [pebble] DEBUG socket "/home/app/pebble/.pebble.socket.untrusted" was not activated; listening
2022-09-22T22:49:39.020Z [pebble] Started daemon.
2022-09-22T22:49:39.020Z [pebble] DEBUG activation done in 6ms
2022-09-22T22:49:39.021Z [pebble] DEBUG Configuring check manager (stopping 0, starting 0)
2022-09-22T22:49:39.024Z [pebble] DEBUG Running task 1 on Do: Start service "app"
2022-09-22T22:49:39.024Z [pebble] DEBUG pid=1;uid=0;socket=/home/app/pebble/.pebble.socket; POST /v1/services 2.660573ms 202
2022-09-22T22:49:39.024Z [pebble] POST /v1/services 2.660573ms 202
2022-09-22T22:49:39.024Z [pebble] Started default services with change 1.
2022-09-22T22:49:39.025Z [pebble] Service "app" starting: /usr/bin/python3 /home/app/script1.py
2022-09-22T22:49:39.026Z [pebble] DEBUG Service "app" started with PID 15
2022-09-22T22:49:39.026Z [pebble] DEBUG Service "app" transitioning to state "starting"
2022-09-22T22:49:40.027Z [pebble] DEBUG Service "app" transitioning to state "running"
2022-09-22T22:50:09.026Z [pebble] DEBUG Service "app" backoff reset elapsed, resetting backoff state (was 0: 0s)
2022-09-22T22:50:14.557Z [pebble] DEBUG Reaper received SIGCHLD.
2022-09-22T22:50:14.557Z [pebble] DEBUG Reaped PID 15 which exited with code 137.

You can check all the files here.

benhoyt commented 1 year ago

Thanks for the minimal repro -- I can repro this easily locally now. It will probably happen for any service that itself starts up a child process. What's happening is Pebble's "process reaper" detects that the parent PID exited, and then it calls Go's os/exec's cmd.Wait() to clean up ... but that waits for all the child processes (I guess the process group) to terminate.

So if you just send SIGKILL to the parent process, Pebble won't detect that it died ... until all the child processes die or are killed too.

You note that it's different from supervisord (I haven't tested systemd). I could be wrong, but this seems to me a feature, not a bug: it wouldn't be ideal to restart the parent process while child processes are still running. In the case of Patroni, it would mean Pebble would try to restart Patroni, which would restart a bunch of Postgres processes ... but Postgres is already running.

Is the right "fix" here for you to kill the process group instead of just the Patroni process? You can do this with kill -9 -54321, where 54321 is the process group ID, which in Linux is the PID of the parent process. I'm not sure why you're killing the process -- maybe you could give a bit more context -- but can you kill the process group instead, or do a more graceful stop or even use pebble stop?

Note that Pebble itself uses SIGTERM/SIGKILL on the process group when it executes a stop operation.
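
For illustration only (a sketch, not Pebble's actual code), the kill -9 -54321 suggestion maps onto a few lines of Go, since kill(2) treats a negative PID as a process group ID:

package main

import (
    "fmt"
    "os"
    "strconv"
    "syscall"
)

// Equivalent of "kill -9 -<pgid>": a negative PID passed to kill(2)
// signals every process in that process group.
func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: killgroup <pgid>")
        os.Exit(1)
    }
    pgid, err := strconv.Atoi(os.Args[1])
    if err != nil {
        fmt.Fprintln(os.Stderr, "invalid pgid:", err)
        os.Exit(1)
    }
    if err := syscall.Kill(-pgid, syscall.SIGKILL); err != nil {
        fmt.Fprintln(os.Stderr, "kill failed:", err)
        os.Exit(1)
    }
}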

benhoyt commented 1 year ago

I've discussed this with @hpidcock, and he noted the systemd docs for KillMode and ExecStop -- it's a bit tricky to piece together, but they imply that in the case where the parent service is SIGKILL'd, systemd will send the SIGTERM...SIGKILL sequence to the remaining processes in the process group. This is when KillMode is set to control-group, the default, and what Pebble is trying to emulate.

I've verified this with a little test systemd service:

[Unit]
Description=Testy service

[Service]
# script.py spawns script2.py
ExecStart=/usr/bin/python3 /home/ben/w/pebble/script.py
Type=simple

When you kill the parent process with SIGTERM or SIGKILL, systemd sends SIGTERM to the child process too.

So between supervisor and systemd, that definitely seems like the right default behaviour. We'll need to fix this -- updating the service manager code to kick off the terminating state when it detects that child processes are still running.
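
As a purely illustrative sketch (not how Pebble implements it), detecting whether anything in a process group is still alive can be done with signal 0, which delivers nothing but still performs kill(2)'s existence and permission checks:

package main

import (
    "fmt"
    "syscall"
)

// groupHasLiveProcesses reports whether any process in the given process
// group still exists: a nil error (or EPERM) from kill(-pgid, 0) means at
// least one member of the group remains.
func groupHasLiveProcesses(pgid int) bool {
    err := syscall.Kill(-pgid, 0)
    return err == nil || err == syscall.EPERM
}

func main() {
    // Our own process group certainly has a live member: this process.
    fmt.Println(groupHasLiveProcesses(syscall.Getpgrp()))
}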

In the meantime, I'd still love to hear more context about your use case here.

marceloneppel commented 1 year ago

Hi Ben! Thanks again for all this information about Pebble. It is great for me to understand some more aspects of it.

I haven't tried killing the Patroni process group yet, but I can test it here. Besides that, it's nice that this could be tested on systemd to validate the behaviour.

Sorry, I forgot to give more details about my use case. It isn't related to a feature that I need to implement; it is more about a shakedown test that we need to implement in the different database charmed operators that we are developing on the Data Platform team.

In my case, I need to test that the PostgreSQL charmed operator recovers correctly from situations such as the database process being killed, or frozen and later unfrozen, and other situations that cause some disruption to the service (for that I am currently using SIGKILL, SIGSTOP, SIGCONT and other signals, on both the Patroni process - the parent process - and the PostgreSQL process - the child process). The end goal is to verify that a cluster of multiple units keeps working and that the failing unit recovers itself (becoming part of the cluster again).

So, in summary, the situation that I am trying to test is a kind of chaos scenario, to ensure that the charm (and Pebble) manages the workload correctly.

taurus-forever commented 1 year ago

Hi @benhoyt and @hpidcock, nice to meet you!

> In the meantime, I'd still love to hear more context about your use case.

Let me summarize our use case:

As you confirmed, the current Pebble behaviour does not fully replicate the default systemd/supervisord logic. Can you share an estimate of the possible fix date/release version (for internal planning)?

Meanwhile, @marceloneppel is testing the workaround of sending a KILL signal to the whole group instead of a single process. Still, to sleep well we will need to be sure that Pebble handles the situation where Patroni is killed (by a user, the OOM killer, ...). Unfortunately, we cannot prevent users from killing only the "parent" process.

Thank you in advance!

benhoyt commented 1 year ago

Thanks for the additional context @taurus-forever. I do hope to fix this at some point; however, for the next (few?) weeks we're finishing/fixing/testing Juju 3.0, so it'll almost certainly be after that.

marceloneppel commented 1 year ago

Hi @benhoyt and @taurus-forever!

Thanks for the updates. I tested killing the process group, but it still doesn't make Pebble restart the service. I researched why and figured out that the command (pg_ctl) that Patroni uses to start the child process (PostgreSQL) forks and calls setsid, so the child process ends up in a different process group. Sorry, I missed that information when I created the first minimal repo.

With the three scripts below we can check that the child process ends up in a different group:

script1.py

import subprocess
import time

subprocess.call(["/usr/bin/python3", "/home/app/script2.py"])

# Just a loop to keep the program running.
while True:
    time.sleep(30)

script2.py

import os
import subprocess

pid = os.fork()

# Pid equal to 0 represents the created child process.
if pid == 0:
    os.setsid()
    subprocess.Popen(["/usr/bin/python3", "/home/app/script3.py"])

script3.py

import http.server
import socketserver

PORT = 8081

Handler = http.server.SimpleHTTPRequestHandler

with socketserver.TCPServer(("", PORT), Handler) as httpd:
    print("serving at port", PORT)
    httpd.serve_forever()

Process listing:

root@app-0:/# ps -e -o pid,ppid,pgid,cmd
    PID    PPID    PGID CMD
      1       0       1 /usr/bin/pebble run
     14       1      14 /usr/bin/python3 /home/app/script1.py
     17       1      16 /usr/bin/python3 /home/app/script3.py
     19       0      19 bash
     30      19      30 ps -e -o pid,ppid,pgid,cmd

So it seems to be the case you mentioned at the beginning: Patroni is forking (and also moving the new process to a new group).

From what I checked, supervisord does restart the service, but it keeps the original script3.py process running (which may itself be a problem in some cases).

benhoyt commented 1 year ago

Just for the record, some further reading on this subject that I came across today: "The Unix process API is unreliable and unsafe" and the author's supervise project, plus the Hacker News comments and another related article. There is also a comment from someone who works on the Bazel build system hitting the exact same issue.

benhoyt commented 1 year ago

For the record, we plan to address this by using cgroups as part of work later in this cycle.

marceloneppel commented 1 year ago

Thanks for the update @benhoyt!

benhoyt commented 1 year ago

@marceloneppel Okay, I've come in a weird circle on this one. I had actually misdiagnosed the original problem with the Patroni issue you present above, where Patroni wasn't restarting properly even after a KILL signal was sent to it.

I thought this problem was the issue with Pebble not tracking the process tree of processes that daemonise and change their process group (which is still an issue, but is not causing this problem). But after actually diving into your patroni-restart-test repro in more detail, I found that Pebble wasn't marking the Patroni process as having exited at all, due to being blocked on cmd.Wait(). Patroni starts sub-processes and "forwards" stdin/out/err, so the copy goroutines block. Thankfully Go 1.20 introduced WaitDelay to allow you to easily work around this exact problem, and I've added this in https://github.com/canonical/pebble/pull/275.

In my tests with Patroni that fixes the problem nicely: when you send KILL to the Patroni process, it uses WaitDelay to wait up to 5 seconds and then exits (even though the os/exec copying goroutines are stuck). This causes Pebble to restart Patroni, which sees that Postgres is already running, and keeps humming along nicely.
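
To make the mechanism concrete, here is a small standalone sketch (not Pebble's actual code) of the behaviour described above. The shell exits immediately, but a backgrounded sleep inherits the stdout pipe and keeps it open, much like a daemonising child; without WaitDelay, cmd.Wait() would block until the sleep finished, while with WaitDelay it returns shortly after the parent exits:

package main

import (
    "bytes"
    "fmt"
    "os/exec"
    "time"
)

func main() {
    var out bytes.Buffer

    // The shell exits right away, but the backgrounded sleep inherits the
    // stdout pipe and keeps its write end open.
    cmd := exec.Command("sh", "-c", "echo started; sleep 60 &")
    cmd.Stdout = &out // a non-*os.File writer, so os/exec copies via a pipe

    // Go 1.20+: give up on the I/O pipes 2 seconds after the process itself
    // has exited, instead of blocking until the pipes are closed.
    cmd.WaitDelay = 2 * time.Second

    start := time.Now()
    if err := cmd.Start(); err != nil {
        panic(err)
    }
    err := cmd.Wait() // returns exec.ErrWaitDelay once the delay expires
    fmt.Printf("Wait returned after %v: err=%v, output=%q\n",
        time.Since(start).Round(time.Second), err, out.String())
}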

The reason I came in a weird circle with this is that I started down the track of "we need to use cgroups to solve this" too early. I even had a rough plan to do so, but it turns out that using cgroups in containers is not possible, or at least not trivial, in many cases, so that came to a head and I started looking at other ways to solve the original problem, hence circling back around. I should have noticed this much earlier, sorry!

We may still improve Pebble's process tree handling, but I believe this takes the pressure off, and we may be able to get away without doing that.

In any case, my one-liner WaitDelay fix seems to solve the Patroni issue and similar issues. But I'd appreciate your review and testing with Patroni and your charms that were originally affected. It might be tricky to test using my Pebble branch, but I can give you instructions on how to do that if you want (a replace directive in Juju's go.mod and then building Juju manually).

marceloneppel commented 1 year ago

> Okay, I've come in a weird circle on this one. [...] In any case, my one-liner WaitDelay fix seems to solve the Patroni issue and similar issues. But I'd appreciate your review and testing with Patroni and your charms that were originally affected.

Hi @benhoyt! No worries, and thank you for the effort on this! I'm going to check it with the charm this week to confirm whether it works, and I'll keep you posted.

Thanks again!

marceloneppel commented 1 year ago

Thanks so much @benhoyt! I double checked with the charm and the issue was fixed.

To test, I added the following replace directive to Juju's go.mod file:

replace github.com/canonical/pebble => github.com/canonical/pebble v1.2.1-0.20230818120001-fd34da17bd83