Stuck forever in `down` command (despite `shutdown.timeout_seconds`)

mprimi commented 1 month ago

This is a follow-up to https://github.com/F1bonacc1/process-compose/issues/251 where i was seeing call to list getting stuck. Since then I have added a shutdown.timeout_seconds parameter to ensure timely shutdown. However I am now seeing down getting stuck forever, e.g..:

$ ps aux | grep process-compose
myuser  480743  0.1  1.1 1917956 44684 pts/1   Sl+  15:48   0:13 process-compose -L /[...]/process-compose.log --config ./config/[...].yml --port 18080 --tui=false up
myuser  481605  0.0  0.9 1622260 37120 pts/1   Sl+  15:53   0:00 process-compose --port 18080 down

In this configuration, i am running 6 processes.

5 of them shut down without issue
1 of them is still alive under that up command.

I expected shutdown.timeout_seconds (set to 5 seconds) to kick in and KILL the culprit. However the down process has been stuck where it is for the last hour.

$ sudo cat /proc/480743/stack
[<0>] futex_wait_queue+0x67/0xa0
[<0>] futex_wait+0x151/0x270
[<0>] do_futex+0x178/0x230
[<0>] __x64_sys_futex+0x78/0x1f0
[<0>] x64_sys_call+0x1c47/0x2570
[<0>] do_syscall_64+0x56/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x73/0xdd

 sudo $(which gdb) -batch -ex bt $(which process-compose) 480743
[New LWP 481115]
[New LWP 480892]
[New LWP 480751]
[New LWP 480750]
[New LWP 480749]
[New LWP 480748]
[New LWP 480747]
[New LWP 480746]
[New LWP 480745]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/nix/store/3dyw8dzj9ab4m8hv5dpyx7zii8d0w6fi-glibc-2.39-52/lib/libthread_db.so.1".
0x0000000000476c43 in ?? ()
#0  0x0000000000476c43 in ?? ()
#1  0x000000000043b830 in ?? ()
#2  0x0000000001cc2020 in ?? ()
#3  0x0000000000000080 in ?? ()
#4  0x0000000000000000 in ?? ()
[Inferior 1 (process 480743) detached]

Unfortunately not much to go on here...

Current repro (maybe you can repro this yourself):

Create configuration with multiple processes
Set shutdown.timeout_seconds for all of them
Modify one of the processes to ignore SIGTERM (e.g. force the shutdown to timeout and issue a SIGKILL)
Start-stop a few times and possibly observe down getting stuck

This is reproducing for me with v1.27 on Linux.

Happy to capture any more data you think is interesting, or to try other configuration settings.

F1bonacc1 commented 1 month ago

Hi @mprimi,

Thank you for such a detailed analysis, this is very helpful. I created this project:

version: "0.5"
log_level: info
log_length: 300

processes:
  pc_log:
    command: "tail -f -n100 process-compose-${USER}.log"
    working_dir: "/tmp"
    shutdown:
      timeout_seconds: 5

  sigterm_resistant:
    command: "trap '' SIGTERM && sleep 20"
    shutdown:
      timeout_seconds: 5

  sleeper:
    command: "sleep 60"
    replicas: 5
    shutdown:
      timeout_seconds: 5

Started it about 20 times and stopped it with the down command. Each time the project stopped as expected. This is the script I used:

for i in {1..20} ; do ./bin/process-compose up -f issues/issue_258/process-compose.yaml -t=false & sleep 1 && ./bin/process-compose down ; done

Are there any additional details we might have missed? Maybe there is something specific to your process-compose.yaml? Would you be able to publish it or try to repro with mine?

mprimi commented 1 month ago

Thank you for looking into this.

I also have been trying to repro with a variant of TestSystem_TestProcShutDownWithConfiguredTimeOut, unsuccessfully so far.

Nevertheless our CI pipes (1.27 on Linux via nix) get stuck daily on up and down blocked on one process that doesn't shut down.

I'll keep trying to see if i can learn more from it and report back.

Here's a snip of the configuration I'm using:

  Server_s1:
    namespace: "cluster"
    command: "[...]/server -c ./config/s1.conf"
    log_location: /tmp/s1.log
    working_dir: "."
    availability:
      restart: "always"
    liveness_probe:
      initial_delay_seconds: 5
      period_seconds: 1
      timeout_seconds: 3
      success_threshold: 1
      failure_threshold: 3
      http_get:
        host: 127.0.0.1
        scheme: http
        path: "/"
        port: 18221
    readiness_probe:
      initial_delay_seconds: 5
      period_seconds: 1
      timeout_seconds: 2
      success_threshold: 1
      failure_threshold: 8
      http_get:
        host: 127.0.0.1
        scheme: http
        path: "/health"
        port: 18221
    shutdown:
      timeout_seconds: 5

(more servers with the same configuration omitted)

F1bonacc1 commented 1 month ago

This is very helpful. I suspect it might be related to the readiness probe not being Ready when the down command is issued. I will find time soon to confirm and fix it.

Lagoja commented 1 month ago

We're seeing a similar issue with Devbox using our process-compose.yaml for Postgresql:

version: "0.5"

processes:
  postgresql:
    command: "pg_ctl start -o \"-k $PGHOST\""
    is_daemon: true
    shutdown:
      command: "pg_ctl stop -m fast"
    availability:
      restart: "always"
    readiness_probe:
      exec:
        command: "pg_isready"

It looks like the progresql process here gets stuck in Launching, and then process-compose never exits, similar to what's described above.

Trying different versions, it looks like this bug was introduced sometime after 1.24.2 -- If I run the process-compose file above with 1.24.2, I get a Launched status and everything works fine. On 1.27 and higher, the process gets stuck at Launching

F1bonacc1 commented 1 month ago

Hi @Lagoja, This was very helpful in pinpointing the issues (yes there is more than one):

Issue 1

Process Compose skipped termination of a Launching daemon. It's an easy issue to fix and this will prevent PC from getting stuck when trying to stop it while in the Launching state.

Issue 2

But why did pc_ctl remain in the Launching state in the first place? The root cause is a fix introduced as part of #237. In this fix, Process Compose waits for stdout and stderr to finish (EOF) before waiting for the process to end. In this sense, pc_ctl behaves differently than other processes. It keeps its stdout open even after the process is terminated but is not waited on (this is what collects its status to prevent zombies). Please notice the timestamps: 5 minutes after the process is "completed" it still sends output to stdout.

>ps -aef | grep 80415
eugene     80415   80400  0 14:25 pts/1    00:00:00 [pg_ctl] <defunct> # pg_ctl is a zombie

Possible Fixes

I can add a new timeout variable for daemon starters that will wait for std[out,err] to close and continue if they didn't in that given time. Default value 5s if not specified.
Revert to the previous logic: don't wait for stdout to close, but sleep. I can make it longer, to reduce the (very low) risk of missing log lines.
I looked at the pg_ctl source code to find if there are any flags that can change this behavior but didn't find anything helpful.

Personally, I prefer the 1st option, but I will be happy to get feedback from the users.

CC: @thenonameguy, @mkenigs

mprimi commented 2 weeks ago

@F1bonacc1, thank you for investigating this.

While it seems likely the problems are related, the issue I originally reported is not exactly what you identified:

I am not using daemon processes
The process I'm shutting down is not in a zombie state

Since I can reproduce, please let me know of any additional info I can capture.

The configuration I posted above is still current.

Overview:

Launch 5+1 processes with up
Some of my processes get killed or die, that's expected. I rely on process-compose to bring them back up
At some point I issue a down which gets stuck forever

While in this state, process list also hangs forever. If I try to attach, it shows 0 processes, but it's probably stuck waiting on the daemon to respond. Looking at ps/pstree shows that up is still running along with one of my processes:

$  ps aux | grep process-compose
myuser  230939  0.1  1.1 1844196 45872 ?       Sl   00:08   1:06 process-compose -L /tmp/[...]/process-compose.log --config ./config/process-compose/5-nodes.yml --port 18080 --tui=false up
myuser  243648  0.0  0.9 1622488 38016 ?       Sl   04:12   0:00 process-compose --port 18080 down

The down command above has been stuck there for hours.

Looking with pstree shows 1 of the 5 servers still running:

$ pstree -p -T 230939
process-compose(230939)───my-server(231308)

Here's the tail of process-compose log:

24-11-05 00:08:45.471 INF Restarting Server_s3 in 1 second(s)... Restarts: 1
24-11-05 00:08:46.472 INF Started command=["bash","-c","/tmp/my-server/bin/my-server -c ./config/[...]/s3.conf"] process=Server_s3
24-11-05 00:08:52.495 INF Exited exit_code=-1 process=Server_s2
24-11-05 00:08:52.495 INF Restarting Server_s2 in 1 second(s)... Restarts: 1
24-11-05 00:08:53.497 INF Started command=["bash","-c","/tmp/my-server/bin/my-server -c ./config/[...]/s2.conf"] process=Server_s2
24-11-05 00:09:05.545 INF Exited exit_code=-1 process=Server_s5
24-11-05 00:09:05.545 INF Restarting Server_s5 in 1 second(s)... Restarts: 1
24-11-05 00:09:06.547 INF Started command=["bash","-c","/tmp/my-server/bin/my-server -c ./config/[...]/s5.conf"] process=Server_s5
24-11-05 00:09:13.562 INF Exited exit_code=-1 process=Server_s1
24-11-05 00:09:13.562 INF Restarting Server_s1 in 1 second(s)... Restarts: 1
24-11-05 00:09:14.564 INF Started command=["bash","-c","/tmp/my-server/bin/my-server -c ./config/[...]/s1.conf"] process=Server_s1
24-11-05 00:09:20.598 INF Exited exit_code=-1 process=Server_s2
24-11-05 00:09:20.598 INF Restarting Server_s2 in 1 second(s)... Restarts: 2
24-11-05 00:09:20.878 INF Started command=["bash","-c","/tmp/my-server/bin/my-server -c ./config/[...]/s2.conf"] process=Server_s2
24-11-05 00:09:20.899 INF Exited exit_code=1 process=Server_s3
24-11-05 00:09:20.929 INF Exited exit_code=1 process=Server_s4
24-11-05 00:09:20.936 INF Exited exit_code=1 process=Server_s5
24-11-05 00:09:20.941 INF Exited exit_code=0 process=Proxy
24-11-05 00:09:20.952 INF Exited exit_code=1 process=Server_s1
24-11-05 01:09:00.422 INF Caught terminated - Shutting down the running processes...

Notice processes are dying and getting restarted (as expected).

Very interesting that the one last restarted (Server_s2 just 20ms before everything is torn down) is the one still active (i.e. pid 231308 above).

Seems likely that this process was not "ready" when the down command was issued.

I tried to peek a the a stack trace (via SIGQUIT), but could not find where stderr went (not in syslog). Although if you are stumped I can probably figure out how to capture one.

mprimi commented 2 weeks ago

Confirming issue is fixed in v1.40.0. Thank you @F1bonacc1

mkenigs commented 1 week ago

CC: @thenonameguy, @mkenigs

@F1bonacc1 thanks for the ping and sorry I missed that earlier! Glad to see you already merged a fix

F1bonacc1 / process-compose