Hi @mprimi,
Thank you for such a detailed analysis, this is very helpful. I created this project:
```yaml
version: "0.5"
log_level: info
log_length: 300
processes:
  pc_log:
    command: "tail -f -n100 process-compose-${USER}.log"
    working_dir: "/tmp"
    shutdown:
      timeout_seconds: 5
  sigterm_resistant:
    command: "trap '' SIGTERM && sleep 20"
    shutdown:
      timeout_seconds: 5
  sleeper:
    command: "sleep 60"
    replicas: 5
    shutdown:
      timeout_seconds: 5
```
Started it about 20 times and stopped it with the `down` command. Each time the project stopped as expected.
This is the script I used:
```sh
for i in {1..20} ; do ./bin/process-compose up -f issues/issue_258/process-compose.yaml -t=false & sleep 1 && ./bin/process-compose down ; done
```
Are there any additional details we might have missed? Maybe there is something specific to your `process-compose.yaml`? Would you be able to publish it or try to repro with mine?
Thank you for looking into this.
I have also been trying to repro with a variant of `TestSystem_TestProcShutDownWithConfiguredTimeOut`, unsuccessfully so far.
Nevertheless, our CI pipelines (1.27 on Linux via nix) get stuck daily on `up` and `down`, blocked on one process that doesn't shut down.
I'll keep trying to see if I can learn more from it and report back.
Here's a snippet of the configuration I'm using:
```yaml
Server_s1:
  namespace: "cluster"
  command: "[...]/server -c ./config/s1.conf"
  log_location: /tmp/s1.log
  working_dir: "."
  availability:
    restart: "always"
  liveness_probe:
    initial_delay_seconds: 5
    period_seconds: 1
    timeout_seconds: 3
    success_threshold: 1
    failure_threshold: 3
    http_get:
      host: 127.0.0.1
      scheme: http
      path: "/"
      port: 18221
  readiness_probe:
    initial_delay_seconds: 5
    period_seconds: 1
    timeout_seconds: 2
    success_threshold: 1
    failure_threshold: 8
    http_get:
      host: 127.0.0.1
      scheme: http
      path: "/health"
      port: 18221
  shutdown:
    timeout_seconds: 5
```
(more servers with the same configuration omitted)
This is very helpful.
I suspect it might be related to the readiness probe not being `Ready` when the `down` command is issued.
I will find time soon to confirm and fix it.
We're seeing a similar issue with Devbox using our `process-compose.yaml` for PostgreSQL:
```yaml
version: "0.5"
processes:
  postgresql:
    command: "pg_ctl start -o \"-k $PGHOST\""
    is_daemon: true
    shutdown:
      command: "pg_ctl stop -m fast"
    availability:
      restart: "always"
    readiness_probe:
      exec:
        command: "pg_isready"
```
It looks like the postgresql process here gets stuck in `Launching`, and then process-compose never exits, similar to what's described above.
Trying different versions, it looks like this bug was introduced sometime after 1.24.2. If I run the process-compose file above with 1.24.2, I get a `Launched` status and everything works fine. On 1.27 and higher, the process gets stuck at `Launching`.
Hi @Lagoja, this was very helpful in pinpointing the issues (yes, there is more than one):
Process Compose skipped termination of a `Launching` daemon. It's an easy issue to fix, and this will prevent PC from getting stuck when trying to stop it while in the `Launching` state.
But why did `pg_ctl` remain in the `Launching` state in the first place?
The root cause is a fix introduced as part of #237.
In this fix, Process Compose waits for `stdout` and `stderr` to finish (`EOF`) before waiting for the process to end.
In this sense, `pg_ctl` behaves differently than other processes. It keeps its stdout open even after the process is terminated but not yet waited on (waiting on it is what collects its status and prevents zombies).
Please notice the timestamps: 5 minutes after the process is "completed" it still sends output to `stdout`.
>ps -aef | grep 80415
eugene 80415 80400 0 14:25 pts/1 00:00:00 [pg_ctl] <defunct> # pg_ctl is a zombie
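A minimal Go sketch of the mechanism (my own illustration, not process-compose code; it assumes `bash` is available): a command that backgrounds a child leaves that child holding the inherited write end of the stdout pipe, so a reader waiting for `EOF` keeps blocking long after the parent has exited and become a zombie.
```go
package main

import (
	"fmt"
	"io"
	"os/exec"
	"time"
)

func main() {
	// The parent shell exits immediately, but the backgrounded sleep inherits
	// the stdout pipe and keeps its write end open for ~10s.
	cmd := exec.Command("bash", "-c", "echo started; sleep 10 & exit 0")
	stdout, _ := cmd.StdoutPipe()
	_ = cmd.Start()

	start := time.Now()
	_, _ = io.ReadAll(stdout) // blocks until ALL writers close the pipe, not just the parent
	fmt.Printf("stdout EOF after %v\n", time.Since(start).Round(time.Second))

	_ = cmd.Wait() // reaps the parent; it exited long before the pipe closed
}
```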
There are two options I can see:

1. Add a configurable timeout for waiting on `std[out,err]` to close, and continue if they didn't close in that given time. Default value 5s if not specified.
2. Change this behavior on the `pg_ctl` side. I searched the `pg_ctl` source code to find if there are any flags that can change this behavior, but didn't find anything helpful.

Personally, I prefer the 1st option, but I will be happy to get feedback from the users.
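A rough sketch of what option 1 could look like (names are made up, not the actual implementation): instead of blocking until stdout/stderr reach `EOF`, give up after a configurable timeout and proceed with the shutdown.
```go
package main

import (
	"fmt"
	"time"
)

// waitForOutputs returns once both output readers signal EOF via done,
// or once the configured timeout elapses, whichever comes first.
func waitForOutputs(done <-chan struct{}, timeout time.Duration) {
	select {
	case <-done:
		fmt.Println("stdout/stderr closed, process fully drained")
	case <-time.After(timeout):
		fmt.Println("gave up waiting for stdout/stderr, continuing shutdown")
	}
}

func main() {
	done := make(chan struct{}) // never closed here, simulating a child that keeps the pipe open
	waitForOutputs(done, 5*time.Second) // hypothetical default of 5s
}
```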
CC: @thenonameguy, @mkenigs
@F1bonacc1, thank you for investigating this.
While it seems likely the problems are related, the issue I originally reported is not exactly what you identified: none of my processes are `daemon` processes.
Since I can reproduce, please let me know of any additional info I can capture.
The configuration I posted above is still current.
Overview:

- I start my processes with `up`
- Some processes exit, and I rely on `process-compose` to bring them back up
- I then issue `down`, which gets stuck forever

While in this state, `process list` also hangs forever.
If I try to `attach`, it shows 0 processes, but it's probably stuck waiting on the daemon to respond.
Looking at `ps`/`pstree` shows that `up` is still running along with one of my processes:
$ ps aux | grep process-compose
myuser 230939 0.1 1.1 1844196 45872 ? Sl 00:08 1:06 process-compose -L /tmp/[...]/process-compose.log --config ./config/process-compose/5-nodes.yml --port 18080 --tui=false up
myuser 243648 0.0 0.9 1622488 38016 ? Sl 04:12 0:00 process-compose --port 18080 down
The `down` command above has been stuck there for hours.
Looking with `pstree` shows 1 of the 5 servers still running:
$ pstree -p -T 230939
process-compose(230939)───my-server(231308)
Here's the tail of process-compose log:
24-11-05 00:08:45.471 INF Restarting Server_s3 in 1 second(s)... Restarts: 1
24-11-05 00:08:46.472 INF Started command=["bash","-c","/tmp/my-server/bin/my-server -c ./config/[...]/s3.conf"] process=Server_s3
24-11-05 00:08:52.495 INF Exited exit_code=-1 process=Server_s2
24-11-05 00:08:52.495 INF Restarting Server_s2 in 1 second(s)... Restarts: 1
24-11-05 00:08:53.497 INF Started command=["bash","-c","/tmp/my-server/bin/my-server -c ./config/[...]/s2.conf"] process=Server_s2
24-11-05 00:09:05.545 INF Exited exit_code=-1 process=Server_s5
24-11-05 00:09:05.545 INF Restarting Server_s5 in 1 second(s)... Restarts: 1
24-11-05 00:09:06.547 INF Started command=["bash","-c","/tmp/my-server/bin/my-server -c ./config/[...]/s5.conf"] process=Server_s5
24-11-05 00:09:13.562 INF Exited exit_code=-1 process=Server_s1
24-11-05 00:09:13.562 INF Restarting Server_s1 in 1 second(s)... Restarts: 1
24-11-05 00:09:14.564 INF Started command=["bash","-c","/tmp/my-server/bin/my-server -c ./config/[...]/s1.conf"] process=Server_s1
24-11-05 00:09:20.598 INF Exited exit_code=-1 process=Server_s2
24-11-05 00:09:20.598 INF Restarting Server_s2 in 1 second(s)... Restarts: 2
24-11-05 00:09:20.878 INF Started command=["bash","-c","/tmp/my-server/bin/my-server -c ./config/[...]/s2.conf"] process=Server_s2
24-11-05 00:09:20.899 INF Exited exit_code=1 process=Server_s3
24-11-05 00:09:20.929 INF Exited exit_code=1 process=Server_s4
24-11-05 00:09:20.936 INF Exited exit_code=1 process=Server_s5
24-11-05 00:09:20.941 INF Exited exit_code=0 process=Proxy
24-11-05 00:09:20.952 INF Exited exit_code=1 process=Server_s1
24-11-05 01:09:00.422 INF Caught terminated - Shutting down the running processes...
Notice processes are dying and getting restarted (as expected).
Very interesting that the one last restarted (`Server_s2`, just 20ms before everything is torn down) is the one still active (i.e. pid 231308 above).
Seems likely that this process was not "ready" when the `down` command was issued.
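If that hypothesis is right, the hang would look like a wait on a "ready"/"started" signal that never fires. A tiny Go illustration (purely speculative on my part, not process-compose internals):
```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ready := make(chan struct{}) // never closed: the freshly restarted process never became ready

	stopped := make(chan struct{})
	go func() {
		<-ready // hypothetical shutdown path: wait for readiness before issuing the stop
		close(stopped)
	}()

	select {
	case <-stopped:
		fmt.Println("process stopped")
	case <-time.After(2 * time.Second):
		fmt.Println("still waiting after 2s - this is what a stuck `down` would look like")
	}
}
```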
I tried to peek at a stack trace (via SIGQUIT), but could not find where stderr went (not in syslog).
Although if you are stumped, I can probably figure out how to capture one.
CC: @thenonameguy, @mkenigs
@F1bonacc1 thanks for the ping and sorry I missed that earlier! Glad to see you already merged a fix.
This is a follow-up to https://github.com/F1bonacc1/process-compose/issues/251 where I was seeing calls to `list` getting stuck. Since then I have added a `shutdown.timeout_seconds` parameter to ensure timely shutdown. However, I am now seeing `down` getting stuck forever, e.g.:

In this configuration, I am running 6 processes, started with the `up` command.
I expected `shutdown.timeout_seconds` (set to 5 seconds) to kick in and KILL the culprit. However, the `down` process has been stuck where it is for the last hour.
Unfortunately not much to go on here...

Current repro (maybe you can repro this yourself):

- Run several processes with `shutdown.timeout_seconds` set for all of them
- `down` getting stuck

This is reproducing for me with v1.27 on Linux.
Happy to capture any more data you think is interesting, or to try other configuration settings.