arminbiere / runlim

Other
12 stars 4 forks source link

Does not work correctly when the shell script terminates quick but subprocesses keep running #7

Closed msoos closed 5 months ago

msoos commented 9 months ago

I just found a bug I think. Shell script:

#!/bin/bash
./cryptominisat complicated.cnf &
./cryptominisat complicated.cnf &
./cryptominisat complicated.cnf &
./cryptominisat complicated.cnf &

Then run as:

./runlim -t 10 ./myshell.sh

This will make runlim terminate immediately as if all was good. However, the 4 cryptominisat processes will keep running. I THINK this may be because it runs for less than 100ms and so runlim doesn't even do a single child check? Anyway, if I remove the last & then it correctly identifies all 4 subprocesses. Otherwise, all seems "fine and dandy" except that cryptominisat will keep running forever in 4 copies.

Output:

 ~/development/runlim  orig ?2  ./runlim myshell.sh                                                        ✔  4s 
[runlim] version:               2.0.0rc9
[runlim] host:                  tiresias
[runlim] time limit:            311040000 seconds
[runlim] real time limit:       311040000 seconds
[runlim] space limit:           64222 MB
[runlim] argv[0]:               myshell.sh
[runlim] start:                 Wed Nov  8 02:23:19 2023
[runlim] child:                 90605
[runlim] end:                   Wed Nov  8 02:23:19 2023
[runlim] status:                ok
[runlim] result:                0
[runlim] children:              0
[runlim] processes:             93
[runlim] real:                  0.01 seconds
[runlim] time:                  0.00 seconds
[runlim] space:                 0 MB
[runlim] load:                  0.00 maximum
[runlim] samples:               0

But after this successful output we have:

 ~/development/runlim  orig ?2  ps                                                                              ✔ 
    PID TTY          TIME CMD
  79593 pts/5    00:00:08 zsh
  79597 pts/5    00:00:00 zsh
  79664 pts/5    00:00:00 zsh
  79666 pts/5    00:00:00 zsh
  79667 pts/5    00:00:08 gitstatusd-linu
  90606 pts/5    00:00:19 cryptominisat5
  90607 pts/5    00:00:19 cryptominisat5
  90608 pts/5    00:00:19 cryptominisat5
  90609 pts/5    00:00:19 cryptominisat5
  90622 pts/5    00:00:00 ps

And top is showing CryptoMiniSat running.

I would consider this a major bug, I think?

arminbiere commented 9 months ago

Thanks ... for looking into that. I can confirm the issue:

$ cat two-zombies.sh 
#!/bin/sh
kissat test/cnf/hard.cnf -q &
kissat test/cnf/hard.cnf -q &
$ runlim  -t 1 ./two-zombies.sh 
[runlim] version:       2.0.0rc9
[runlim] host:          priel
[runlim] time limit:        1 seconds
[runlim] real time limit:   311040000 seconds
[runlim] space limit:       64075 MB
[runlim] argv[0]:       ./two-zombies.sh
[runlim] start:         Wed Nov  8 15:04:03 2023
[runlim] child:         157311
[runlim] end:           Wed Nov  8 15:04:03 2023
[runlim] status:        ok
[runlim] result:        0
[runlim] children:      0
[runlim] processes:     106
[runlim] real:          0.01 seconds
[runlim] time:          0.00 seconds
[runlim] space:         0 MB
[runlim] load:          0.00 maximum
[runlim] samples:       0
$ ps|grep kissat|grep -v grep
 157312 pts/4    00:00:02 kissat
 157313 pts/4    00:00:02 kissat
$ killall kissat
arminbiere commented 9 months ago

It works for this one as reported:

$ cat one-background-job-but-do-not-wait.sh 
#!/bin/sh
kissat test/cnf/hard.cnf -q &
kissat test/cnf/hard.cnf -q
$ runlim -t 1 ./one-background-job-but-do-not-wait.sh 
[runlim] version:       2.0.0rc9
[runlim] host:          priel
[runlim] time limit:        1 seconds
[runlim] real time limit:   311040000 seconds
[runlim] space limit:       64075 MB
[runlim] argv[0]:       ./one-background-job-but-do-not-wait.sh
[runlim] start:         Wed Nov  8 15:08:19 2023
[runlim] child:         157371
[runlim] end:           Wed Nov  8 15:08:21 2023
[runlim] status:        out of time
[runlim] result:        2
[runlim] children:      3
[runlim] processes:     107
[runlim] real:          1.73 seconds
[runlim] time:          1.20 seconds
[runlim] space:         19 MB
[runlim] load:          0.34 maximum
[runlim] samples:       6
$ ps|grep kissat|grep -v grep
$ 
arminbiere commented 9 months ago

As well as the shell script kept on waiting (does not produce zombies):

$ cat waiting.sh 
#!/bin/sh
kissat test/cnf/hard.cnf -q &
kissat test/cnf/hard.cnf -q &
wait
$ runlim -t 1 ./waiting.sh 
[runlim] version:       2.0.0rc9
[runlim] host:          priel
[runlim] time limit:        1 seconds
[runlim] real time limit:   311040000 seconds
[runlim] space limit:       64075 MB
[runlim] argv[0]:       ./waiting.sh
[runlim] start:         Wed Nov  8 15:09:55 2023
[runlim] child:         157424
[runlim] end:           Wed Nov  8 15:09:57 2023
[runlim] status:        out of time
[runlim] result:        2
[runlim] children:      3
[runlim] processes:     107
[runlim] real:          1.73 seconds
[runlim] time:          1.20 seconds
[runlim] space:         19 MB
[runlim] load:          0.09 maximum
[runlim] samples:       6
$ ps|grep kissat|grep -v grep
$ 
msoos commented 9 months ago

Ah, glad to hear it’s confirmed! Love the tool, accidentally stumbled into this while trying to hack it to show process names it’s still waiting for :) I might do a PR for the changes I made. Can be helpful to debug what processes are being waited for. Colleagues had some issues with bad process behaviour (nothing to do with runlim of course).

Thanks for looking into this and thank you for the tool of course,

Mate

On Wed 8. Nov 2023 at 15:10, Armin Biere @.***> wrote:

As well as the shell script kept on waiting (does not produce zombies):

$ cat waiting.sh

!/bin/sh

kissat test/cnf/hard.cnf -q & kissat test/cnf/hard.cnf -q & wait $ runlim -t 1 ./waiting.sh [runlim] version: 2.0.0rc9 [runlim] host: priel [runlim] time limit: 1 seconds [runlim] real time limit: 311040000 seconds [runlim] space limit: 64075 MB

[runlim] start: Wed Nov 8 15:09:55 2023 [runlim] child: 157424 [runlim] end: Wed Nov 8 15:09:57 2023 [runlim] status: out of time [runlim] result: 2 [runlim] children: 3 [runlim] processes: 107 [runlim] real: 1.73 seconds [runlim] time: 1.20 seconds [runlim] space: 19 MB [runlim] load: 0.09 maximum [runlim] samples: 6 $ ps|grep kissat|grep -v grep $

— Reply to this email directly, view it on GitHub https://github.com/arminbiere/runlim/issues/7#issuecomment-1801967021, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKF4OILZI4LK32NDWKSUODYDOHGFAVCNFSM6AAAAAA7CCX2MOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBRHE3DOMBSGE . You are receiving this because you authored the thread.Message ID: @.***>

arminbiere commented 9 months ago

It is a general issue with zombie child processes.


$ cat one-zombie.sh 
#!/bin/sh
kissat test/cnf/hard.cnf -q &
$ runlim -t 1 ./one-zombie.sh 
[runlim] version:       2.0.0rc10
[runlim] host:          priel
[runlim] time limit:        1 seconds
[runlim] real time limit:   311040000 seconds
[runlim] space limit:       64075 MB
[runlim] argv[0]:       ./one-zombie.sh
[runlim] start:         Wed Nov  8 16:42:45 2023
[runlim] child:         159635
[runlim] end:           Wed Nov  8 16:42:45 2023
[runlim] status:        ok
[runlim] result:        0
[runlim] children:      0
[runlim] processes:     109
[runlim] real:          0.02 seconds
[runlim] time:          0.00 seconds
[runlim] space:         0 MB
[runlim] load:          1.21 maximum
[runlim] samples:       1
$ ps aux|grep kissat|grep -v grep
biere     159636  101  0.0  50516 29332 pts/4    R    16:42   0:22 kissat test/cnf/hard.cnf -q
$  killall kissat
$ 
arminbiere commented 9 months ago

Those zombies are attached to the system daemon which does not have the same process group as the original child process and all its children (the kissat zombies). I call those now root zombies and they are sampled and killed as the original child process. This prevents the (not killing) issue reported here but I am not sure it properly samples the actual resource usage of all the zombies too.

$ cat four-zombies-and-sleep.sh 
#!/bin/sh
kissat test/cnf/hard.cnf -q &
kissat test/cnf/hard.cnf -q &
kissat test/cnf/hard.cnf -q &
kissat test/cnf/hard.cnf -q &
sleep 1
$ runlim -d -t 2 -o /tmp/log ./four-zombies-and-sleep.sh 
$ tail -20 /tmp/log
[runlim] connect:       163997 -> 164002
[runlim] connected:     6 processes
[runlim] kill with SIGKILL :    163998
[runlim] kill with SIGKILL :    163999
[runlim] kill with SIGKILL :    164000
[runlim] kill with SIGKILL :    164001
[runlim] kill with SIGKILL :    164002
[runlim] kill with SIGKILL :    163997
[runlim] killed:        6 processes
[runlim] end:           Wed Nov  8 18:35:16 2023
[runlim] status:        out of time
[runlim] result:        2
[runlim] children:      6
[runlim] processes:     9
[runlim] real:          1.77 seconds
[runlim] time:          2.41 seconds
[runlim] space:         38 MB
[runlim] load:          0.58 maximum
[runlim] samples:       6
[runlim] reports:       6
$ ps aux|grep kissat|grep -v grep
$

This now became 2.0.0rc10.

arminbiere commented 9 months ago

Well the test case in the previous comment is pretty involved in terms what is really expected. The test script wants to return after roughly one second real time but it uses more than two seconds accumulated process time and thus is killed by the sampler from runlim. Then runlim will try to kill the other child processes too (the sleep and the four kissat). The killing routine has some delay and repeats. I do not think it accurately knows when it can stop. Therefore it is killing all the job related children in a loop with an exponential increasing delay (until 1000ms). This gives pretty odd wall clock time '1.77' and '2.41' accumulated process time (for all the parallel running processes).

Here is a use case which shows that at least when the script terminates itself (after slepping a second) all makes sense.

$ runlim ./four-zombies-and-sleep.sh 
[runlim] version:       2.0.0rc9
[runlim] host:          priel
[runlim] time limit:        311040000 seconds
[runlim] real time limit:   311040000 seconds
[runlim] space limit:       64075 MB
[runlim] argv[0]:       ./four-zombies-and-sleep.sh
[runlim] start:         Wed Nov  8 18:50:35 2023
[runlim] child:         164222
[runlim] end:           Wed Nov  8 18:50:37 2023
[runlim] status:        ok
[runlim] result:        0
[runlim] children:      6
[runlim] processes:     10
[runlim] real:          1.00 seconds
[runlim] time:          3.97 seconds
[runlim] space:         43 MB
[runlim] load:          0.81 maximum
[runlim] samples:       10
arminbiere commented 9 months ago

The fix for this issue now results in premature termination of a shell loop piped into another command.

While

$ while true; do runlim -t 1 kissat test/cnf/hard.cnf -q; done

runs forever, piping the output of this loop into cat (this is all bash)

$ while true; do runlim -t 1 kissat test/cnf/hard.cnf -q; done|cat

terminates (the loop and the pipe) after runlim returns (probably after killing the surounding pipe process too).

Same problem with

$ ( runlim -t 1 kissat test/cnf/hard.cnf -q; runlim -t 1 kissat test/cnf/hard.cnf -q ) |cat
msoos commented 9 months ago

Oh wow. I'm sorry I caused so much headache. It's maybe also reasonable to document that runlim does not work in such weird scenarios. Maybe there is no good solution :S Sorry again!

arminbiere commented 9 months ago

I think I finally figured out how to do that. Actually added some more test cases in test similar to those we discussed in this issue. Now manually testing them looks good (for those discussed here the old ones - particularly a.sh). I am running it in production mode for a while (on two clusters) and then would call it the 2.0.0 release. So this was really helpful!

msoos commented 9 months ago

Nice! Glad to hear I could help :)