arminbiere / runlim

Configure delay between SIGTERM and SIGKILL #1

Closed rkaminsk closed 2 years ago

rkaminsk commented 2 years ago

We have been using the runsolver tool in the past and would like to switch to runlim due to issues with that tool. We were using one particular runsolver feature that allows setting a delay between sending the SIGTERM and SIGKILL signals, in case the system takes a little longer to shut down. As far as I can see, the delay between the two signals in runlim is hard-coded.

Would it be possible to provide an option to configure this delay, so that after, say, n seconds a SIGTERM is sent and after n+delay seconds a SIGKILL is sent?
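For readers unfamiliar with the pattern, the requested behaviour can be sketched roughly as follows. This is only an illustration under assumptions: `terminate_with_grace` is a hypothetical helper, not runlim or runsolver code, and the 10 ms polling interval is an arbitrary choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not part of runlim): send SIGTERM to a child,
   wait up to delay_ms for it to exit, then fall back to SIGKILL.
   Returns the signal that ended the child, 0 if it exited normally,
   or -1 on error. */
int terminate_with_grace(pid_t pid, long delay_ms) {
  int status;
  if (kill(pid, SIGTERM) != 0) return -1;

  /* Poll in 10 ms steps so a child that reacts to SIGTERM is reaped early. */
  for (long waited = 0; waited < delay_ms; waited += 10) {
    pid_t r = waitpid(pid, &status, WNOHANG);
    if (r == pid) return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
    if (r < 0) return -1;
    struct timespec ts = {0, 10 * 1000 * 1000};
    nanosleep(&ts, NULL);
  }

  /* Grace period is over: SIGKILL cannot be caught or ignored. */
  if (kill(pid, SIGKILL) != 0) return -1;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A child that ignores SIGTERM would be ended by SIGKILL once the grace period has passed, while a well-behaved child is reaped as soon as it exits.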

arminbiere commented 2 years ago

Just added a '--kill-delay=' option. The default is 512 (milliseconds), which effectively adds a delay of about 1 second (as it goes down exponentially to 1 before really killing the solver). This adds that one second to the wall-clock / real time used, which might be an issue if you are not careful.
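The "about 1 second" figure follows from summing the halved delays. A minimal sketch of that arithmetic, assuming the delay is simply halved each round until it reaches 1 (`total_halving_delay` is a hypothetical name, not a runlim function):

```c
/* Sum of start + start/2 + start/4 + ... + 1 using integer halving.
   Assumption: one wait per round, delay halved after each round. */
long total_halving_delay(long start_ms) {
  long total = 0;
  for (long d = start_ms; d >= 1; d /= 2)
    total += d;
  return total;
}
```

For the default of 512 this gives 512 + 256 + ... + 1 = 1023 ms, so in this model the total added wait is roughly twice the configured value.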

rkaminsk commented 2 years ago

Thanks, we'll give it a try!

rkaminsk commented 2 years ago

Just had a quick look at the code. Do I understand it right that if the configured delay is too large, no KILL signal is sent because the number of rounds is limited to 9? It looks a bit like the 9 rounds are connected to the previous value 512 = 2^9.

arminbiere commented 2 years ago

I also had a look at the logic I implemented some time ago. The idea was to kill with SIGTERM close to the deadline and before with SIGKILL, but it seems I got the comparison wrong, and it does it the opposite way. I will change the check and then push a new release candidate.

arminbiere commented 2 years ago

So I changed the order of the test but on my laptop it looks a bit fishy (the way it now behaves). I have to test it on the cluster later.

rkaminsk commented 2 years ago

Hello, are you sure it was not right before? Did you overlook the *1000? The tool was working, after all.

# run assuming the command does not react to a sigterm
ms = 512*1000

rounds = 0
ms > 2000: TERM
ms /= 2: ms = 256,000

rounds = 1
ms > 2000: TERM
ms /= 2: ms = 128,000

...

rounds = 7
ms > 2000: TERM
ms /= 2: ms = 2,000

rounds = 8
ms > 2000: TERM
ms /= 2: ms = 1,000

rounds = 9
ms > 2000: KILL
ms /= 2: ms = 500
# break in next round
# because there should be no more processes after a KILL

rounds = 10
ms > 2000: KILL
ms /= 2: ms = 250
# break because of rounds

To make the delay option work, maybe simply terminate after the KILL signal has been used?

if (killer == kill_process) break;

The only difference to the code before would be if some process somehow escapes being killed in the first round because there would be one more round.

arminbiere commented 2 years ago

Thanks for checking. It really seems that I got it right the first time, so that commit was wrong, and I have reverted this incorrect 'fix'. Thank you also for the suggestion of using 'killer == kill_process'. Instead, I use the logic that at 2000 ms and above I send 'SIGTERM' and below that 'SIGKILL', and I stop the killing at 1000 ms. If one wants to have more 'SIGKILL' signals before giving up killing, then one could simply set the '2000' to something larger, say 4000.
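The rule described above can be modelled in a few lines. This is a sketch of the described behaviour only, not the actual runlim source: `plan_escalation` and its parameters are hypothetical names, the delay is assumed to be halved every round, the child is assumed never to exit, and values are in the same units as the trace earlier in the thread (which starts from 512*1000).

```c
/* Counts how many rounds would send SIGTERM and how many SIGKILL. */
typedef struct {
  int term_rounds;
  int kill_rounds;
} Escalation;

/* Assumed rule: at term_threshold and above use SIGTERM, below it use
   SIGKILL; halve the delay each round; stop once the delay has dropped
   to stop_at (or to 0, to guarantee termination). */
Escalation plan_escalation(long delay, long term_threshold, long stop_at) {
  Escalation e = {0, 0};
  for (;;) {
    if (delay >= term_threshold)
      e.term_rounds++;   /* polite request: SIGTERM */
    else
      e.kill_rounds++;   /* forced: SIGKILL */
    if (delay <= stop_at || delay == 0)
      break;
    delay /= 2;
  }
  return e;
}
```

In this model, starting from 512000 with a threshold of 2000 and a stop value of 1000 gives nine SIGTERM rounds followed by one SIGKILL round; raising the threshold to 4000 gives eight SIGTERM rounds and two SIGKILL rounds, which matches the remark about getting more 'SIGKILL' signals.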

rkaminsk commented 2 years ago

Thanks! This will work for us.

PS: Just cosmetics. The counter for the rounds could be removed.