TritonDataCenter / containerpilot

A service for autodiscovery and configuration of applications running in containers
Mozilla Public License 2.0

Allow setting timeout to "never" for a job running on an interval. #525

Closed eliasbrange closed 1 year ago

eliasbrange commented 7 years ago

We have multiple jobs that run every 15–30 seconds. In very rare cases, these jobs get stuck. We do not want the jobs to be terminated automatically, due to the uncertainty of the job states. Right now we set a huge timeout that will never be reached; it would instead be nice to be able to set never.

One example: every 30 seconds we poll an S3 bucket for .bak files and restore them if there are any. If the restore fails and gets stuck somehow, we do not want the job to start over until we have fixed the problem.
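A sketch of what we run today with the huge-timeout workaround (the job name and script path here are placeholders, but the shape follows the ContainerPilot 3 jobs config):

```json5
{
  jobs: [
    {
      name: "restore-backups",          // placeholder job name
      exec: "/scripts/restore-s3.sh",   // placeholder for the real restore script
      when: {
        interval: "30s"                 // poll the bucket every 30 seconds
      },
      timeout: "1000000h"               // huge value standing in for "never"
    }
  ]
}
```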

jwreagor commented 7 years ago

We really appreciate your feedback @eliasbrange. I want to make sure we talk this through to figure out what your needs/options are.

AFAIK most, if not all, timeout configuration in ContainerPilot is optional. Not including a timeout should give you the never behavior you're requesting.

Another solution is to bake the exception/error handling into the process the job is executing. ContainerPilot has timeout features in order to manage its own performance, but any application it runs should handle its own processing. That leads me to want to understand your setup more closely.

It would be helpful if you could post a test configuration (or test repo I can run myself) so we can reference a working solution in this issue. Are you using version 2.x or 3.x?

eliasbrange commented 7 years ago

Thanks for looking into it. I will try to be clearer:

According to the documentation, the timeout defaults to the when.interval value. Our use case is that we run a job on an interval every 30 seconds, but some executions of that job may take longer than 30 seconds, which leads to a timeout unless we explicitly set a higher value. Being able to tell the job to never time out, even when running on an interval, is what we are looking for.

I can post a configuration tomorrow, it's kinda late here now!

We're using version 3.4.2.

jwreagor commented 7 years ago

OK, I was able to confirm the behavior from your last description. I've built a simple example setup that demonstrates the kill once the default timeout hits.

https://gist.github.com/cheapRoc/fb5a81ff590e898c5561d1851c827d9e

Here's the log output showing the process kill when a sleep runs longer than the set interval.

time="2017-11-09T22:34:25Z" level=debug msg="control: initialized router for control server"
time="2017-11-09T22:34:25Z" level=debug msg="control: listening to /tmp/cp-single.socket"
time="2017-11-09T22:34:25Z" level=debug msg="event: {Startup global}"
time="2017-11-09T22:34:25Z" level=debug msg="test-repeats.Run start"
time="2017-11-09T22:34:25Z" level=info msg="control: serving at /tmp/cp-single.socket"
time="2017-11-09T22:34:25Z" level=info msg="Going to sleep for 24s" job=test-repeats pid=15
time="2017-11-09T22:34:25Z" level=info msg="Thu, 09 Nov 2017 22:34:25 +0000" job=test-repeats pid=15
time="2017-11-09T22:34:35Z" level=warning msg="test-repeats timeout after 10s: '[]'"
time="2017-11-09T22:34:35Z" level=debug msg=test-repeats.kill
time="2017-11-09T22:34:35Z" level=debug msg="killing command 'test-repeats' at pid: 15"
time="2017-11-09T22:34:35Z" level=debug msg="test-repeats.Run start"
time="2017-11-09T22:34:35Z" level=error msg="test-repeats exited with error: signal: killed"
time="2017-11-09T22:34:35Z" level=debug msg="event: {ExitFailed test-repeats}"
time="2017-11-09T22:34:35Z" level=debug msg="event: {Error test-repeats: signal: killed}"
time="2017-11-09T22:34:35Z" level=debug msg="test-repeats.Run end"
time="2017-11-09T22:34:35Z" level=info msg="Going to sleep for 2s" job=test-repeats pid=22
time="2017-11-09T22:34:35Z" level=info msg="Thu, 09 Nov 2017 22:34:35 +0000" job=test-repeats pid=22
time="2017-11-09T22:34:37Z" level=info msg="Thu, 09 Nov 2017 22:34:37 +0000" job=test-repeats pid=22
time="2017-11-09T22:34:37Z" level=debug msg="test-repeats exited without error"
time="2017-11-09T22:34:37Z" level=debug msg="event: {ExitSuccess test-repeats}"
time="2017-11-09T22:34:37Z" level=debug msg="test-repeats.Run end"
time="2017-11-09T22:34:37Z" level=debug msg=test-repeats.term
time="2017-11-09T22:34:37Z" level=debug msg="terminating command 'test-repeats' at pid: 22"

You can clearly see that with a default timeout of 10s and a sleep of 24s, ContainerPilot sends a SIGKILL in order to run the next iteration of the job.

You can see the inverse as well: with a timeout of 10s and a sleep of 2s, the job exits without error.

I'm unsure what can be done here, given that this is somewhat routine behavior. But it's possible it's a missing feature, and I'll dig into the implementation next to see.

eliasbrange commented 7 years ago

It would be great to be able to set the timeout to never. Right now we set it to something insane like 1000000h, which is basically the same thing.

Also, doesn't it send a SIGKILL and not a SIGTERM?

jwreagor commented 6 years ago

Bad choice of words. You're correct, it's a SIGKILL when the timeout hits and a SIGTERM when shutting down ContainerPilot entirely.

jwreagor commented 6 years ago

I thought this deserved an update. I've added the new behavior in a local branch, but I'm also seeing a sporadic race condition that I'm still trying to understand. It looks like negating the timeout causes global shutdown to block while jobs continue to run. This is something I've seen before, so I'm starting to understand some of the problem, but not all of it. Still investigating.

eliasbrange commented 6 years ago

Thanks for looking into it, @cheapRoc.

jwreagor commented 6 years ago

Now that the race condition seems to have subsided, I can more easily describe why this might not be as easy as it looks.

First, the good news: implementing the configuration change was simple. Setting a job's timeout to never now negates the default timeout (which is otherwise set equal to when.interval). This gives the desired effect of allowing jobs to run through their timeout.
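With that change, a job opting out of the kill would look something like this (the exec path is a placeholder; the job name follows the gist above):

```json5
{
  jobs: [
    {
      name: "test-repeats",
      exec: "/scripts/long-task.sh",  // placeholder for the actual command
      when: {
        interval: "10s"
      },
      timeout: "never"  // proposed: never kill the job, even past the interval
    }
  ]
}
```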

On the flip side, with the change I've described, jobs are still published to run regardless of whether they're already running. This happens because removing the timeout also removes what previously guaranteed serial executions of a job. ContainerPilot sets up a timer that runs the job at the specified when.interval, and without the timeout it has no way of pausing that timer and preventing subsequent runs from starting.

This led to a race condition in another area, and I'll open a patch to address that. I'll keep testing, but I can say it hasn't happened since I made those changes, which is semi-good news.

So this won't be a quick fix, and I'll have to punt on it for the moment to address other issues.

@eliasbrange Have you tried running your job with restarts instead of when.interval? Is there any reason that wouldn't work? You should be able to wrap that job in a shell script and stick a sleep at the beginning to simulate the interval setting.
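A minimal sketch of that workaround (the echo stands in for the real restore step, and the interval is shortened to 1s for illustration):

```shell
#!/bin/sh
# Hypothetical wrapper: simulate the polling interval with a leading
# sleep, then do the work exactly once and exit. A job configured with
# restarts (instead of when.interval) re-executes this script after each
# exit, so iterations can never overlap and no timeout kills a slow run.

run_once() {
  sleep 1                # stand-in for the 30s polling interval
  echo "polling bucket"  # placeholder for the real S3 restore step
}

run_once
```

Paired with a job that sets restarts: "unlimited" and no when.interval, each completed run triggers the next, which gives the interval behavior without the kill.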

eliasbrange commented 6 years ago

We thought about wrapping our scripts with a sleep, but felt that using the timeout configuration rather than restarts was more explicit. I guess both workarounds work for now.

jwreagor commented 6 years ago

I'm going to keep this issue open, because it would be beneficial to optimize how we track running jobs. Plus, there are other issues that involve changing timers and timeouts, so I'm hopeful of returning to this in due time.