aptible / supercronic

Cron for containers
MIT License

Optional SIGTERM at Job Failure #60

Open Xtigyro opened 4 years ago

Xtigyro commented 4 years ago

I would like to THANK YOU for this brilliant piece of software!

I have a use case in which supercronic acts as the entry-point of a container. That makes it necessary to restart the container if the job run by supercronic fails.

krallin commented 4 years ago

Can you clarify your use case a little bit, and explain what you're using Supercronic for here? I'm a little confused by exactly what you're trying to achieve. Thanks!

Xtigyro commented 4 years ago

@krallin At Puppetlabs we're about to include supercronic in the official r10k Docker image. We need it so we can run r10k as a sidecar container in K8s using our Puppet Server Helm chart.

When the r10k run fails, we need the container itself to report in its run status that there's a problem. It would be handy if it marked its run as failed, so that K8s would restart it and make it clear that the r10k run had a problem.

Right now we work around this as follows (https://github.com/Xtigyro/puppetserver-helm-chart/blob/edge/templates/r10k-code.configmap.yaml#L25):

    /docker-entrypoint.sh deploy environment --config /etc/puppetlabs/puppet/r10k_code.yaml \
    --puppetfile {{ template "r10k.code.args" . }} > ~/.r10k_code_cronjob.out 2>&1
    retVal=$?
    if [ "$retVal" -eq "0" ]; then
      touch ~/.r10k_code_cronjob.success > /dev/null 2>&1
    else
      rm ~/.r10k_code_cronjob.success > /dev/null 2>&1
    fi
    exit $retVal

CC: @underscorgan @slconley

krallin commented 4 years ago

So, you're saying you want to run r10k on a periodic basis, and you want the container to exit if it ever fails. Is that correct?

Xtigyro commented 4 years ago

Yes, correct. Thanks for your time, BTW!

Xtigyro commented 4 years ago

@krallin Another option would be for supercronic to expose the status of the scheduled jobs externally - and I mean not only by logging info to stdout.

If you set readiness/liveness K8s probes for the container that supercronic is always running in, you can check the state of the jobs (just examples):
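
For instance (just a sketch, not something we run today; it assumes the job maintains a success marker file as in the workaround above, and the file name and freshness window are illustrative), a K8s exec probe could run commands like:

    # readiness-style check: the last r10k run left the success marker in place
    test -f ~/.r10k_code_cronjob.success

    # liveness-style check: the marker was refreshed recently (here: within 60 minutes);
    # "find" prints nothing for an older file, so grep turns that into a failing exit code
    find ~/.r10k_code_cronjob.success -mmin -60 2>/dev/null | grep -q .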

CC: @underscorgan @slconley @iristyle

hartwork commented 3 years ago

The SIGTERM feature that is being asked for in the first post can already be emulated by wrapping the command in the crontab line with bash -c "original command here || kill -15 $PPID", where $PPID is the process ID of supercronic as provided by the shell. For a demo:

# ./supercronic -debug <(echo '* * * * * * * bash -c "false || kill -15 $PPID"')
INFO[2020-11-14T15:36:12+01:00] read crontab: /dev/fd/63                     
DEBU[2020-11-14T15:36:12+01:00] try parse(7): * * * * * * * bash -c "false || kill -15 $PPID"[0:13] = * * * * * * * 
DEBU[2020-11-14T15:36:12+01:00] job will run next at 2020-11-14 15:36:13 +0100 CET  job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
INFO[2020-11-14T15:36:13+01:00] starting                                      iteration=0 job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
INFO[2020-11-14T15:36:13+01:00] received terminated, shutting down           
INFO[2020-11-14T15:36:13+01:00] waiting for jobs to finish                   
INFO[2020-11-14T15:36:13+01:00] job succeeded                                 iteration=0 job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
DEBU[2020-11-14T15:36:13+01:00] job will run next at 2020-11-14 15:36:14 +0100 CET  job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
DEBU[2020-11-14T15:36:13+01:00] shutting down                                 job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
INFO[2020-11-14T15:36:13+01:00] exiting

Would that be too brittle?

At a bird's-eye level, terminating supercronic on failures seems to be intended as a means of signaling to the outer world. From a monitoring point of view, rather than waiting for someone to signal "I'm dead", checking for a repeated "I'm alive" would probably make for more robust monitoring. So if the context is monitoring, either the existing Prometheus HTTP endpoint or a new HTTP status endpoint would probably serve the bigger picture best. What do you think?

hartwork commented 3 years ago

@Xtigyro any thoughts?

Xtigyro commented 3 years ago

@hartwork It's needed so Kubernetes liveness/readiness probes can monitor the container. A new HTTP status endpoint would be perfect!

hartwork commented 3 years ago

@Xtigyro what about the bash wrapper and the existing Prometheus endpoint?

Xtigyro commented 3 years ago

@hartwork The Bash wrapper is a bit dirty but can be used. The Prom endpoint cannot work with a vanilla K8s cluster.

krallin commented 3 years ago

@Xtigyro : can you clarify what you mean by:

The Prom endpoint cannot work with a vanilla K8s cluster.

I'm not sure I follow what cannot work there. FWIW, I do think exposing a status API would be reasonable to do, but I also would like to make sure that whatever does not work with the Prometheus endpoint would work with a status API :)

hartwork commented 3 years ago

The Prom endpoint cannot work with a vanilla K8s cluster.

I think he refers to Kubernetes liveness HTTP requests. Kubernetes looks at HTTP status codes and considers 200 <= x < 400 as success. It doesn't seem to support checking for the absence of the word supercronic_failed_executions in the HTTP response body. @Xtigyro could you confirm or correct me on this?

It might be possible to build a Kubernetes liveness command to do the same, based on nothing but Bash, wget and supercronic's existing Prometheus metrics endpoint. I imagine something like this would work well in practice:

bash -c '( set -o pipefail ; wget -O- http://localhost:9746/metrics 2>/dev/null | fgrep supercronic_ | { ! grep -q failed_executions ; } )'
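
The same check spelled out, e.g. as a small script an exec probe could call (just a sketch; the port matches the example above):

    #!/bin/bash
    # fail the whole pipeline if any stage (wget, fgrep, the negated grep) fails
    set -o pipefail
    # fetch the metrics, keep only supercronic's own metric lines,
    # and exit non-zero as soon as a failed_executions metric shows up
    wget -O- http://localhost:9746/metrics 2>/dev/null \
        | fgrep supercronic_ \
        | { ! grep -q failed_executions ; }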

What do you think?

Xtigyro commented 3 years ago

@hartwork @krallin Yup - spot on - that was exactly my point.

I'm gonna test that command and let you know - it looks alright to me, should be useful.

On a side note - in the Puppet Server Helm chart we've worked around it like this: https://github.com/puppetlabs/puppetserver-helm-chart/blob/master/templates/r10k-code.configmap.yaml#L20_L40

CC: @underscorgan @slconley @Iristyle @scottcressi @mwaggett @nwolfe @adrienthebo @dhollinger @binford2k @alannab

Xtigyro commented 3 years ago

http://localhost:9746/metrics

@hartwork A question - is this exposed by default?

~ $ wget -O- http://localhost:9746/metrics
Connecting to localhost:9746 (127.0.0.1:9746)
wget: can't connect to remote host (127.0.0.1): Connection refused

hartwork commented 3 years ago

@hartwork A question - is this exposed by default?

It needs something like supercronic -prometheus-listen-address localhost:9746 [..] but that's all. The :9746 can be omitted if your copy of supercronic is fresh enough to include PR #81.
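
For completeness, a quick sketch of that (the crontab path is just an example, and supercronic is backgrounded here only so the next command can run in the same shell):

    # start supercronic with the metrics endpoint enabled
    supercronic -prometheus-listen-address localhost:9746 /etc/crontabs/root &

    # the endpoint then answers locally and can back the liveness command above
    wget -qO- http://localhost:9746/metrics | grep '^supercronic_'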

Xtigyro commented 3 years ago

Guys, replacing one Shell command with another doesn't make enough of a difference for us.

One of these might improve the current situation:

If any of those would be too time-consuming to develop - we completely understand, and we should just close the feature request.

CC: @underscorgan @slconley @Iristyle @scottcressi @mwaggett @nwolfe @adrienthebo @dhollinger @binford2k @alannab

hartwork commented 3 years ago

Guys, replacing one Shell command with another doesn't make enough of a difference for us.

@Xtigyro could you elaborate? You asked for a liveness check for vanilla Kubernetes, and it seems like the command above would satisfy that request. If that's not the actual set of requirements, please share the current set of requirements. Thank you.

Xtigyro commented 3 years ago

The current workaround we use is essentially a much simpler shell command which is also used in a different way, so it's more resource-efficient. We were looking for a real status health check compatible with K8s probes.

The current situation is good enough anyway - so if such a per-job status health check wouldn't be beneficial to supercronic in general, then we'd better just leave it the way it is. It's still a great piece of software which we're going to keep using.

All good, in other words. 🙇‍♂️