Xtigyro opened this issue 4 years ago
Can you clarify your use case a little bit, and explain what you're using Supercronic for here? I'm a little confused by exactly what you're trying to achieve. Thanks!
@krallin At Puppetlabs we're about to include supercronic in the official r10k Docker image. We need it so we can run r10k as a sidecar container in K8s using our Puppet Server Helm chart. When the r10k run fails, we need the container itself to declare in its run status that there's a problem. It would be handy if supercronic marked its run as failed, so that K8s would restart it, making it clear that there was a problem with the r10k run.
Right now we work around this like so (https://github.com/Xtigyro/puppetserver-helm-chart/blob/edge/templates/r10k-code.configmap.yaml#L25):
# Run r10k and capture its output and exit status
/docker-entrypoint.sh deploy environment --config /etc/puppetlabs/puppet/r10k_code.yaml \
  --puppetfile {{ template "r10k.code.args" . }} > ~/.r10k_code_cronjob.out 2>&1
retVal=$?
# Record success or failure as the presence or absence of a marker file
if [ "$retVal" -eq 0 ]; then
  touch ~/.r10k_code_cronjob.success > /dev/null 2>&1
else
  rm -f ~/.r10k_code_cronjob.success > /dev/null 2>&1
fi
exit $retVal
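For context, the marker file lends itself to a simple exec-style liveness check; a minimal sketch, assuming the probe runs in the same container and uses the path above:

# Hypothetical K8s exec liveness command: healthy only while the marker file exists
test -f ~/.r10k_code_cronjob.success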
CC: @underscorgan @slconley
So, you're saying you want to run r10k on a periodic basis, and you want the container to exit if it ever fails. Is that correct?
Yes, correct. Thanks for your time, BTW!
@krallin Another option could be for supercronic to make the status of the scheduled jobs externally visible - and I mean not only by logging info to stdout.
If you set readiness/liveness K8s probes for the container that supercronic is always running in, you could check the state of the jobs via (just examples):
- a supercronic local log file that matches the name of the job;
- a supercronic status [name_of_your_job] command;
- a supercronic endpoint where the status of the scheduled jobs is available.
CC: @underscorgan @slconley @iristyle
The SIGTERM feature that is being asked for in the first post can already be emulated by wrapping the command in the crontab line in bash -c "original command here || kill -15 $PPID", where $PPID is the process ID of supercronic as provided by the shell. For a demo:
# ./supercronic -debug <(echo '* * * * * * * bash -c "false || kill -15 $PPID"')
INFO[2020-11-14T15:36:12+01:00] read crontab: /dev/fd/63
DEBU[2020-11-14T15:36:12+01:00] try parse(7): * * * * * * * bash -c "false || kill -15 $PPID"[0:13] = * * * * * * *
DEBU[2020-11-14T15:36:12+01:00] job will run next at 2020-11-14 15:36:13 +0100 CET job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
INFO[2020-11-14T15:36:13+01:00] starting iteration=0 job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
INFO[2020-11-14T15:36:13+01:00] received terminated, shutting down
INFO[2020-11-14T15:36:13+01:00] waiting for jobs to finish
INFO[2020-11-14T15:36:13+01:00] job succeeded iteration=0 job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
DEBU[2020-11-14T15:36:13+01:00] job will run next at 2020-11-14 15:36:14 +0100 CET job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
DEBU[2020-11-14T15:36:13+01:00] shutting down job.command="bash -c \"false || kill -15 $PPID\"" job.position=0 job.schedule="* * * * * * *"
INFO[2020-11-14T15:36:13+01:00] exiting
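Applied to the r10k command from earlier in the thread, the crontab line could look like the following sketch (the schedule is an assumption; the entrypoint path is copied from the workaround above):

# Terminate supercronic (and thus the container) whenever the r10k deploy fails
*/15 * * * * bash -c "/docker-entrypoint.sh deploy environment --config /etc/puppetlabs/puppet/r10k_code.yaml || kill -15 $PPID"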
Would that be too brittle?
At a bird's-eye level, terminating supercronic on failures seems to be intended as a means of signaling to the outside world. From a monitoring point of view, rather than waiting for someone to signal "I'm dead", checking for repeated "I'm alive" would probably make for more robust monitoring. So if the context is monitoring, either the existing Prometheus HTTP endpoint or a new HTTP status endpoint would probably serve the bigger picture best. What do you think?
@Xtigyro any thoughts?
@hartwork It's required so Kubernetes can monitor the Liveness/Readiness probes. A new HTTP status endpoint will be perfect!
@Xtigyro what about the bash wrapper and the existing Prometheus endpoint?
@hartwork The Bash wrapper is a bit dirty but can be used. The Prom endpoint cannot work with a vanilla K8s cluster.
@Xtigyro: can you clarify what you mean by:
The Prom endpoint cannot work with a vanilla K8s cluster.
I'm not sure I follow what cannot work there. FWIW, I do think exposing a status API would be reasonable to do, but I also would like to make sure that whatever does not work with the Prometheus endpoint would work with a status API :)
The Prom endpoint cannot work with a vanilla K8s cluster.
I think he refers to Kubernetes liveness HTTP requests. Kubernetes looks at HTTP status codes and considers 200 <= x < 400 a success. It doesn't seem to support checking for the absence of the word supercronic_failed_executions in the HTTP response body. @Xtigyro could you confirm or correct me on this?
It might be possible to build a Kubernetes liveness command to do the same, based on nothing but Bash, wget and supercronic's existing Prometheus metrics endpoint. I imagine something like this to work well in practice:
bash -c '( set -o pipefail ; wget -O- http://localhost:9746/metrics 2>/dev/null | fgrep supercronic_ | { ! grep -q failed_executions ; } )'
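Spelled out with its exit-status semantics, the same check could be used like this (a sketch; it assumes supercronic was started with -prometheus-listen-address localhost:9746):

# Healthy: metrics reachable and no supercronic_ metric line mentions failed_executions
if ( set -o pipefail
     wget -O- http://localhost:9746/metrics 2>/dev/null \
       | fgrep supercronic_ \
       | { ! grep -q failed_executions ; } ); then
  echo healthy    # exit code 0: the K8s probe passes
else
  echo unhealthy  # non-zero exit code: the K8s probe fails, container gets restarted
  exit 1
fi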
What do you think?
@hartwork @krallin Yup - spot on - that was exactly my point.
I'm gonna test that command and let you know - it looks alright to me, should be useful.
On a side note - in the Puppet Server Helm chart we've worked around it like this: https://github.com/puppetlabs/puppetserver-helm-chart/blob/master/templates/r10k-code.configmap.yaml#L20_L40
CC: @underscorgan @slconley @Iristyle @scottcressi @mwaggett @nwolfe @adrienthebo @dhollinger @binford2k @alannab
http://localhost:9746/metrics
@hartwork A question - is this exposed by default?
~ $ wget -O- http://localhost:9746/metrics
Connecting to localhost:9746 (127.0.0.1:9746)
wget: can't connect to remote host (127.0.0.1): Connection refused
@hartwork A question - is this exposed by default?
It needs something like supercronic -prometheus-listen-address localhost:9746 [..] but that's all. The :9746 can be omitted if your copy of supercronic is fresh enough to include PR #81.
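For example (a sketch; the crontab path is illustrative):

# Start supercronic with the Prometheus metrics endpoint enabled
supercronic -prometheus-listen-address localhost:9746 /etc/supercronic/crontab

# From inside the container, the metrics are then reachable via:
wget -O- http://localhost:9746/metrics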
Guys, replacing one Shell command with another doesn't make enough of a difference for us.
One of the options I suggested earlier might improve the current situation.
If any of those would be too time-consuming to develop, we will completely understand and we should just close the feature request.
CC: @underscorgan @slconley @Iristyle @scottcressi @mwaggett @nwolfe @adrienthebo @dhollinger @binford2k @alannab
Guys, replacing one Shell command with another doesn't make enough of a difference for us.
@Xtigyro could you elaborate? You asked for a liveness check for vanilla Kubernetes, and it seems like the command above would satisfy that request. If that's not the actual set of requirements, please share the current set of requirements. Thank you.
The current workaround we use is essentially a much simpler Shell command which is also used in a different way - so it's more resource-efficient. We were looking for a real status health check compatible with K8s probes.
The current situation is good enough anyway - so if such a per-job status health check won't be beneficial for supercronic in general, then we'd better just leave it the way it is. It's still a great piece of software which we're gonna keep using.
All good, in other words. 🙇‍♂️
I would like to THANK YOU for this brilliant piece of software!
I have a use case in which supercronic acts as the entry point of a container. That leads to the necessity to restart the container if the job that is run by supercronic fails.