Closed nsdavidson closed 7 years ago
howdy @nsdavidson could you please push an ad-hoc build through CI and place it somewhere externally consumable to enable testing?
If anyone is interested in testing this out, a RHEL 6 pushy-client build can be found here: https://s3-us-west-2.amazonaws.com/sce-pub/push-jobs-client-2.4.1%2B20170912123217-1.el6.x86_64.rpm
I can share other platform builds if needed as well.
Thanks for digging into this! Unfortunately it looks like we have minimal unit tests already, so we can forgo adding those for this fix. A couple comments to look at and then this will be good.
Installed the rpm for this new push-jobs client on two nodes running RHEL 6. When the push-jobs server was simply restarted (chef-server-ctl restart opscode-pushy-server), both nodes checked back in properly, i.e., knife node status showed both nodes as available. Unfortunately, if I kept the server down for a certain period of time and then restarted it, one of the two nodes did not check back in.
Thanks for the feedback @btm! Requested changes made.
@jittipa Hrmm...this fix is specifically for the case of the server being unavailable for more than a few seconds. Would it be possible for you to grab the push-client logs off the node that's not recovering?
@nsdavidson After almost 1 hour (15:21:51 - 16:17:30), the client finally checked back in and is now available.
2017-09-13_16:17:30.15059 INFO: [jobsserver] Config is now 3586.204102130747 seconds old. Reconfiguring / reloading keys ...
@nsdavidson can you publish a RHEL 7 or Amazon Linux version?
Amazon Linux is usually good with the RHEL6 packages
And what to do with the message? Retry after a restart? I had looked at something like this earlier as a pattern but got distracted.
@btm @markan I rolled in Mark's changes from PR #144 with some tweaks.
In addition to scheduling a reconfigure when the ZMQ socket times out, I also added that logic to a failed config download. I noticed that if we try to download a config and it fails after the standard 5 retries, the client gets into that wedged state, waiting for its next 3600-second reconfigure.
I tested this change and it is handling both short periods of the push-job server being offline and a complete server teardown/rebuild consistently.
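For anyone following along, here is a rough sketch of the "reconfigure soon after a failed download" idea described above. It is not the code from this PR: the class, method, and constant names are hypothetical, and the 30-second retry delay is an assumption chosen for illustration. Only the 5 retries and the 3600-second interval come from the discussion.

```ruby
# Illustrative only: names are hypothetical, not the push-jobs-client API.
CONFIG_LIFETIME = 3600 # seconds between routine reconfigures (from the thread)
MAX_RETRIES     = 5    # config download attempts (from the thread)
RETRY_DELAY     = 30   # assumed short delay before the next reconfigure

class ReconfigureScheduler
  attr_reader :next_reconfigure_at

  def initialize(now: Time.now)
    @next_reconfigure_at = now + CONFIG_LIFETIME
  end

  # fetch is any callable that returns a config or raises on failure.
  def refresh(fetch, now: Time.now)
    attempts = 0
    begin
      attempts += 1
      config = fetch.call
      # Success: go back to the routine schedule.
      @next_reconfigure_at = now + CONFIG_LIFETIME
      config
    rescue StandardError
      retry if attempts < MAX_RETRIES
      # All retries failed: schedule another reconfigure soon instead of
      # leaving the client wedged until the routine interval elapses.
      @next_reconfigure_at = now + RETRY_DELAY
      nil
    end
  end
end
```

The same "schedule a reconfigure soon" step could be called from the socket timeout path, so both failure modes recover in the same way.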
/cc @irvingpop
:shipit: !
Signed-off-by: Nolan Davidson <ndavidson@chef.io>
Description
As noted in #123, the push jobs client will sometimes hang on the send_string call when the push jobs server is restarted (or is replaced by another PJ server). After adding this timeout, I have consistently been able to restart and replace the PJ server without the client hanging.
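A minimal sketch of the general pattern, assuming the blocking send is wrapped with Ruby's standard Timeout module. This is not necessarily how the change in this PR is implemented (a socket-level send timeout is another option). FakeSocket and send_with_timeout are hypothetical names used so the example runs without a ZMQ dependency.

```ruby
require "timeout"

# Hypothetical stand-in for a socket whose send_string can block.
class FakeSocket
  attr_reader :sent

  def initialize(delay: 0)
    @delay = delay
    @sent = []
  end

  def send_string(message)
    sleep(@delay) # simulates a send that hangs while the server is away
    @sent << message
    true
  end
end

# Returns true if the message was sent within `limit` seconds,
# false if the send timed out.
def send_with_timeout(socket, message, limit)
  Timeout.timeout(limit) { socket.send_string(message) }
  true
rescue Timeout::Error
  false
end
```

A caller can treat a false return as the signal that the server is gone and trigger a reconfigure rather than blocking indefinitely.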
Issues Resolved
#123
Check List