nsdavidson commented 7 years ago

Signed-off-by: Nolan Davidson <ndavidson@chef.io>

Description

As noted in #123, the push jobs client will sometimes hang on the send_string call when the push jobs server is restarted (or is replaced by another PJ server). After adding this timeout, I have consistently been able to restart and replace the PJ server without the client hanging.
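To illustrate the idea, here is a minimal sketch of wrapping a blocking send in a timeout using Ruby's standard `Timeout` module. This is not the code from this PR: `FakeSocket`, `send_with_timeout`, and the 0.2 second limit are hypothetical names and values chosen for the example, and the fake socket only simulates a send that blocks.

```ruby
require "timeout"

# Hypothetical stand-in for a socket whose send_string can block for a long
# time when the server has gone away. Not the real ZeroMQ socket class.
class FakeSocket
  def initialize(block_for:)
    @block_for = block_for
  end

  def send_string(_message)
    sleep(@block_for) # simulate a send that blocks
    true
  end
end

# Hypothetical helper: returns true if the send finished within `seconds`,
# false if it had to be abandoned.
def send_with_timeout(socket, message, seconds)
  Timeout.timeout(seconds) do
    socket.send_string(message)
  end
  true
rescue Timeout::Error
  false
end

healthy_result = send_with_timeout(FakeSocket.new(block_for: 0), "heartbeat", 0.2)
wedged_result  = send_with_timeout(FakeSocket.new(block_for: 5), "heartbeat", 0.2)
```

With a helper like this, the caller gets control back when the send stalls and can decide what to do next (for example, reconnect), instead of hanging indefinitely.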

Issues Resolved

#123

Check List

[ ] New functionality includes tests
[ ] All tests pass
[ ] CHANGELOG.md has been updated
[ ] All commits have been signed-off for the Developer Certificate of Origin. See https://github.com/chef/chef/blob/master/CONTRIBUTING.md#developer-certification-of-origin-dco

irvingpop commented 7 years ago

howdy @nsdavidson could you please push an ad-hoc build through CI and place it somewhere externally consumable to enable testing?

nsdavidson commented 7 years ago

If anyone is interested in testing this out, a RHEL 6 pushy-client build can be found here: https://s3-us-west-2.amazonaws.com/sce-pub/push-jobs-client-2.4.1%2B20170912123217-1.el6.x86_64.rpm

I can share other platform builds if needed as well.

btm commented 7 years ago

Thanks for digging into this! Unfortunately it looks like we have minimal unit tests already, so we can forgo adding those for this fix. A couple comments to look at and then this will be good.

jittipa commented 7 years ago

Installed the RPM for this new push-jobs client on two nodes running RHEL 6. When the push-jobs server was simply restarted (chef-server-ctl restart opscode-pushy-server), both nodes checked back in properly, i.e., knife node status showed both nodes as available. Unfortunately, if I kept the server down for a certain period of time and then restarted it, one of the two nodes did not check back in.

nsdavidson commented 7 years ago

Thanks for the feedback @btm! Requested changes made.

@jittipa Hrmm...this fix is specifically for the case of the server being unavailable for more than a few seconds. Would it be possible for you to grab the push-client logs off the node that's not recovering?

jittipa commented 7 years ago

@nsdavidson After almost 1 hour (15:21:51 - 16:17:30), the client finally checked back in and is now available.

2017-09-13_16:17:30.15059 INFO: [jobsserver] Config is now 3586.204102130747 seconds old. Reconfiguring / reloading keys ...

lcc2207 commented 7 years ago

@nsdavidson can you publish a RHEL 7 version or an Amazon Linux version?

nsdavidson commented 7 years ago

@lcc2207 sure thing: https://s3-us-west-2.amazonaws.com/sce-pub/push-jobs-client-2.4.1%2B20170912123217-1.el7.x86_64.rpm

irvingpop commented 7 years ago

Amazon Linux is usually good with the RHEL6 packages

btm commented 7 years ago

And what to do with the message? Retry after a restart? I had looked at something like this earlier as a pattern but got distracted.

http://zguide.zeromq.org/rb:lpclient

nsdavidson commented 7 years ago

@btm @markan I rolled in Mark's changes from PR #144 with some tweaks.

In addition to scheduling a reconfigure when the ZMQ socket times out, I also added that logic to a failed config download. I noticed that if we try to download a config, and it fails after the standard 5 retries, the client would get into that wedged state waiting for its next 3600 second reconfigure.
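The behavior described above can be sketched roughly as follows. This is an illustration only, not the client's actual code: `ReconfigureScheduler`, `download_config`, `trigger_early_reconfigure`, and the 5 second early delay are all hypothetical; only the 5 retries and the 3600 second interval come from the comment.

```ruby
# All names below are hypothetical and are not the push-jobs client's API.
DEFAULT_RECONFIGURE_INTERVAL = 3600 # seconds, the regular interval mentioned above
MAX_CONFIG_RETRIES = 5              # the "standard 5 retries" mentioned above
EARLY_RECONFIGURE_DELAY = 5         # assumed value, for illustration only

# Tracks how long the client waits before its next reconfigure.
class ReconfigureScheduler
  attr_reader :next_reconfigure_in

  def initialize
    @next_reconfigure_in = DEFAULT_RECONFIGURE_INTERVAL
  end

  # Called when the socket times out or the config download gives up.
  def trigger_early_reconfigure
    @next_reconfigure_in = EARLY_RECONFIGURE_DELAY
  end

  def reset
    @next_reconfigure_in = DEFAULT_RECONFIGURE_INTERVAL
  end
end

# Tries the download up to MAX_CONFIG_RETRIES times. If every attempt fails,
# schedules an early reconfigure instead of waiting out the full interval.
def download_config(scheduler, fetcher)
  MAX_CONFIG_RETRIES.times do
    begin
      config = fetcher.call
      scheduler.reset
      return config
    rescue StandardError
      next # try again
    end
  end
  scheduler.trigger_early_reconfigure
  nil
end

attempts = 0
failing_fetcher = lambda do
  attempts += 1
  raise IOError, "server unavailable"
end

scheduler = ReconfigureScheduler.new
failed_result = download_config(scheduler, failing_fetcher)
delay_after_failure = scheduler.next_reconfigure_in

ok_result = download_config(scheduler, -> { { "host" => "example" } })
delay_after_success = scheduler.next_reconfigure_in
```

The point of the sketch is that a failed download shortens the wait before the next attempt, so the client does not sit idle for the full interval.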

I tested this change and it is handling both short periods of the push-job server being offline and a complete server teardown/rebuild consistently.

/cc @irvingpop

irvingpop commented 7 years ago

:shipit: !

chef-boneyard / opscode-pushy-client

Wrap ZMQ request in timeout #143
