When push-jobs-server is restarted, Sometime send_string(file : push-jobs-client / protocol_handler.rb, function : send_signed_message) pending without timeout.
Push Jobs Client Version
2.1.4
Platform Version
CentOS 7.2.1511
Replication Case
step 1. Install push job server and push job client.
step 2. start push-jobs-server
step 3. start push-jobs-client
step 4. check node status (available)
step 5. restart push-jobs-server
check node status (repeat step 5, sometimes the status does not change into available )
Stacktrace
log 1. Push-Jobs-Client send heartbeat (sequence #7)
log 2. When push-jobs-server is restarted.
incarnation ID changed
log 3. Push-Jobs-Client receive heartbeat (sequence #0),
Server restart is detected,
Reconfiguring after a randomly chosen 9.303089421512142 second delay to avoid storming the server.
log 4. Push-Jobs-Client try to send heartbeat (sequence #8)
But socket is invalid, send_string pending without timeout
/opt/push-jobs-client/embedded/lib/ruby/gems/2.1.0/gems/opscode-pushy-client-2.1.4/lib/pushy_client/protocol_handler.rb
def self.send_signed_message(socket, method, client_private_key, session_key, message)
auth = case method
when :rsa2048_sha1
make_header_rsa(message, client_private_key)
when :hmac_sha256
make_header_hmac(message, session_key)
end
===================================================================
log 1.
INFO: send_signed_json_command/@socket_lock.synchronize do #
INFO: self.send_signed_message start
INFO: self.send_signed_message/send_string auth
INFO: self.send_signed_message/send_string message : {"node":"CEPH-OSD-R03-03_606","client":"CEPH-OSD-R03-03","protocol_version":"2.0","org":"trosadmin-org","type":"heartbeat","sequence":7,"timestamp":"Mon, 27 Feb 2017 08:06:48 GMT","incarnation_id":"b7306b52-d70f-4b55-af9c-6426fbd8199c","job_state":"idle","job_id":null}
INFO: self.send_signed_message end
INFO: send_signed_json_command/@socket_lock.synchronize do end
log 2. Push-jobs-server is restarted
log 3.
INFO: handle_message : is_command=false, header=Version:2.0;SigningMethod:rsa2048_sha1;Signature:pH0yymNlQaIij9l54PFuradc/JMn1oIbJ2wqSDMTXc69OoBBuV0JIMrNprATU6PXmfQhtFL/81Qyg7obuGNBAd0irZIYwtPLjLUtNcjC4/f8+bpUJcpAIYoy79dzt3FgSW/FLkWwpfusUq8HwPHc715Y+t1Qu0Ydtv86fVpzhv0kogL6KJ7nUg5qGFvIs0ckQAKtOunQwXQgtJzVLhK45BH5gZVRjzgTC73znpQ9XQKOZXR2W1mcfLpA09mWDGIm8jeYbDVJm7mjZ8RF/OTwoeZDZswbYK/ogg19bkCSeLu2h64j+fBnQdw4fSMUyf7W/iQmFwTgMOHQnCk/p9d8hA==, message={"sequence":0,"timestamp":"Mon, 27 Feb 2017 08:06:50 GMT","server":"ceph-mon-r03-01","type":"heartbeat","incarnation_id":"bd4ce5c5-ceaf-4fec-9e0e-7458667ba8fc"}
INFO: [CEPH-OSD-R03-03_606] Server restart detected (incarnation ID changed from 4571d2e5-39f2-4413-a9be-96130d6900b1 to bd4ce5c5-ceaf-4fec-9e0e-7458667ba8fc). Reconfiguring after a randomly chosen 9.303089421512142 second delay to avoid storming the server ...
vi /opt/push-jobs-client/embedded/lib/ruby/gems/2.1.0/gems/opscode-pushy-client-2.1.4/lib/pushy_client/protocol_handler.rb
require 'timeout'
def self.send_signed_message(socket, method, client_private_key, session_key, message)
auth = case method
when :rsa2048_sha1
make_header_rsa(message, client_private_key)
when :hmac_sha256
make_header_hmac(message, session_key)
end
# add timeout
begin
timeout(30) do
socket.send_string(auth, ZMQ::SNDMORE)
socket.send_string(message)
end
rescue Timeout::Error
Chef::Log.info("Error : Timed out!")
end
end
@jbpark I ran into this same issue recently, and put in PR #143 to implement the fix. I have a build with the fix implemented if you're still getting bitten by this and are interested in testing it out.
Description
When push-jobs-server is restarted, Sometime send_string(file : push-jobs-client / protocol_handler.rb, function : send_signed_message) pending without timeout.
Push Jobs Client Version
2.1.4
Platform Version
CentOS 7.2.1511
Replication Case
step 1. Install push job server and push job client. step 2. start push-jobs-server step 3. start push-jobs-client step 4. check node status (available) step 5. restart push-jobs-server check node status (repeat step 5, sometimes the status does not change into available )
Stacktrace
log 1. Push-Jobs-Client send heartbeat (sequence #7) log 2. When push-jobs-server is restarted. incarnation ID changed log 3. Push-Jobs-Client receive heartbeat (sequence #0), Server restart is detected, Reconfiguring after a randomly chosen 9.303089421512142 second delay to avoid storming the server. log 4. Push-Jobs-Client try to send heartbeat (sequence #8) But socket is invalid, send_string pending without timeout
/opt/push-jobs-client/embedded/lib/ruby/gems/2.1.0/gems/opscode-pushy-client-2.1.4/lib/pushy_client/protocol_handler.rb def self.send_signed_message(socket, method, client_private_key, session_key, message) auth = case method when :rsa2048_sha1 make_header_rsa(message, client_private_key) when :hmac_sha256 make_header_hmac(message, session_key) end
log 5. trigger reconfigure Thread log 6. Push-Jobs-Client receive heartbeat (sequence #1) log 7. Push-Jobs-Client receive heartbeat (sequence #2)
=================================================================== log 1. INFO: send_signed_json_command/@socket_lock.synchronize do #
INFO: self.send_signed_message start
INFO: self.send_signed_message/send_string auth
INFO: self.send_signed_message/send_string message : {"node":"CEPH-OSD-R03-03_606","client":"CEPH-OSD-R03-03","protocol_version":"2.0","org":"trosadmin-org","type":"heartbeat","sequence":7,"timestamp":"Mon, 27 Feb 2017 08:06:48 GMT","incarnation_id":"b7306b52-d70f-4b55-af9c-6426fbd8199c","job_state":"idle","job_id":null}
INFO: self.send_signed_message end
INFO: send_signed_json_command/@socket_lock.synchronize do end
log 2. Push-jobs-server is restarted
log 3. INFO: handle_message : is_command=false, header=Version:2.0;SigningMethod:rsa2048_sha1;Signature:pH0yymNlQaIij9l54PFuradc/JMn1oIbJ2wqSDMTXc69OoBBuV0JIMrNprATU6PXmfQhtFL/81Qyg7obuGNBAd0irZIYwtPLjLUtNcjC4/f8+bpUJcpAIYoy79dzt3FgSW/FLkWwpfusUq8HwPHc715Y+t1Qu0Ydtv86fVpzhv0kogL6KJ7nUg5qGFvIs0ckQAKtOunQwXQgtJzVLhK45BH5gZVRjzgTC73znpQ9XQKOZXR2W1mcfLpA09mWDGIm8jeYbDVJm7mjZ8RF/OTwoeZDZswbYK/ogg19bkCSeLu2h64j+fBnQdw4fSMUyf7W/iQmFwTgMOHQnCk/p9d8hA==, message={"sequence":0,"timestamp":"Mon, 27 Feb 2017 08:06:50 GMT","server":"ceph-mon-r03-01","type":"heartbeat","incarnation_id":"bd4ce5c5-ceaf-4fec-9e0e-7458667ba8fc"} INFO: [CEPH-OSD-R03-03_606] Server restart detected (incarnation ID changed from 4571d2e5-39f2-4413-a9be-96130d6900b1 to bd4ce5c5-ceaf-4fec-9e0e-7458667ba8fc). Reconfiguring after a randomly chosen 9.303089421512142 second delay to avoid storming the server ...
log 4. INFO: heartbeat_thread/sleep 10.0 INFO: heartbeat_thread/client.send_heartbeat INFO: send_command/protocol_handler.send_heartbeat INFO: [CEPH-OSD-R03-03_606] Sending heartbeat (sequence #7) INFO: send_signed_json_command/@socket_lock.synchronize do #
INFO: self.send_signed_message start
INFO: self.send_signed_message/send_string auth -----------> send_string pending without timeout
log 5. INFO: trigger_reconfigure/Thread.new start
log 6. INFO: handle_message : is_command=false, header=Version:2.0;SigningMethod:rsa2048_sha1;Signature:vRXew7Lx/H9yfWYzHYGCG+tQ7tQTOEdKVZV7ZJkqOrqsZ8RhIYICrHM9vhAdbZ6MnvFSlNxgSpBzN67E+/tU+SUMlE2qLoPXS8PYaZMsJqG64uNrX2S7vyJWEPZgiFOMQ+BdVVnZ22KH40qCzgSCXP2jP2lLjz6KSEh4VR3UKq56aADb5UoR2DrYzGEkU+ndYDTNRtxeCv2a8VB2ZGP96MOHIfJm/wq3mdqatP3hCM6M0WWbfZe2AMc536KrdSW+ycqiX42dRX721+QQEEDiqJi6+852I3iEPB575WCaEfxHP6hXIK0Twj0APAe3J2Lg4QwmIIkfu6T7VwZZRig/0Q==, message={"sequence":1,"timestamp":"Mon, 27 Feb 2017 08:07:00 GMT","server":"ceph-mon-r03-01","type":"heartbeat","incarnation_id":"bd4ce5c5-ceaf-4fec-9e0e-7458667ba8fc"}
log 7. INFO: handle_message : is_command=false, header=Version:2.0;SigningMethod:rsa2048_sha1;Signature:dwv2dTO39DVc7fKOIRaNAaPkGOQ8e5YdKoqyBPsy5uGqJ6GrfC9fnp6CaUz4bu4x1c0c4vMb2Kev0BV/NMnfCMFi0LXF7fHKBm0XJ1DKtK2ih+MZWN79sfFeQ4ho0nYxSonmi8I5SNRzx1kVxlAm2MRfigKlB87kPaPEBcGQvWCtQtyCZlwHc5CTIj9mzlf121PkgC4dfm4V/YtKKNFK5s0Wdsf1gDWv0T6D4JNsytZVkbqKsWf0HAibHHHU5JBDmea2GGM7TgrdGApy7uy6MSMaQYnwUHGZbagoP1WXIPqfB61ohj49lgL5lc/7Oaq7Gxmec+HbfcXZQBdLCvTMTg==, message={"sequence":2,"timestamp":"Mon, 27 Feb 2017 08:07:10 GMT","server":"ceph-mon-r03-01","type":"heartbeat","incarnation_id":"bd4ce5c5-ceaf-4fec-9e0e-7458667ba8fc"}
Fix It
vi /opt/push-jobs-client/embedded/lib/ruby/gems/2.1.0/gems/opscode-pushy-client-2.1.4/lib/pushy_client/protocol_handler.rb
require 'timeout'
def self.send_signed_message(socket, method, client_private_key, session_key, message) auth = case method when :rsa2048_sha1 make_header_rsa(message, client_private_key) when :hmac_sha256 make_header_hmac(message, session_key) end
end