chef-boneyard / opscode-pushy-client

Client API for Pushy
Apache License 2.0
12 stars 14 forks source link

When push-jobs-server is restarted, sometimes push-jobs-client send_string pending without timeout #123

Open jbpark opened 7 years ago

jbpark commented 7 years ago

Description

When push-jobs-server is restarted, Sometime send_string(file : push-jobs-client / protocol_handler.rb, function : send_signed_message) pending without timeout.

Push Jobs Client Version

2.1.4

Platform Version

CentOS 7.2.1511

Replication Case

step 1. Install push job server and push job client. step 2. start push-jobs-server step 3. start push-jobs-client step 4. check node status (available) step 5. restart push-jobs-server check node status (repeat step 5, sometimes the status does not change into available )

Stacktrace

log 1. Push-Jobs-Client send heartbeat (sequence #7) log 2. When push-jobs-server is restarted. incarnation ID changed log 3. Push-Jobs-Client receive heartbeat (sequence #0), Server restart is detected, Reconfiguring after a randomly chosen 9.303089421512142 second delay to avoid storming the server. log 4. Push-Jobs-Client try to send heartbeat (sequence #8) But socket is invalid, send_string pending without timeout

/opt/push-jobs-client/embedded/lib/ruby/gems/2.1.0/gems/opscode-pushy-client-2.1.4/lib/pushy_client/protocol_handler.rb def self.send_signed_message(socket, method, client_private_key, session_key, message) auth = case method when :rsa2048_sha1 make_header_rsa(message, client_private_key) when :hmac_sha256 make_header_hmac(message, session_key) end

  socket.send_string(auth, ZMQ::SNDMORE) ----> pending
  socket.send_string(message)

log 5. trigger reconfigure Thread log 6. Push-Jobs-Client receive heartbeat (sequence #1) log 7. Push-Jobs-Client receive heartbeat (sequence #2)

=================================================================== log 1. INFO: send_signed_json_command/@socket_lock.synchronize do # INFO: self.send_signed_message start INFO: self.send_signed_message/send_string auth INFO: self.send_signed_message/send_string message : {"node":"CEPH-OSD-R03-03_606","client":"CEPH-OSD-R03-03","protocol_version":"2.0","org":"trosadmin-org","type":"heartbeat","sequence":7,"timestamp":"Mon, 27 Feb 2017 08:06:48 GMT","incarnation_id":"b7306b52-d70f-4b55-af9c-6426fbd8199c","job_state":"idle","job_id":null} INFO: self.send_signed_message end INFO: send_signed_json_command/@socket_lock.synchronize do end

log 2. Push-jobs-server is restarted

log 3. INFO: handle_message : is_command=false, header=Version:2.0;SigningMethod:rsa2048_sha1;Signature:pH0yymNlQaIij9l54PFuradc/JMn1oIbJ2wqSDMTXc69OoBBuV0JIMrNprATU6PXmfQhtFL/81Qyg7obuGNBAd0irZIYwtPLjLUtNcjC4/f8+bpUJcpAIYoy79dzt3FgSW/FLkWwpfusUq8HwPHc715Y+t1Qu0Ydtv86fVpzhv0kogL6KJ7nUg5qGFvIs0ckQAKtOunQwXQgtJzVLhK45BH5gZVRjzgTC73znpQ9XQKOZXR2W1mcfLpA09mWDGIm8jeYbDVJm7mjZ8RF/OTwoeZDZswbYK/ogg19bkCSeLu2h64j+fBnQdw4fSMUyf7W/iQmFwTgMOHQnCk/p9d8hA==, message={"sequence":0,"timestamp":"Mon, 27 Feb 2017 08:06:50 GMT","server":"ceph-mon-r03-01","type":"heartbeat","incarnation_id":"bd4ce5c5-ceaf-4fec-9e0e-7458667ba8fc"} INFO: [CEPH-OSD-R03-03_606] Server restart detected (incarnation ID changed from 4571d2e5-39f2-4413-a9be-96130d6900b1 to bd4ce5c5-ceaf-4fec-9e0e-7458667ba8fc). Reconfiguring after a randomly chosen 9.303089421512142 second delay to avoid storming the server ...

log 4. INFO: heartbeat_thread/sleep 10.0 INFO: heartbeat_thread/client.send_heartbeat INFO: send_command/protocol_handler.send_heartbeat INFO: [CEPH-OSD-R03-03_606] Sending heartbeat (sequence #7) INFO: send_signed_json_command/@socket_lock.synchronize do # INFO: self.send_signed_message start INFO: self.send_signed_message/send_string auth -----------> send_string pending without timeout

log 5. INFO: trigger_reconfigure/Thread.new start

log 6. INFO: handle_message : is_command=false, header=Version:2.0;SigningMethod:rsa2048_sha1;Signature:vRXew7Lx/H9yfWYzHYGCG+tQ7tQTOEdKVZV7ZJkqOrqsZ8RhIYICrHM9vhAdbZ6MnvFSlNxgSpBzN67E+/tU+SUMlE2qLoPXS8PYaZMsJqG64uNrX2S7vyJWEPZgiFOMQ+BdVVnZ22KH40qCzgSCXP2jP2lLjz6KSEh4VR3UKq56aADb5UoR2DrYzGEkU+ndYDTNRtxeCv2a8VB2ZGP96MOHIfJm/wq3mdqatP3hCM6M0WWbfZe2AMc536KrdSW+ycqiX42dRX721+QQEEDiqJi6+852I3iEPB575WCaEfxHP6hXIK0Twj0APAe3J2Lg4QwmIIkfu6T7VwZZRig/0Q==, message={"sequence":1,"timestamp":"Mon, 27 Feb 2017 08:07:00 GMT","server":"ceph-mon-r03-01","type":"heartbeat","incarnation_id":"bd4ce5c5-ceaf-4fec-9e0e-7458667ba8fc"}

log 7. INFO: handle_message : is_command=false, header=Version:2.0;SigningMethod:rsa2048_sha1;Signature:dwv2dTO39DVc7fKOIRaNAaPkGOQ8e5YdKoqyBPsy5uGqJ6GrfC9fnp6CaUz4bu4x1c0c4vMb2Kev0BV/NMnfCMFi0LXF7fHKBm0XJ1DKtK2ih+MZWN79sfFeQ4ho0nYxSonmi8I5SNRzx1kVxlAm2MRfigKlB87kPaPEBcGQvWCtQtyCZlwHc5CTIj9mzlf121PkgC4dfm4V/YtKKNFK5s0Wdsf1gDWv0T6D4JNsytZVkbqKsWf0HAibHHHU5JBDmea2GGM7TgrdGApy7uy6MSMaQYnwUHGZbagoP1WXIPqfB61ohj49lgL5lc/7Oaq7Gxmec+HbfcXZQBdLCvTMTg==, message={"sequence":2,"timestamp":"Mon, 27 Feb 2017 08:07:10 GMT","server":"ceph-mon-r03-01","type":"heartbeat","incarnation_id":"bd4ce5c5-ceaf-4fec-9e0e-7458667ba8fc"}

Fix It

vi /opt/push-jobs-client/embedded/lib/ruby/gems/2.1.0/gems/opscode-pushy-client-2.1.4/lib/pushy_client/protocol_handler.rb

require 'timeout'

def self.send_signed_message(socket, method, client_private_key, session_key, message) auth = case method when :rsa2048_sha1 make_header_rsa(message, client_private_key) when :hmac_sha256 make_header_hmac(message, session_key) end

  # add timeout
  begin
      timeout(30) do
          socket.send_string(auth, ZMQ::SNDMORE)
          socket.send_string(message)
      end
  rescue Timeout::Error
      Chef::Log.info("Error : Timed out!")
  end
end

end

nsdavidson commented 7 years ago

@jbpark I ran into this same issue recently, and put in PR #143 to implement the fix. I have a build with the fix implemented if you're still getting bitten by this and are interested in testing it out.