SeattleTestbed / nodemanager


Seash upload timeouts #81

Open choksi81 opened 10 years ago

choksi81 commented 10 years ago

Seash sometimes times out on uploads. We've done a number of packet traces on the host running seash, and here is what we've found out:

The errors fall into three categories:

  1. signedcommunicate failed on session_recvmessage with error 'recv() timed out!' usually means the file was fully uploaded, but the node failed to return "Success" in time.
  2. signedcommunicate failed on session_recvmessage with error 'send() timed out!' means the file was not fully transferred.
  3. signedcommunicate failed on session_sendmessage with error 'Socket closed' is addressed in #1009 -- I haven't seen this one on uploads so far, only when browsing for nodes.
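
For anyone sorting through seash logs, the three categories above can be told apart by the error text alone. A minimal sketch in Python 3, assuming the messages appear verbatim as quoted; `classify_upload_error` and its return labels are hypothetical names for illustration, not part of seash or the node manager:

```python
# Hypothetical helper for log triage; not part of seash or the node manager.
def classify_upload_error(message):
    """Map a signedcommunicate error string to one of the three categories."""
    if "recv() timed out" in message:
        # File was probably uploaded; the node's "Success" reply came too late.
        return "recv_timeout"
    if "send() timed out" in message:
        # File was not fully transferred.
        return "send_timeout"
    if "Socket closed" in message:
        # Tracked separately in #1009.
        return "socket_closed"
    return "unknown"


example = ("signedcommunicate failed on session_recvmessage "
           "with error 'recv() timed out!'")
print(classify_upload_error(example))  # prints "recv_timeout"
```

The distinction matters when deciding what to do next: after a recv() timeout the file is likely already on the node, while after a send() timeout it is not.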

The recv() issue was tackled in #971 and thought to be solved by speeding up the crypto parts of the communication between node manager and seash (parts of which remain to be improved, see #990). The packet traces show a typical string of events leading to this error:

I've also seen "show files" fail with a recv() timeout, but this is rare. Surprisingly, the timeout is 10 seconds there.

The send() issue manifests like this:

Both recv() and send() issues happen even if I increase seash's timeout (see #892). I tried 90 seconds, but would have had to use 300 or so for the slowest nodes.
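
One way to experiment with larger timeouts without penalizing fast nodes is to retry with progressively longer limits. This is only a sketch in Python 3 under stated assumptions: `upload` is a hypothetical callable that takes a timeout in seconds and raises `TimeoutError` on failure (seash's real upload call and exception types differ), and the default values are simply the 90 and 300 seconds mentioned above:

```python
# Sketch only: `upload` is a hypothetical stand-in for the real seash upload
# call, assumed to raise TimeoutError when recv() or send() times out.
def upload_with_escalating_timeout(upload, timeouts=(90, 300)):
    """Try `upload(timeout)` with each timeout in turn; return the first result."""
    if not timeouts:
        raise ValueError("at least one timeout is required")
    last_error = None
    for timeout in timeouts:
        try:
            return upload(timeout)
        except TimeoutError as err:
            # Remember the failure and move on to the next, longer timeout.
            last_error = err
    # Every attempt timed out; surface the last error to the caller.
    raise last_error
```

Note that retrying after a recv() timeout may re-send a file that already arrived, so this addresses the symptom rather than the slow nodes themselves.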

I checked with CoMoN -- all of the nodes that produced errors had a high load average, and those with the highest load averages had the most persistent errors. This might hint at how to reproduce the problem on a node where we can locally trace node manager packets.

choksi81 commented 10 years ago

Attachments: https://github.com/SeattleTestbed/attic/blob/master/TICKET_ATTACHMENTS/seash_set_maxthreads.diff