SeattleTestbed / seash

Interactive vessel management tool
MIT License

Seash upload timeouts #63

Open choksi81 opened 10 years ago

choksi81 commented 10 years ago

Seash sometimes times out on uploads. We've done a number of packet traces on the host running seash, and here is what we've found out:

The errors fall into three categories:

  1. signedcommunicate failed on session_recvmessage with error 'recv() timed out!' usually means the file was fully uploaded, but the node failed to return "Success" in time.
  2. signedcommunicate failed on session_recvmessage with error 'send() timed out!' means the file was not fully transferred.
  3. signedcommunicate failed on session_sendmessage with error 'Socket closed' is addressed in #1009 -- I haven't seen this one on uploads so far, only when browsing for nodes.
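
For triaging logs, the three categories above could be told apart with a small helper. This is only a sketch: `classify_upload_error` is a hypothetical name, not part of seash, and it matches on the error strings exactly as quoted in the list.

```python
# Hypothetical helper, not part of seash: maps the error strings quoted
# above to the diagnosis given for each category.
def classify_upload_error(message):
    """Return a short diagnosis for a signedcommunicate error string."""
    if "recv() timed out" in message:
        # Category 1: upload likely complete, "Success" reply came too late.
        return "file probably uploaded; node did not return Success in time"
    if "send() timed out" in message:
        # Category 2: the transfer itself did not finish.
        return "file not fully transferred"
    if "Socket closed" in message:
        # Category 3: tracked separately in #1009.
        return "socket closed; see #1009"
    return "unknown"
```

Anything that matches none of the three strings is reported as "unknown" rather than guessed at.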

The recv() issue was tackled in #971 and thought to be solved by speeding up the crypto parts of the communication between node manager and seash (parts of which remain to be improved, see #990). The packet traces show a typical string of events leading to this error:

I've also seen "show files" fail with a recv() timeout, but this is rare. Surprisingly, the timeout is 10 seconds there.

The send() issue manifests like this:

Both recv() and send() issues happen even if I increase seash's timeout (see #892). I tried 90 seconds, but would have had to use 300 or so for the slowest nodes.

I checked with CoMoN -- all of the nodes that produced errors had a high load average, and those with the highest numbers had the most persistent ones. This might hint at how to reproduce the problem on a node where we can locally trace node manager packets.

choksi81 commented 10 years ago

See also nodemanager's issue #81

choksi81 commented 10 years ago

r4504 seems to solve the "recv() timed out" problem. It opens the connection from seash to the node only after the data to be sent is signed, and thus keeps the connection from idling (which raised the chance of a timeout happening).

The send() issue is hard to get around: the node is too slow to respond in time because it's under heavy load. Load may come from other processes (think PlanetLab?), but also from seash itself, as it could be communicating with multiple vessels on the same node. This is especially problematic for low-end devices. We've tried to reduce the number of worker threads that contact vessels in parallel, and this helps in the case of multiple vessels on one node, but not for single vessels on nodes that are slow in general.

We could improve seash to never contact vessels on one node in parallel by sorting/splitting the list of vessels accordingly, but I don't think this is a common scenario. It would be if more people used customized installers, if SeattleGENI didn't return vessels on different nodes all the time, etc. I suggest we close this ticket -- WONTFIX right now.
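
To illustrate the sorting/splitting idea mentioned in this comment, here is a minimal sketch that splits a vessel list into rounds so that no round contains two vessels on the same node. It assumes vessel names of the form `host:port:vesselname` with the node identified by the `host:port` prefix; `split_into_rounds` is a hypothetical name and not existing seash code.

```python
from collections import defaultdict
from itertools import zip_longest


def split_into_rounds(vessels):
    """Group vessel names into rounds with at most one vessel per node.

    Assumption: each name looks like 'host:port:vesselname', and the
    'host:port' prefix identifies the node.
    """
    by_node = defaultdict(list)
    for name in vessels:
        node = name.rsplit(":", 1)[0]  # strip the trailing vessel name
        by_node[node].append(name)

    rounds = []
    # Take the i-th vessel of every node for round i; nodes with fewer
    # vessels are padded with None, which is filtered out.
    for batch in zip_longest(*by_node.values()):
        rounds.append([v for v in batch if v is not None])
    return rounds
```

Each round could then be contacted in parallel, with rounds processed one after another, so a loaded node only ever serves one request from seash at a time.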

choksi81 commented 10 years ago

Here's a suggestion for a possible workaround: Implement a set maxparallelism command in seash similar to the set uploadrate and set timeout commands of #892 / r4035. This would set the number of parallel worker threads seash spawns, and could thus either speed up operations if you have large groups of distinct nodes, or let you work more slowly on groups where uploads fail due to contention timeouts. Changes are required in source:seattle/trunk/seash/seash_helper.py and its usage of MAX_CONTACT_WORKER_THREAD_COUNT, as well as in source:seattle/trunk/seash/seash_dictionary.py in the section on the set command family.
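
As a rough sketch of what such a setting could control (this is not the code from seash_helper.py or the attached patch), a configurable cap on worker threads might look like the following. The `settings` dictionary and the functions `set_maxparallelism` and `contact_targets` are hypothetical stand-ins, and the sketch uses Python 3's `concurrent.futures` rather than seash's own threading code.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for seash's per-session settings; the default
# value here is arbitrary.
settings = {"maxparallelism": 4}


def set_maxparallelism(value):
    """Validate and store the number of parallel worker threads."""
    count = int(value)
    if count < 1:
        raise ValueError("maxparallelism must be at least 1")
    settings["maxparallelism"] = count


def contact_targets(task, targets):
    """Run task(target) for every target with a bounded thread pool.

    Results are returned in the same order as targets.
    """
    with ThreadPoolExecutor(max_workers=settings["maxparallelism"]) as pool:
        return list(pool.map(task, targets))
```

Setting the value to 1 would serialize all contacts, which is the slow-but-safe mode for groups where uploads fail due to contention; a larger value suits large groups of distinct nodes.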

choksi81 commented 10 years ago

Thanks to Alan's support, one third of the work is already done: Find attached a patch that introduces the set maxthreads command in seash as a last resort. As I said, two thirds of the work still lie ahead: Unit tests, and meaningful documentation for the end user.

choksi81 commented 10 years ago

Are we sure this will fix the problem? Has this been verified?

choksi81 commented 10 years ago

I will check this with my usual set of tools (slow node with multiple vessels, tcpdump), and Alan probably has checked it already. I won't commit before I'm sure this solves the problem. If you want a "more correct" solution instead of a workaround to what I'd consider a rare problem anyway, see my comment above on rearranging the vessel list.

choksi81 commented 10 years ago

Attachments: https://github.com/SeattleTestbed/attic/blob/master/TICKET_ATTACHMENTS/seash_set_maxthreads.diff