GaloisInc / HaLVM

The Haskell Lightweight Virtual Machine (HaLVM): GHC running on Xen
BSD 3-Clause "New" or "Revised" License

Thread failing to return from XenStore transactions #40

Closed: thumphries closed this issue 9 years ago

thumphries commented 9 years ago

Using the threaded runtime, threads occasionally fail to wake up from threadDelay. Edit: Using either runtime, threads occasionally fail to return from XenStore transactions.

The example is similar to that of #39, which is just ClientRendezvous with 15 clients. A small change to the rendezvous protocol to make it robust to #39 exposes the weirdness. Look for calls to the busy-wait waitForKey in Rendezvous.hs (basically defined as xsRead >> threadDelay). Seems like certain threads are starved indefinitely, and only under contention.
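For readers without the gist open, the busy-wait pattern described above can be sketched roughly as follows. This is not the code from Rendezvous.hs: the in-memory `Store`, the `demo` function, the key name, and the retry and delay parameters are all assumptions made for illustration, and a plain `IORef` lookup stands in for xsRead so the sketch runs without Xen.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)

-- Hypothetical stand-in for XenStore: an in-memory association list.
type Store = IORef [(String, String)]

-- Poll a read action until it yields a value, sleeping between attempts.
-- Gives up with Nothing after maxTries attempts.
waitForKey :: Int -> Int -> IO (Maybe a) -> IO (Maybe a)
waitForKey maxTries delayMicros readKey = go maxTries
  where
    go n
      | n <= 0    = return Nothing
      | otherwise = do
          result <- readKey
          case result of
            Just v  -> return (Just v)
            Nothing -> threadDelay delayMicros >> go (n - 1)

-- A writer thread publishes the key after 50 ms; the caller polls every 10 ms.
demo :: IO (Maybe String)
demo = do
  store <- newIORef [] :: IO Store
  _ <- forkIO $ do
    threadDelay 50000
    atomicModifyIORef' store
      (\kvs -> (("dom593/ClientGrants", "[grant:2,grant:3]") : kvs, ()))
  waitForKey 100 10000 (lookup "dom593/ClientGrants" <$> readIORef store)

main :: IO ()
main = demo >>= print
```

In a loop of this shape, a read that never returns looks from the outside exactly like a threadDelay that never wakes up, since in both cases the "Waiting for" messages simply stop.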

Reproduce by running the example until clients fail to halt, and then checking the console for both Server and the dangling Client.

For example, dom593 is left dangling, and the Server log / xenstore-ls show the connection was initiated correctly:

  dom592 = ""
   ServerAcknowledge = "True"
   ClientGrants = "[grant:2,grant:3]"
   ClientPorts = "[echan:5]"
   ServerConfirmed = "True"
  dom593 = ""
   ServerAcknowledge = "True"
   ClientGrants = "[grant:2,grant:3]"
   ClientPorts = "[echan:5]"
  dom594 = ""
   ServerAcknowledge = "True"
   ClientGrants = "[grant:2,grant:3]"
   ClientPorts = "[echan:5]"
   ServerConfirmed = "True"

The Server's console shows it waited for dom593's grants for one iteration, then served a few other clients, and never woke up from that threadDelay:

Waiting for /rendezvous/ClientServerTest/dom590/ClientGrantsWaiting for /rendezvous/ClientServerTest/dom592/ClientGrants

Waiting for /rendezvous/ClientServerTest/dom591/ClientGrants
Waiting for /rendezvous/ClientServerTest/dom590/ClientPortsWaiting for /rendezvous/ClientServerTest/dom592/ClientPortsWaiting for /rendezvous/ClientServerTest/dom591/ClientPorts

Waiting for /rendezvous/ClientServerTest/dom593/ClientGrants
Waiting for /rendezvous/ClientServerTest/dom594/ClientGrants
Waiting for /rendezvous/ClientServerTest/dom594/ClientGrants

Over at dom593's console, it spins for a while, then falls silent. It gave up after 402 attempts, although the domain was still running.

Waiting for /rendezvous/ClientServerTest/ServerDomId
Waiting for /rendezvous/ClientServerTest/dom593/ServerConfirmed
Waiting for /rendezvous/ClientServerTest/dom593/ServerConfirmed
Waiting for /rendezvous/ClientServerTest/dom593/ServerConfirmed
Waiting for /rendezvous/ClientServerTest/dom593/ServerConfirmed
Waiting for /rendezvous/ClientServerTest/dom593/ServerConfirmed
Waiting for /rendezvous/ClientServerTest/dom593/ServerConfirmed

The server thread never wakes up. The same issue has also shown up on the client side. This example uses the threaded runtime for both server and client. I'm running FC20 and Xen 4.3.3.

thumphries commented 9 years ago

Curiously, I'm getting precisely the same behaviour from the non-threaded runtime.

The only difference is in the client that's left dangling - now it fills up my log dir, whereas before it fell silent after a time.

thumphries commented 9 years ago

Updated the gist after getting suspicious that xsRead and xsWrite operations were failing to complete.

Looks like that is the case! XenStore transactions are to blame.

thumphries commented 9 years ago

aaaaaand fixed