GaloisInc / HaLVM

The Haskell Lightweight Virtual Machine (HaLVM): GHC running on Xen
BSD 3-Clause "New" or "Revised" License

XenStore watches (via xsWatch) are unreliable #39

Closed thumphries closed 9 years ago

thumphries commented 9 years ago

After installing a XenStore watch with a callback, the callback is not invoked for every event.

Extending the examples/Core/ClientRendezvous example to 15 clients should be sufficient to demonstrate the problem. On most runs, I get between 11 and 14 successful rendezvous out of 15. I added a debug print to the server-side (left side) routine in Communication.Rendezvous.clientServerConnection, dumping the key to the console. This amended example is here

The most common scenario: the client proceeds correctly, and the server misses the initial xsMakeDirectory event. The server then receives and ignores the ClientGrants and ClientPorts messages. See, for example, dom559 in the log below, which is lucky enough to be ignored twice (initial mkdir and ClientGrants). This seems to happen only under contention. To reproduce, repeat sudo make run until the Server and some Client domains fail to halt. I enabled Xen console logging to retrieve the audit trail.

XenStore watch fired for /rendezvous/ClientServerTest/dom557/ServerConfirmed
XenStore watch fired for /rendezvous/ClientServerTest/dom558
XenStore watch fired for /rendezvous/ClientServerTest/dom558/ClientGrants
XenStore watch fired for /rendezvous/ClientServerTest/dom558/ClientPorts
Waiting for /rendezvous/ClientServerTest/dom558/ClientGrants
Waiting for /rendezvous/ClientServerTest/dom558/ClientPorts
XenStore watch fired for /rendezvous/ClientServerTest/dom558/ServerConfirmed
XenStore watch fired for /rendezvous/ClientServerTest/dom559/ClientPorts
XenStore watch fired for /rendezvous/ClientServerTest/dom560
Waiting for /rendezvous/ClientServerTest/dom560/ClientGrants
XenStore watch fired for /rendezvous/ClientServerTest/dom560/ClientGrants
Waiting for /rendezvous/ClientServerTest/dom560/ClientPorts
XenStore watch fired for /rendezvous/ClientServerTest/dom560/ClientPorts
XenStore watch fired for /rendezvous/ClientServerTest/dom560/ServerConfirmed
XenStore watch fired for /rendezvous/ClientServerTest/dom561
XenStore watch fired for /rendezvous/ClientServerTest/dom561/ClientGrants
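A log like the one above can be checked mechanically for the "missed mkdir" pattern. The following is a minimal sketch in plain Haskell, not part of HaLVM: the names `firedKeys`, `missedMkdir` and `sampleLog` are made up for illustration, and it assumes the debug print uses the exact "XenStore watch fired for " prefix seen in the log. It reports domains for which a child key event fired but no event for the domain directory itself was seen. The sample omits the dom557 line because the excerpt starts partway through that domain's rendezvous.

```haskell
import Data.List (nub, stripPrefix)
import Data.Maybe (mapMaybe)

-- Prefix printed by the debug statement (assumed from the log excerpt).
firedPrefix :: String
firedPrefix = "XenStore watch fired for "

-- Split a path on '/', dropping empty components.
splitPath :: String -> [String]
splitPath s = case dropWhile (== '/') s of
  "" -> []
  s' -> let (c, rest) = break (== '/') s' in c : splitPath rest

-- Path components, relative to the rendezvous root, of every watch event.
-- Lines that are not watch events (e.g. "Waiting for ...") are skipped.
firedKeys :: [String] -> [String] -> [[String]]
firedKeys root = mapMaybe parse
  where
    parse l = stripPrefix firedPrefix l >>= stripPrefix root . splitPath

-- Domains with at least one child-key event but no event for the
-- domain directory itself, i.e. the initial mkdir was never reported.
missedMkdir :: [String] -> [String] -> [String]
missedMkdir root logLines =
  [ dom | dom <- nub [ d | (d : _ : _) <- keys ], [dom] `notElem` keys ]
  where
    keys = firedKeys root logLines

-- Watch events for dom558 through dom561, copied from the excerpt.
sampleLog :: [String]
sampleLog =
  map ("XenStore watch fired for /rendezvous/ClientServerTest/" ++)
    [ "dom558", "dom558/ClientGrants", "dom558/ClientPorts"
    , "dom558/ServerConfirmed"
    , "dom559/ClientPorts"
    , "dom560", "dom560/ClientGrants", "dom560/ClientPorts"
    , "dom560/ServerConfirmed"
    , "dom561", "dom561/ClientGrants"
    ]
    ++ [ "Waiting for /rendezvous/ClientServerTest/dom560/ClientGrants" ]
```

In GHCi, `missedMkdir ["rendezvous", "ClientServerTest"] sampleLog` evaluates to `["dom559"]`, matching the domain called out above.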

Another scenario involves either client or server calling waitForKey (for grants, ports or ServerConfirmed messages) and the waiting thread never waking up from a threadDelay, but that will be another ticket.

thumphries commented 9 years ago

I've been working on a new version of the Rendezvous protocols that assumes XenStore's unreliability, but that work is blocked on the scheduling problem.
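For readers wondering what "assuming unreliability" could look like, here is one possible shape, offered only as a sketch and not as the protocol referred to above. It treats a watch notification as a hint and re-reads the key on a timer, so a lost event delays the rendezvous instead of stalling it. Everything here is hypothetical: `Store` is an in-memory stand-in rather than the HaLVM XenStore API, and `waitForKeyPolling` and `demoLostEvent` are invented names. Note that it depends on timers firing, which is exactly what the threadDelay problem mentioned earlier would break.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (MVar, newEmptyMVar, takeMVar)
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)
import System.Timeout (timeout)

-- Hypothetical in-memory key/value store standing in for XenStore.
type Store = IORef [(String, String)]

writeKey :: Store -> String -> String -> IO ()
writeKey store k v = atomicModifyIORef' store (\kvs -> ((k, v) : kvs, ()))

readKey :: Store -> String -> IO (Maybe String)
readKey store k = lookup k <$> readIORef store

-- Wait for a key to appear. The MVar is filled by a watch callback when
-- one fires, but correctness never depends on it: the key is re-read
-- after every hint or after intervalUs microseconds, whichever is first.
-- Gives up with Nothing after the given number of retries.
waitForKeyPolling :: Store -> MVar () -> Int -> Int -> String -> IO (Maybe String)
waitForKeyPolling store hint intervalUs attempts key = go attempts
  where
    go n = do
      found <- readKey store key
      case found of
        Just v -> return (Just v)
        Nothing
          | n <= 0 -> return Nothing
          | otherwise -> do
              _ <- timeout intervalUs (takeMVar hint)
              go (n - 1)

-- Simulate a lost watch event: the writer sets the key but the hint is
-- never signalled. The waiter still finds the key on a later poll.
demoLostEvent :: IO (Maybe String)
demoLostEvent = do
  store <- newIORef []
  hint <- newEmptyMVar
  _ <- forkIO $ do
    threadDelay 20000
    writeKey store "dom559/ClientGrants" "granted"
  waitForKeyPolling store hint 5000 100 "dom559/ClientGrants"
```

The trade-off is extra XenStore reads per waiting thread in exchange for not depending on every watch event being delivered.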