`retry` keyword does not check for stale objects before retrying.

apl-cornell / fabric

Distributed persistent programming language with secure information flow types

http://www.cs.cornell.edu/projects/fabric/

Other

28 stars 4 forks source link

`retry` keyword does not check for stale objects before retrying. #34

Open tmagrino opened 5 years ago

tmagrino commented 5 years ago

The implementation of the retry keyword can cause problems when working off of stale worker cache data.

For example

atomic {
  if (val.condition()) retry;
}

Will infinitely loop if val.condition() returns true only when state is inconsistent (the collection of cached objects read never existed at the same time on their respective stores in a consistent state).

We should update the transaction loop to check for stale objects before running a user retry or user abort.

andrewcmyers commented 5 years ago

In theory we are supposed to have a transaction timeout capability, so that a long-running transaction like this gets gunned down and at that point the state would be checked.

andrewcmyers commented 5 years ago

I guess what I'm saying is that this problem doesn't seem specific to retry.

tmagrino commented 5 years ago

It's not specific to retry, but the code for performing an explicit retry is special-cased in a way that doesn't exhibit the correct behavior we have with other retry scenarios.

tmagrino commented 5 years ago

For example, in nearly all other exception scenarios we check for stale objects in the loop, like here: https://github.com/apl-cornell/fabric/blob/099e3b0429e7cee060b780f7f3dbe0e29a97f1e3/src/system/fabric/worker/Worker.java#L822

However, the case for RetryException doesn't do this: https://github.com/apl-cornell/fabric/blob/099e3b0429e7cee060b780f7f3dbe0e29a97f1e3/src/system/fabric/worker/Worker.java#L798-L800

It's honestly a really simple fix. However, I think the logic in this loop has gotten a bit hairy and complicated with small changes over time in a way that suggests that there's a cleaner, harder-to-get-wrong rewrite of the transaction loop logic that should be considered.

I'm largely documenting the issue here to come back to later (tomorrow/this weekend) when I'm not debugging something else. :smiley:

andrewcmyers commented 5 years ago

Checking for stale objects is expensive, of course. But I guess retry happens rarely enough that it's not an issue?

tmagrino commented 5 years ago

I'm actually thinking we could do the check asynchronously here and potentially other places where we're going to retry regardless.

tmagrino commented 5 years ago

So yeah, this is like the timer idea we've tossed around for long-running transactions.

andrewcmyers commented 5 years ago

The timer thing is actually in the original SOSP paper but I guess it never got implemented.