Open tmagrino opened 5 years ago
In theory we are supposed to have a transaction timeout capability, so that a long-running transaction like this gets gunned down and at that point the state would be checked.
I guess what I'm saying is that this problem doesn't seem specific to retry.
It's not specific to retry, but the code for performing an explicit retry
is special-cased in a way that doesn't exhibit the correct behavior we have with other retry scenarios.
For example, in nearly all other exception scenarios we check for stale objects in the loop, like here: https://github.com/apl-cornell/fabric/blob/099e3b0429e7cee060b780f7f3dbe0e29a97f1e3/src/system/fabric/worker/Worker.java#L822
However, the case for RetryException doesn't do this: https://github.com/apl-cornell/fabric/blob/099e3b0429e7cee060b780f7f3dbe0e29a97f1e3/src/system/fabric/worker/Worker.java#L798-L800
It's honestly a really simple fix. However, I think the logic in this loop has gotten a bit hairy and complicated with small changes over time in a way that suggests that there's a cleaner, harder-to-get-wrong rewrite of the transaction loop logic that should be considered.
I'm largely documenting the issue here to come back to later (tomorrow/this weekend) when I'm not debugging something else. :smiley:
Checking for stale objects is expensive, of course. But I guess retry happens rarely enough that it's not an issue?
I'm actually thinking we could do the check asynchronously here and potentially other places where we're going to retry regardless.
So yeah, this is like the timer idea we've tossed around for long-running transactions.
The timer thing is actually in the original SOSP paper but I guess it never got implemented.
The implementation of the
retry
keyword can cause problems when working off of stale worker cache data.For example
Will infinitely loop if
val.condition()
returns true only when state is inconsistent (the collection of cached objects read never existed at the same time on their respective stores in a consistent state).We should update the transaction loop to check for stale objects before running a user
retry
or userabort
.