Derecho-Project / derecho

The main code repository for the Derecho project.
BSD 3-Clause "New" or "Revised" License

Implement a write-ahead log instead of a write-behind log #74

Open sagarjha opened 5 years ago

sagarjha commented 5 years ago

For a Derecho subgroup in persistence mode (defined by the presence of a persistent field in the subgroup class), our delivery + persistence layer breaks under a total failure. The persistence layer currently implements write-behind logging: we first deliver the message, which at that point exists only in memory, and only then persist the changes. If a message is delivered on some or all of the nodes but persisted on only some of them just before a total failure occurs, a total restart can miss that message completely. This happens if we restart only the nodes that delivered the message but did not persist it (the group can still be adequately provisioned), because the logs of the restarting nodes, from which state recovery is performed, contain no record of that message.

We want to fix this by inverting the order of delivery and persistence: persist locally as soon as the message is received, and only then run our stability and delivery protocols. Delivery then carries the strong guarantee that the logs of all subgroup members contain the new state. If a message is persisted locally but delivery does not go through before a total failure, the persistence layer should be smart enough to detect that during recovery and not replay it (which it probably already does). Where all of this matters is in exposing delivered state to the outside world, specifically the reply we generate for an ordered_query, which is sent back to the issuer upon delivery.
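A minimal sketch of the two orderings, with illustrative stand-in helpers (persist_to_local_log, deliver_upcall, and all_logs_contain are names invented here, not Derecho's actual functions):

```cpp
// Sketch only; the stubs stand in for the real persistence and delivery paths.
#include <cstdint>
#include <cstdio>

struct Message { uint64_t seq; /* payload omitted */ };

// Illustrative stubs.
void persist_to_local_log(const Message& m) { std::printf("persist %llu\n", (unsigned long long)m.seq); }
void deliver_upcall(const Message& m)       { std::printf("deliver %llu\n", (unsigned long long)m.seq); }
bool all_logs_contain(uint64_t /*seq*/)     { return true; }  // stand-in for the stability protocol

// Current (write-behind): deliver first, persist afterwards. A total failure
// between the two steps can erase a delivered message from every surviving log.
void on_receive_write_behind(const Message& m) {
    deliver_upcall(m);         // the application already observes the update...
    persist_to_local_log(m);   // ...but a crash here leaves no trace of it on disk
}

// Proposed (write-ahead): persist on receipt, deliver only once the stability
// protocol shows the message is in every member's log.
void on_receive_write_ahead(const Message& m) {
    persist_to_local_log(m);   // the update survives a total failure from here on
    if (all_logs_contain(m.seq)) {
        deliver_upcall(m);     // delivery now implies all members' logs hold the new state
    }
}

int main() { on_receive_write_ahead({1}); }
```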

sagarjha commented 5 years ago

To be clear, I am using delivery to mean atomic multicast delivery. If you think of delivery, in the persistence case, as meaning that the state has been persisted safely on all or a quorum of nodes, then you have the right understanding. But our implementation does not currently provide that guarantee! It will once this issue is fixed, although this is only a problem in the case of a total failure.

KenBirman commented 5 years ago

No, not a quorum. It must be on all nodes.

KenBirman commented 5 years ago

Another potential risk: our current (buggy) implementation could lose an update whether or not the sender thinks the update was persisted. The reason is that our current logic may fail to reissue an update that becomes stable in the atomic multicast sense but then, due to a crash, fails to become persistently stable and is rolled back by the ragged trim logic.

The corrected code would do the event handler upcall as soon as the message arrives, then persist the new object version or delta; hence write-ahead. Only then would the nreceived counter (which for this case really acts as an nreceived_and_persisted counter) be incremented. The stability aggregate then works as required, detecting persistent stability.

The required code changes should be quite small. In a group with persistent state (a Paxos group), we would do delivery when we have all the messages needed for local ordering, but without waiting for stability. Weijia's logic will increment nreceived once the persistence logic knows the update has been safely logged. The rest of our code remains unchanged: basically, we modify one line of existing code, and Weijia adds a few lines of logic to the version-vector subsystem.
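A hedged sketch of the receive path just described, using illustrative names only (the real change would live in Derecho's stability aggregate and in the version-vector subsystem, neither of which is shown):

```cpp
// Sketch only; event_handler_upcall and persist_version are stand-ins.
#include <atomic>
#include <cstdint>
#include <cstdio>

struct Update { uint64_t version; /* new object version or delta, omitted */ };

// Illustrative stubs for the real subsystems.
void event_handler_upcall(const Update& u) { std::printf("apply v%llu\n", (unsigned long long)u.version); }
void persist_version(const Update& u)      { std::printf("log v%llu\n", (unsigned long long)u.version); }

// Per-sender counter; under write-ahead it effectively means
// "received AND persisted", so the unchanged stability aggregate
// now detects *persistent* stability.
std::atomic<uint64_t> nreceived_and_persisted{0};

void on_locally_ordered(const Update& u) {
    event_handler_upcall(u);   // upcall as soon as local ordering permits
    persist_version(u);        // write-ahead: durably log the version or delta
    nreceived_and_persisted.store(u.version);  // only now advance the counter
}

int main() { on_locally_ordered({1}); }
```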

The futures mechanism happens to wait for stability, so it would then be impossible to see the results of an ordered_query until the data is stable, and hence the Paxos guarantees hold.
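For context, a usage sketch loosely modeled on Derecho's RPC interface from around this era; the header path, type names, ordered_query, RPC_NAME, and the reply-map shape are assumptions drawn from this discussion rather than a verified API, and CachedObject, put, and process_reply are hypothetical application names:

```cpp
// Hedged sketch; treat the Derecho identifiers below as assumptions.
// #include <derecho/core/derecho.hpp>  // assumed header layout

void issue_update(derecho::Group<CachedObject>& group,
                  const std::string& key, const std::string& value) {
    derecho::Replicated<CachedObject>& handle =
            group.get_subgroup<CachedObject>();

    // ordered_query returns a bundle of reply futures, one per shard member.
    auto results = handle.ordered_query<RPC_NAME(put)>(key, value);

    // The futures resolve only at delivery. With write-ahead logging,
    // delivery implies the update already sits in every member's log,
    // so the reply can never expose state that a total failure could lose.
    for (auto& [node_id, reply_future] : results.get()) {
        process_reply(node_id, reply_future.get());
    }
}
```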

Provoking this bug right now might be very difficult. It comes down to a race condition: you would probably need to persist extremely large objects, fail one member of a shard at precisely the right instant (while it is updating the log), and then, on recovery, restart from the failed member rather than from the other shard member(s).

The TOCS submission is correct... the issue was with the implementation, not the protocol logic...

sagarjha commented 5 years ago

> No, not a quorum. It must be on all nodes.

Yeah, I wasn't talking about quorums in the context of Derecho. Just in general.