It is possible for devices to get out of sync after one has been ‘left behind’ in a rebasing

drewmccormack commented 10 years ago

Scenario is this:

1) Device 1 is offline for a long time.
2) Device 2 does a rebase, and leaves Device 1 behind. That is, the new baseline has a global count greater than the last event from Device 1.
3) Device 1 does some offline changes.
4) Device 1 comes back online. It uploads its recent changes.
5) Device 1 does a merge, and sees it is left behind. It does a complete rebuild. It leaves out the offline changes it had made, since they precede the new baseline.
6) Device 2 receives the offline changes made by Device 1 and applies them.

Result is that Device 1 does not have the offline changes, and Device 2 does.

The best approach would keep the changes from the left-behind device, but that may not be easy to achieve.

An easier, but less satisfactory, solution would be to make sure that no events preceding the baseline are ever applied. Perhaps we need to remove such events before the integration, or something like that. That way neither device would get the offline changes, but at least they would be in sync.
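
The simpler option above can be sketched roughly as follows. This is only an illustration, not Ensembles code: the `Event` type, the `events_safe_to_apply` function, and the sample numbers are all hypothetical, and it assumes that "precedes the baseline" means a global count strictly lower than the baseline's global count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical stand-in for a stored change event.
    device: str
    global_count: int
    description: str


def events_safe_to_apply(events, baseline_global_count):
    """Drop every event that precedes the baseline (assumed: lower global count)."""
    return [e for e in events if e.global_count >= baseline_global_count]


# Device 2 has rebased to a baseline with global count 10 (made-up value).
baseline_global_count = 10

# Events arriving at Device 2, including Device 1's late offline changes.
incoming = [
    Event("device1", 7, "offline edit A"),
    Event("device1", 8, "offline edit B"),
    Event("device2", 11, "edit after rebase"),
]

# Filter before integration, so the stale offline edits are never applied.
applied = events_safe_to_apply(incoming, baseline_global_count)
print([e.description for e in applied])
```

With a filter like this on every device, the offline edits are lost everywhere, which matches the trade-off described above: the data is dropped, but the devices stay in sync.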

ylin commented 10 years ago

For pending issues like this, do you expect the fix to apply transparently to instances of Ensembles that are already in production? Is there an expected timeframe for solving this particular issue? It sounds like a rare case, but it could be problematic for some users.

drewmccormack commented 10 years ago

I always try to make fixes backward compatible.

I think I know how this arose now, and will likely fix it early next week.

Drew

drewmccormack commented 10 years ago

After a lot of investigation, it is not clear what could cause this problem. It is not likely what we originally thought, and could just be caused by an interrupted save, or similar.

The rebasing criteria have been changed so that rebases happen much less often. The system where this bug appeared was carrying out a rebase on virtually every save, which is far too frequent.

drewmccormack commented 10 years ago

This could have been caused by a race condition between the rebaser and the save monitor. I will be adding locks to make the save atomic.
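
As a rough picture of what "adding locks to make the save atomic" could look like, here is a minimal sketch. It is not the Ensembles implementation: the `Store` class and its `save` and `rebase` methods are hypothetical names, and the only point is that a save and a rebase share one lock, so neither can observe the other half-finished.

```python
import threading


class Store:
    """Hypothetical event store where save and rebase exclude each other."""

    def __init__(self):
        self._lock = threading.Lock()
        self.events = []          # list of (global_count, description)
        self.baseline_count = 0   # global count covered by the baseline

    def save(self, description):
        # The whole save runs under the lock, so a rebase cannot
        # slip in between computing the count and storing the event.
        with self._lock:
            counts = [self.baseline_count] + [c for c, _ in self.events]
            count = max(counts) + 1
            self.events.append((count, description))
            return count

    def rebase(self):
        # The rebase takes the same lock before folding events into the baseline.
        with self._lock:
            if self.events:
                self.baseline_count = max(c for c, _ in self.events)
                self.events.clear()


store = Store()
assigned = []  # global counts handed out by save()


def saver(name):
    for i in range(50):
        assigned.append(store.save(f"{name} change {i}"))


def rebaser():
    for _ in range(100):
        store.rebase()


threads = [threading.Thread(target=saver, args=(f"saver{n}",)) for n in range(4)]
threads.append(threading.Thread(target=rebaser))
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every save received a distinct global count, however the threads interleaved.
print(len(assigned), len(set(assigned)))
```

Using a single lock for both operations is the simplest choice; it serializes saves and rebases completely, which seems acceptable if rebases are infrequent.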

drewmccormack / ensembles

It is possible for devices to get out of sync after one has been ‘left behind’ in a rebasing #128