A snapshot of an aggregate may be corrupted due to eventually consistent history

dmytro-grankin commented 6 years ago

Currently, a corrupted aggregate snapshot can be written to the storage.

The problem happens when an eventually consistent storage is used (e.g. Datastore).

Let's look at the problem using a task creation example. In the normal scenario, the TaskCreated and TaskAssigned events are already stored in the aggregate history. The aggregate is loaded and both events are played. Then the StartTask command is dispatched, and the TaskStarted event is emitted and applied to the aggregate. When the aggregate is stored, the snapshot trigger is reached and a snapshot is stored (see the picture below).

However, after an event is stored in the aggregate history, it may be unavailable to subsequent history reads. In the problem scenario, the TaskCreated and TaskAssigned events are already stored in the aggregate history. The aggregate is loaded, but only the TaskCreated event is played; the TaskAssigned event is not returned by the backward history read due to eventual consistency. The StartTask command is dispatched, and the TaskStarted event is emitted and applied to the aggregate. When the aggregate is stored, the snapshot trigger is reached and a snapshot is stored. Because the TaskAssigned event was not available, and hence was not played before TaskStarted was applied, the task snapshot is missing its assignee (see the picture below).
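To make the two scenarios concrete, here is a minimal sketch in plain Java. It does not use the Spine API: the `Task` class, the string event names, and the hard-coded assignee `"alice"` are hypothetical stand-ins for the aggregate state and its event appliers.

```java
import java.util.Arrays;
import java.util.List;

final class StaleReadDemo {

    /** Hypothetical, simplified task state; not the real aggregate. */
    static final class Task {
        boolean created;
        boolean started;
        String assignee; // Set only when TaskAssigned is played.

        void apply(String event) {
            switch (event) {
                case "TaskCreated":
                    created = true;
                    break;
                case "TaskAssigned":
                    assignee = "alice"; // Assumed assignee, for illustration.
                    break;
                case "TaskStarted":
                    started = true;
                    break;
                default:
                    throw new IllegalArgumentException("Unknown event: " + event);
            }
        }
    }

    /** Plays the given events in order and returns the resulting state. */
    static Task replay(List<String> events) {
        Task task = new Task();
        for (String event : events) {
            task.apply(event);
        }
        return task;
    }

    public static void main(String[] args) {
        // Normal scenario: the history read returns both stored events.
        Task consistent = replay(
                Arrays.asList("TaskCreated", "TaskAssigned", "TaskStarted"));
        // Problem scenario: TaskAssigned is stored but not returned by the read.
        Task stale = replay(Arrays.asList("TaskCreated", "TaskStarted"));

        System.out.println("consistent assignee: " + consistent.assignee);
        System.out.println("stale assignee: " + stale.assignee);
    }
}
```

A snapshot taken from the `stale` state would persist a started task with no assignee, even though TaskAssigned is in the stored history.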

In other words, when the problem happens, the number of played events (excluding the snapshot) does not equal the event count after the last snapshot (AggregateStorage.readEventCountAfterLastSnapshot(...)).
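One possible guard based on this observation is sketched below. It is only an illustration, not the framework's implementation: `SnapshotGuard` and `canStoreSnapshot` are hypothetical names, and the sketch assumes that the stored event count (the value that `readEventCountAfterLastSnapshot(...)` would return) is itself read consistently.

```java
final class SnapshotGuard {

    private SnapshotGuard() {
    }

    /**
     * Tells whether a snapshot may be stored.
     *
     * @param playedEvents
     *         number of history events actually played when loading the
     *         aggregate, excluding the snapshot
     * @param storedEventsAfterLastSnapshot
     *         event count after the last snapshot as reported by the storage
     *         (assumed here to be consistent)
     */
    static boolean canStoreSnapshot(int playedEvents,
                                    int storedEventsAfterLastSnapshot) {
        // A mismatch means some stored events were not visible to the read,
        // so the in-memory state may be incomplete.
        return playedEvents == storedEventsAfterLastSnapshot;
    }

    public static void main(String[] args) {
        // Normal scenario: TaskCreated and TaskAssigned were both played.
        System.out.println(canStoreSnapshot(2, 2));
        // Problem scenario: only TaskCreated was played, but two are stored.
        System.out.println(canStoreSnapshot(1, 2));
    }
}
```

Because the check compares counts rather than positions, it also flags an event missing from the middle of the history. It cannot tell which event is missing, so a caller would have to skip the snapshot or retry the read.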

Also, with Datastore it is possible that an event from the middle of an aggregate history is unavailable, so the fix should take this into account.

The framework should either avoid storing a wrong snapshot, as in the example above, or otherwise deal with the eventual consistency of an aggregate history.

alexander-yevsyukov commented 4 years ago

Hopefully this is going to be addressed by #1259. @armiol is adjusting the way snapshots are made.

alexander-yevsyukov commented 4 years ago

After some discussion with @armiol, we decided to postpone the fix. We cannot fix it under 1.x without a significant performance penalty. The problem does not manifest often, so delaying the fix is not going to impact the current users of the framework.

SpineEventEngine / core-java

A snapshot of an aggregate may be corrupted due to eventually consistent history #838