Snapshot support - Githubissues

krasserm commented 12 years ago

Snapshot storage
Snapshot-based recovery

ahjohannessen commented 11 years ago

Hi Martin, what is the status on this feature?

krasserm commented 11 years ago

Hi Alex, its status is not yet started.

ponythewhite commented 11 years ago

May I ask if you have any thoughts on how to implement it? Make actors support receiving and generating SnapshotMessage(state) which would hold the state?

Also do you plan using same or different stores for storing snapshots as for events? If there was separate journal-file per processor/channel, then it would be possible to act like this when a SnapshotMessage is forwarded to jorunal: close old jorunal, append SnapshotMessage as first message to new journal, continue work.. This would guarantee seamless operation and minimal changes to current implementation.

In case of current situation, when one journal exists per EventsourcingExtension - it would still be possible but all eventsourced's would have to be snapshotted at the same time. The advantage is here that the old journals would be ready to archive, but if you wouldn't want everything to be snapshotted - copy the old (nonsnapshotted) events to the new journal?

Maybe you're thinking about a different approach, I would be glad to help with the implementation if you have any ideas how to do it.

Kind regards,

Jacek

krasserm commented 11 years ago

Hi Jacek,

thanks a lot for sharing your ideas and offering help. I didn't go that far with ideas how to best implement snapshotting but here are some general thoughts:

Snapshotting should be decoupled as far as possible from the current infrastructure. Applications should already be able to do snapshotting as described in this post without any further support from the library.
Snapshots should be stored independently from the journal. This way, an application could add separate nodes running read models that re-create state from replicated messages and do snapshotting without impacting other parts of the application.
It should be possible to do snapshotting on a per processor basis (see also comments on consistent global snapshotting below).
I'd also make snapshotting independent of deletion of journal entries. Applications anyway want to keep the whole event history, so journal entries should only be deleted if they have been successfully moved to an archive. An archive may or may not be independent from a snapshot store.
The EventsourcingExtension could internally run a snapshotter actor that sends TakeSnapshot messages to all registered processors. Processors could then reply with serlialized state + some metadata (like last sequence nr included in the snapshot etc) and the snapshotter stores the snapshots on their behalf.
On registration (prior to or part of recovery), processors could be initialized with an InitFromSnapshot message (or whatever). Replay would then start with the next message following the snapshot. This can also be coordinated by the EventsourcingExtension.

If there was separate journal-file per processor/channel, ...

There should be only a single journal per disk, for optimal write throughput (not the case with SSDs though).

it would still be possible but all eventsourced's would have to be snapshotted at the same time ...

Snapshotting at the same time is not possible since all processors are running concurrently. Consistent snapshotting in distributed systems is far from trivial (and maybe many apps can do without having a global consistent snapshot).

WDYT?

Cheers, Martin

ponythewhite commented 11 years ago

Martin, thank you for your reply.

I understand the arguments against consistent snapshotting - that would require passing around some synchronization messages which would last indefinitely in case of long running jobs. Integration with redundancy + failover would totally lose the nice appeal of this feature.

All your pointed out thoughts regarding snapshotting seem very correct to me, it is a good way to follow. I just have one problem with this approach but I guess there's nothing one can do to solve it: if snapshotting on per-processor basis successfully moves messages to archive, then they are deleted from the journal. As we have one journal per disk, it stores messages from multiple processors/channels. So deleting messages from a single processor would introduce fragmentation to the journal. Aren't you afraid that such behavior would strongly impact journal performance in case of replaying messages?

I am trying to implement ES in a distributed environment where message per second factor is between 50,000 and 3,000,000 + there is a strong imbalance in flow rate per different processors. I am deeply worried that deleting messages from the heaviest loaded processor's journal would degrade performance of the whole system. Due to large message counts snapshotting + archiving is a must. (Archiving through Kafka to Cassandra with batching)

That's why I proposed creation of a new journal during snapshot - to avoid deletion. Are my assumptions real in any way or would that have no impact on performance?

Kind regards,

Jacek

krasserm commented 11 years ago

I just have one problem with this approach but I guess there's nothing one can do to solve it: if snapshotting on per-processor basis successfully moves messages to archive, then they are deleted from the journal. As we have one journal per disk, it stores messages from multiple processors/channels. So deleting messages from a single processor would introduce fragmentation to the journal. Aren't you afraid that such behavior would strongly impact journal performance in case of replaying messages?

As mentioned in my previous answer, snapshotting and archiving should be completely independent of each other. Saving a snapshot does not need to trigger archiving (together with deletion of entries from the journal). A strategy that takes a snapshot every 3 hours and a bulk move of entries older than 24 hours from journal(s) to an archive should work fine. In this case you can remove entries from the journal independent of the processor id. Depending on the journal implementation, deleting entries older than x can be very efficient (such as deleting old ledgers from Bokkkeeper, for example). LevelDB supports bulk operations as well, so deletion of entries up to x shouldn't take too much time but I don't have any concrete numbers at the moment.

Does that make sense to you?

Cheers, Martin

ornicar commented 11 years ago

I'm also looking forward snapshot support.

A million thanks for the work you put in this awesome library, Martin.

krasserm commented 11 years ago

You're most welcome, Thibault. Glad that you like it. Snapshot support should make it into the next release.

krasserm commented 11 years ago

Proposal created.

ahjohannessen commented 11 years ago

I think your work looks very nice :) Looking forward to try it out! :+1:

krasserm commented 11 years ago

Feedback summary:

Hide processorId in SnapshotRequest and SnapshotOffer.snapshot
Several snapshots per processor and selection criteria on ReplayParams (incl. time-based criteria)
Journals must be configurable with serialization strategies (which can be re-used across journals)

eligosource / eventsourced

Snapshot support #8