cloudstateio / cloudstate

Distributed State Management for Serverless
https://cloudstate.io
Apache License 2.0

Event sourcing – sequence of handling emitted events (JavaScript- vs. Java-Support) #375

Open marcellanz opened 4 years ago

marcellanz commented 4 years ago

The documentation for the Java Event Sourcing Support states that:

Invoking emit will immediately invoke the associated event handler for that event - this both validates that the event can be applied to the current state, as well as updates the state so that subsequent processing in the command handler can use it.

and this seems to be implemented that way.

For JavaScript, the documentation does not define the sequence of emitting events and handling the emitted events. The JavaScript implementation does not seem to follow the same sequence as the Java support, I think. Instead, it collects all emitted events while running the command handler and then, after the command has been handled, applies the collected events as a whole.
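
To make the difference concrete, here is a minimal, self-contained sketch in plain JavaScript (not the actual Cloudstate API; the toy event handler and handler signatures are made up for illustration) of the two orderings:

```js
// A toy shopping-cart event handler: applies an ItemAdded-style event to the state.
const eventHandler = (state, event) => ({
  items: state.items.concat([event.item])
});

// Java-style: each emit runs the event handler right away, so the command
// handler observes the updated state after every emit.
function applyImmediately(state, commandHandler) {
  let current = state;
  const emit = event => {
    current = eventHandler(current, event); // validated/applied immediately
  };
  commandHandler(emit, () => current);
  return current;
}

// JavaScript-style (as observed): emits are only collected while the command
// handler runs, and all events are applied afterwards in one pass.
function collectThenApply(state, commandHandler) {
  const emitted = [];
  const emit = event => emitted.push(event);
  commandHandler(emit, () => state); // handler still sees the original state
  return emitted.reduce(eventHandler, state);
}

// Example command handler that emits twice and inspects the state in between.
const addTwoItems = (emit, getState) => {
  emit({ item: "a" });
  console.log("items after first emit:", getState().items); // ["a"] vs []
  emit({ item: "b" });
};

console.log(applyImmediately({ items: [] }, addTwoItems));  // { items: ["a", "b"] }
console.log(collectThenApply({ items: [] }, addTwoItems));  // { items: ["a", "b"] }
```

The final state is the same in both cases here; the difference shows up in what the command handler can read between emits, and in what has already been applied when a failure happens.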

If I understood it right, I think this is relevant both for the error behaviour and for what the entity state is after a call to emit() in the command handler, or after further calls to emit() within the same handler.

@jroper, I would assume both implementations should have the same behaviour, wdyt?

(I tried to validate this with a unit test for JavaScript. I started, but stopped after a while as it was not that simple for me to do. I think there are no tests like the ones in crdt-handler-test.js. There is js-shopping-cart/test/test.js, but it has a different scope.)

jroper commented 4 years ago

I think we should rewrite the event sourced support for JavaScript. I don't like it. We actually have someone working on a redux-like API at the moment, so let's see what comes of that.

marcellanz commented 4 years ago

I think with #392 we can close this one. wdyt @jroper @pvlugter? Although we should update the documentation about the command handler being able to use the local state in subsequent processing.

marcellanz commented 4 years ago

At the same time, I find losing the ability to use locally changed state for subsequent processing a bit of a drawback; it is a feature lost because of a condition that usually happens rarely.

pvlugter commented 4 years ago

@marcellanz, agree that being able to use the local state after emitting events in the current command handler is useful, and it would be better to retain that. I imagine a common use case is returning the updated state in command replies, rather than returning empty and requiring a subsequent get command.

Agree that the behaviour should be consistent across language support implementations. And we also need to maintain consistent state on failures. But I don't think we should require the language supports to track previous snapshots and events for rollbacks on the emit-then-fail scenario.

Here's an approach we could adopt: events are always applied immediately, so that the updated state is available. If context.fail is called and no events have been emitted yet, then the failure is returned to the client and processing continues; state is consistent (same behaviour as now). If context.fail is called and events have been emitted (and processed locally), then the failure is escalated into a different kind of failure: the failure is returned to the client, but the proxy entity is also restarted, which rolls back the state, and the restart is immediate. We could also return and apply the events emitted before the failure, but I think atomic behaviour for command processing is more natural. Unexpected failures remain as they are now: the proxy entity is stopped (and may be restarted on subsequent commands) and only generic failure messages are returned to the client.

So three kinds of failures:

- client action failure with no events emitted -> failure message sent to client, state unchanged, processing continues
- client action failure with events emitted -> failure message sent to client and restart the proxy entity to roll back the state
- unexpected failure -> generic failure message sent to client and stop the proxy entity
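
A hedged sketch of how a language support might map these onto its reply to the proxy; the reply field names (clientFailure, restart, stop) are illustrative only, not the actual protocol:

```js
// Map a failure kind plus the events emitted so far onto a reply to the proxy.
function replyForFailure(kind, emittedEvents, message) {
  switch (kind) {
    case "client-failure":
      if (emittedEvents.length === 0) {
        // 1. context.fail with no events emitted: plain client failure,
        //    state is unchanged and processing continues.
        return { clientFailure: message };
      }
      // 2. context.fail after emitting events: client failure plus a restart
      //    request, so the proxy rolls the entity back by replaying its journal.
      return { clientFailure: message, restart: true };
    default:
      // 3. Unexpected exception: only a generic message goes to the client,
      //    and the entity instance is stopped.
      return { clientFailure: "Unexpected failure", stop: true };
  }
}
```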

marcellanz commented 4 years ago

@pvlugter sounds good. This way the proxy has all the information about what happened regarding the entity's state, and the freedom to choose what to do then.

With "client action failure with events emitted" would you allow the user support add the probably partly emitted events beside the EventSourcedReply.client_action.failure on EventSourcedReply.events and then let the proxy decide what to do when it finds the events emitted beside the client action failure?

sleipnir commented 4 years ago

What would be the exact consequence of this stop?

unexpected failure -> generic failure message sent to client and stop the proxy entity
pvlugter commented 4 years ago

With "client action failure with events emitted" would you allow the user support add the probably partly emitted events beside the EventSourcedReply.client_action.failure on EventSourcedReply.events and then let the proxy decide what to do when it finds the events emitted beside the client action failure?

We could do that. However, I also think that it's preferable in general to have the protocol clearly express intent and make it hard to do the wrong thing. If events are emitted with a client action failure, then just looking at the protocol interaction, I would expect the events to be persisted (which is currently supported by both the protocol and the proxy implementation), and thinking about it some more, this could actually be both useful to support and semantically clearer.

Maybe it's useful to have examples of emitting multiple events with possible failures, to see what behaviour we would want and expect. For the simple case of either emitting an event or failing the command, the usual approach is to check if the command is valid and emit, or otherwise fail.

If we extended the shopping cart example with a new method AddItems (plural), for adding multiple items at once, then you could imagine this programmed as: iterate through the items and emit an event for each item, but call context.fail for an invalid item. For language supports with an imperative style that throw an exception, this stops processing any subsequent items. The validated events already emitted could then either be applied successfully (where the failure means "processing got to position N before failing") or effectively rolled back by restoring the original state in some way.

In this situation of validating multiple items, you probably prefer not to fail fast but instead to accumulate all the validation errors and return them together, which is something you could build into the user service protocol, or return them combined in one failure message (validation support could be a nice extra feature in language supports). And you could see either applying the valid events while rejecting the invalid ones, or atomically rejecting all events, as possible approaches. If needed, you can always code an atomic approach yourself, as sketched below: first validate and update a temporary state in the command handler, and then decide at the end to actually context.emit all events or context.fail the command. But I'm wondering if this should be built in.
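
A rough sketch of that do-it-yourself atomic approach, using context.emit/context.fail as discussed; the AddItems command, item fields, and state shape are made up for illustration:

```js
// Hypothetical AddItems command handler: validate everything against a
// temporary copy of the state first, then either emit all events or fail once
// with the combined validation errors.
function addItems(command, state, ctx) {
  const errors = [];
  // Work on a temporary copy of the state so nothing is emitted until
  // everything has been validated.
  const temp = { items: [...state.items] };
  const accepted = [];

  for (const item of command.items) {
    if (item.quantity <= 0) {
      errors.push(`Invalid quantity for ${item.productId}`);
    } else if (temp.items.some(i => i.productId === item.productId)) {
      errors.push(`Duplicate item ${item.productId}`);
    } else {
      temp.items.push(item); // update the temporary state only
      accepted.push(item);
    }
  }

  if (errors.length > 0) {
    // Accumulate all validation errors and reject the command as a whole;
    // nothing has been emitted, so there is nothing to roll back.
    return ctx.fail(errors.join("; "));
  }

  // All items are valid, so it is now safe to emit one event per item.
  for (const item of accepted) {
    ctx.emit({ type: "ItemAdded", item });
  }
  return {}; // or return the updated state in the reply
}
```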

So I think we need to clarify: should calling context.emit mean that the event is always emitted by default, even if there's a subsequent context.fail reply? This is supported by the protocol and proxy, although not currently implemented in the language supports. A language support could then provide a context.atomically or similar, which effectively wraps the emits in an atomic transaction. So rather than context.emit assuming atomicity for a command handler, the emits are individually effective unless wrapped in an atomic block. On the protocol side, we allow events with failure replies. And for implementing context.atomically there's a new failure-with-restart flag to reset the state from the proxy, rather than tracking it in the language support. Creating an atomic block just changes whether the reply sends the events or sets the restart flag (see the sketch below).
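
Very roughly, a language support could do something like this; nothing here is an existing API, and failWithRestart is a made-up helper for the assumed failure-with-restart flag:

```js
// Hypothetical atomic wrapper around a block of a command handler.
function atomically(ctx, block) {
  try {
    // ctx.emit inside the block still applies events to the local state
    // immediately, so the handler can keep reading the updated state.
    block(ctx);
  } catch (e) {
    // On failure inside the block, do not send the emitted events in the
    // reply; instead set the (assumed) failure-with-restart flag so the
    // proxy resets the entity state.
    ctx.failWithRestart(e.message); // hypothetical helper, not a real API
  }
}
```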

pvlugter commented 4 years ago

What would be the exact consequence of this stop?

unexpected failure -> generic failure message sent to client and stop the proxy entity

So this is already the current behaviour.

Situation: an unexpected exception during command processing (not context.fail). If this is runtime-fatal, such as out-of-memory, then the user service container should be exited/crashed instead, allowing kubernetes to restart it. Otherwise for non-runtime-fatal exceptions, the language support returns an entity failure. The language support may also directly close the grpc stream for this entity instance, and clear the entity instance state, after returning the entity failure.

Consequences: the proxy sends a generic failure message (to avoid leaking specific details) for the current command, immediately sends failure replies for all outstanding queued commands, closes the event sourced streaming grpc call for this entity instance to clear the state (if not already closed and cleared), and stops the persistent actor for this entity instance. Other entity instances are not affected. A subsequent command for this entity key will restart the entity instance, similar to if the entity instance was passivated rather than crashed.

sleipnir commented 4 years ago

Okay. I thought it was a new behavior. Thanks for the excellent explanation. Thanks @pvlugter

pvlugter commented 4 years ago

I thought it was a new behavior.

The new behaviour would be signalling an entity instance restart on intended failures. The crash/stop will fail all outstanding commands and stop the actor, restarting on a new command; while the restart would only fail the current command, restart the actor (and the entity state) immediately, and could keep processing outstanding commands — it would just be to reset the state.

sleipnir commented 4 years ago

Perfect

jroper commented 4 years ago

There is an alternative: before (or after) handling each command, the support library could snapshot the state (though not necessarily send the snapshot back to the proxy). Then, in the event that command handling fails and events have been emitted, the support library can instantiate a new entity and pass it the snapshot. This would be a lot cheaper than recovering from the journal.
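
Roughly something like this; all names are assumed, this is not actual Cloudstate code:

```js
// Handle one command, keeping an in-memory snapshot for local rollback.
function handleCommand(entity, command) {
  // Immutable in-memory copy of the state; never sent to the proxy,
  // only used to restore the entity on failure.
  const snapshot = structuredClone(entity.state);
  const emitted = [];
  const ctx = {
    emit: event => {
      entity.state = entity.eventHandler(entity.state, event); // apply immediately
      emitted.push(event);
    },
    fail: message => { throw new Error(message); }
  };
  try {
    const reply = entity.commandHandler(command, ctx);
    return { reply, events: emitted }; // events go to the proxy for persistence
  } catch (e) {
    entity.state = snapshot; // roll back locally, no journal replay needed
    return { clientFailure: e.message };
  }
}
```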

marcellanz commented 4 years ago

Discussed this option here too https://github.com/cloudstateio/cloudstate/pull/392#discussion_r463552743

It has the disadvantage that a snapshot is made for every command, no matter if there is a failure, vs. the relatively low impact of additional communication by the proxy in case of a failure (relative to its occurrence).

I agree that it should be clear by intent what happens in case of a failure, so the proxy does not have to guess. With the options @pvlugter describes, the language support can choose between them; local snapshot-restore could be one option, if the language support provides it (although all could). One other point might be: who decides whether a failure is recoverable, or what should happen with partly emitted events? Perhaps the user support, and hence the user, can decide this better. If the user doesn't care or isn't aware of it, they could fail with a chosen option (or an exception does), and the proxy restores the state when no other option is chosen by the user.

jroper commented 4 years ago

It has the disadvantage that a snapshot is made for every command, no matter if there is a failure, vs. the relatively low impact of additional communication by the proxy in case of a failure (relative to its occurrence).

Not a persisted snapshot. The support library would just get a snapshot of the state, and hold it immutably in memory. It wouldn't send the snapshot to the proxy for persistence. The overhead would be tiny, if it's worth even considering at all.

It's not just additional communication with the proxy in case of a failure, it's also communication with the DB: if the proxy resets the state, that will mean a full rehydration of the entity, i.e. loading the most recent snapshot and then loading all the events. If an entity is returning a high throughput of validation errors, this could exert quite a load on the database.

pvlugter commented 4 years ago

One other point might be: who decides whether a failure is recoverable, or what should happen with partly emitted events? Perhaps the user support, and hence the user, can decide this better. If the user doesn't care or isn't aware of it, they could fail with a chosen option (or an exception does), and the proxy restores the state when no other option is chosen by the user.

Yes, letting the user decide is why I suggested having an explicit atomic marker or similar in the user API. So the default would be to always immediately apply any emitted events, and send these back to the proxy for persistence, even if there's a context.fail. If there's an atomic marker, then it will roll back on failure, which could be done using local snapshots or via the proxy entity restart.

pvlugter commented 4 years ago

There is an alternative: before (or after) handling each command, the support library could snapshot the state (though not necessarily send the snapshot back to the proxy). Then, in the event that command handling fails and events have been emitted, the support library can instantiate a new entity and pass it the snapshot. This would be a lot cheaper than recovering from the journal.

Yes, it would be cheaper. Assuming this would use the snapshot and snapshot handler, are they required to be provided? I thought an entity could disable snapshots and not implement these. Although I guess it could only do this if snapshots are enabled, and otherwise fall back to replaying the events by restarting the entity via the proxy. And if we did distinguish atomic commands, then it doesn't need to snapshot by default.

sleipnir commented 4 years ago

As cheap as it may be, an operation on the language support side worries me a lot, because it takes away the uniformity of the expected behavior between different implementations; any measures on the language support side must be very well validated by the TCK. I find the transaction semantics suggested by Peter simpler. Another question: in the example of the list that Peter presented, if I issue a failure for one member of the list, would all subsequent items fail too? But what if this is not the behavior expected by the client? What if I want to skip just that specific item on the list? That is why it is still valid to separate the types of failures on the language support side, to have more types of failure events that tell the proxy exactly and without a doubt what it should do in different situations.

Perhaps Peter was going in that direction when he mentioned support for some kind of validation. I had suggested using Rust's expectation semantics for failures that are only used in the sense of validations.

pvlugter commented 4 years ago

From today's discussion on the contributors' call:

pvlugter commented 3 years ago

Java support is updated to request restarts on failures after emitting events. TCK updated to require events to be applied immediately so that responses contain updated state, and to test the different failure scenarios.

JavaScript still needs to be updated so that events are applied immediately, in some way that allows responses to return the updated state, and to implement the new TCK.

marcellanz commented 3 years ago

Go support does the same as Java support for now, not yet exposing any API to tailor the behaviour. Do we introduce restarts by user supports for other state models? I think CRDT does not yet, and I've seen that the TCK for value entities does not expect the restart flag to be set.

pvlugter commented 3 years ago

Do we introduce restarts by user support for other state models?

Only event sourcing supports the restarts currently. Value entities probably shouldn't need restarts, as it's easy to roll back on failure in the user support. I don't think I've added change+failure tests for CRDTs; should do that in the TCK and also think about whether it's useful to have restart support.