altairsix / eventsource

Serverless Go event sourcing library built on top of dynamodb

Questions about streams #8

Open delaneyj opened 7 years ago

delaneyj commented 7 years ago

@savaki this is more just a question of how you deal with streams using an AWS pipeline. Watching your talk, the overall structure makes sense, but the details are a little fuzzy for someone not engrossed in the AWS stack.

With event stores I've used before, you'd have features such as indexing, like in EventStore, which allow for viewing all events across the system, per aggregate type, or via your own projections using linkTo(streamId, event, metadata).

I know event sourcing and CQRS are orthogonal but how do you deal with the query/projection side after events are written out to Dynamo?

savaki commented 7 years ago

That's a great observation. It's something that I'm fleshing out now as more and more of this goes into production. There are a couple of approaches we've been working through and since we use dynamodb as our store, I'll describe them from that perspective.

**Option #1: Firehose + S3** The simplest option we've experimented with is attaching Kinesis Firehose to the dynamodb stream. This enables content to be pushed to S3, Redshift, and Elasticsearch. For Redshift and Elasticsearch, you'll almost certainly need a lambda transformation to massage the shape into something more appropriate for each. Internally, we use S3 for replay and Elasticsearch for query, so this works out great for us.
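To make that concrete, here's a rough sketch of what such a transformation lambda could look like in Go using the aws-lambda-go event types; the flattened shape, field names, and attribute names are placeholders, not anything we actually run.

    package main

    import (
        "context"
        "encoding/json"

        "github.com/aws/aws-lambda-go/events"
        "github.com/aws/aws-lambda-go/lambda"
    )

    // flattened is a placeholder for whatever shape suits Redshift/Elasticsearch
    // better than the raw stream payload.
    type flattened struct {
        AggregateID string `json:"aggregate_id"`
        EventType   string `json:"event_type"`
    }

    func handle(ctx context.Context, in events.KinesisFirehoseEvent) (events.KinesisFirehoseResponse, error) {
        var out events.KinesisFirehoseResponse
        for _, r := range in.Records {
            // Each record's Data is assumed to be a JSON blob pushed off the
            // dynamodb stream; reshape it and hand it back to Firehose.
            var raw map[string]interface{}
            if err := json.Unmarshal(r.Data, &raw); err != nil {
                out.Records = append(out.Records, events.KinesisFirehoseResponseRecord{
                    RecordID: r.RecordID,
                    Result:   events.KinesisFirehoseTransformedStateProcessingFailed,
                })
                continue
            }
            id, _ := raw["aggregateId"].(string) // attribute names are assumptions
            typ, _ := raw["eventType"].(string)
            data, _ := json.Marshal(flattened{AggregateID: id, EventType: typ})
            out.Records = append(out.Records, events.KinesisFirehoseResponseRecord{
                RecordID: r.RecordID,
                Result:   events.KinesisFirehoseTransformedStateOk,
                Data:     data,
            })
        }
        return out, nil
    }

    func main() { lambda.Start(handle) }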

For custom streams, which I think is similar to what EventStore does, we attach lambda functions directly to the dynamodb tables. This provides visibility into all events within a single bounded context.

**Option #2: Kafka + S3** We save events into dynamodb. Using dynamodb streams with lambda, events are pushed into a kafka topic associated with the dynamodb table. Kafka is only intended as ephemeral storage; the intent is for S3 to be the long-term storage. To replay a stream, one would read the events from S3 until they catch up to what's held in Kafka, and then switch to Kafka. My expectation is that we'd make our tool available here.
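For illustration only, the lambda between the dynamodb stream and Kafka could be as small as the sketch below; the broker address, topic name, and the "data" attribute are all assumptions, and Shopify/sarama is just one possible client.

    package main

    import (
        "context"

        "github.com/Shopify/sarama"
        "github.com/aws/aws-lambda-go/events"
        "github.com/aws/aws-lambda-go/lambda"
    )

    var producer sarama.SyncProducer

    func init() {
        cfg := sarama.NewConfig()
        cfg.Producer.Return.Successes = true // required for SyncProducer
        var err error
        // Broker address is a placeholder.
        producer, err = sarama.NewSyncProducer([]string{"kafka:9092"}, cfg)
        if err != nil {
            panic(err)
        }
    }

    func handle(ctx context.Context, e events.DynamoDBEvent) error {
        for _, record := range e.Records {
            if record.EventName != "INSERT" {
                continue // the event table is append-only, so only inserts matter
            }
            // "data" is a hypothetical binary attribute holding the serialized event.
            payload := record.Change.NewImage["data"].Binary()
            _, _, err := producer.SendMessage(&sarama.ProducerMessage{
                Topic: "events-A", // one topic per dynamodb table
                Value: sarama.ByteEncoder(payload),
            })
            if err != nil {
                return err // returning an error lets lambda retry the batch
            }
        }
        return nil
    }

    func main() { lambda.Start(handle) }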

**Option #3: Nats Streaming + S3** Almost exactly the same as the Kafka solution, except with Nats Streaming instead of Kafka. The main pros for Nats Streaming are its simplicity, including the ability to create subjects (topics) on the fly.

**Using S3** One consistent theme across all the options is the use of S3 as long-term storage. S3 has the benefit of being both easy to understand and well integrated into the AWS ecosystem. And for immutable data, S3 provides consistent read-after-write guarantees, which are very useful.

It's easier, methinks, if I describe this via an example. Let's suppose we have 3 bounded contexts (A, B, and C) feeding data into S3. And for simplicity, let's suppose all the data feeds into a single S3 bucket, events.

Let's further assume the pattern for saving events is a chronological directory structure like:

s3://{bucket}/{bounded-context}/{year}/{month}/{day}/{hour}-{unique-identifier}

e.g. for an event generated today

s3://events/A/2017/07/18/09-892783abcd-2988d.gz
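(As an aside, a tiny helper for producing keys in that layout might look like the following; it's illustrative only, and the unique identifier is left to the caller.)

    package events

    import (
        "fmt"
        "time"
    )

    // s3Key builds the chronological key described above; the bucket name is not
    // part of the key itself.
    func s3Key(boundedContext, uniqueID string, t time.Time) string {
        return fmt.Sprintf("%s/%04d/%02d/%02d/%02d-%s.gz",
            boundedContext, t.Year(), int(t.Month()), t.Day(), t.Hour(), uniqueID)
    }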

Each file would contain 1 or more events in sequential order. Ok, with that huge given, let's go through a few use cases.

Case - Simple aggregator using a single bounded context

Goal: Create a projection, Sum, off of the bounded context, A, that aggregates the value of the X field within the event stream.

Here we can create a lambda function that triggers off inserts under the S3 prefix s3://events/A/..., reads the new file from A, and writes into whatever bucket Sum uses. e.g.

s3://events/A/2017/07/18/09-892783abcd-2988d.gz

triggers a lambda function which in turn generates a new file,

s3://aggregates/Sum/2017/07/18/09-892783abcd-2988d.gz
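Sketching that lambda in Go (with the bucket names from the example, an assumed one-JSON-event-per-line file format, and a made-up x field):

    package main

    import (
        "bytes"
        "compress/gzip"
        "context"
        "encoding/json"

        "github.com/aws/aws-lambda-go/events"
        "github.com/aws/aws-lambda-go/lambda"
        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3"
    )

    var client = s3.New(session.Must(session.NewSession()))

    // event is the assumed shape of a single record in the file; only X matters here.
    type event struct {
        X float64 `json:"x"`
    }

    func handle(ctx context.Context, e events.S3Event) error {
        for _, r := range e.Records {
            // Read the newly written object from s3://events/A/...
            obj, err := client.GetObjectWithContext(ctx, &s3.GetObjectInput{
                Bucket: aws.String(r.S3.Bucket.Name),
                Key:    aws.String(r.S3.Object.Key),
            })
            if err != nil {
                return err
            }
            gz, err := gzip.NewReader(obj.Body)
            if err != nil {
                return err
            }

            // Sum the X field across every event in the file.
            var sum float64
            dec := json.NewDecoder(gz)
            for dec.More() {
                var ev event
                if err := dec.Decode(&ev); err != nil {
                    return err
                }
                sum += ev.X
            }
            obj.Body.Close()

            // Write the projection into the aggregates bucket, reusing the source key.
            body, _ := json.Marshal(map[string]float64{"sum": sum})
            _, err = client.PutObjectWithContext(ctx, &s3.PutObjectInput{
                Bucket: aws.String("aggregates"),
                Key:    aws.String("Sum/" + r.S3.Object.Key),
                Body:   bytes.NewReader(body),
            })
            if err != nil {
                return err
            }
        }
        return nil
    }

    func main() { lambda.Start(handle) }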

savaki commented 7 years ago

I think it's probably worthwhile to say a little about motivations for this project.

**Serverless** One strong motivation for me is serverless. For me, that's less about lambda and more about operational responsibility. I'll maintain operational responsibility for the overall system, but I want the responsibility for individual servers (upgrades, security patches, recovery after failure) to be a service provider's responsibility, not mine.

I looked over EventStore before starting this and decided on this approach for a couple of reasons. We don't have any Windows Server expertise in-house, nor would we want to develop any, so using the Windows version is out. On the unix side, I see there's an Ubuntu 14.04 version that uses Mono, so already I'm a little nervous. I'm surprised there's not also an Ubuntu 16.04 version, and I have some concerns about Mono. For me, it read like "and yes, it also runs on Linux", but as a second-class citizen. Don't get me wrong, I think EventStore is a great product, but it's not quite what we're looking for.

**Simplicity** Of course, everyone wants their system to be simple and developer-friendly, but I would also like the internals of the system to be as easy to reason about as possible. So if something were to go wrong, I'd like the recovery process to be clear, because how the system works is clear.

For example, the goal in using S3 the way we are is to turn many operations into the unix equivalent of:

cat {input} | {lambda} > {output}
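Or, sketched as Go (illustrative only), every step is ideally just a stateless transform from one stream to another:

    package pipeline

    import (
        "encoding/json"
        "io"
    )

    // transform is the shape every step aspires to: read events in, write derived
    // output, with no hidden state in between (cat in | lambda > out).
    func transform(in io.Reader, out io.Writer) error {
        dec := json.NewDecoder(in)
        enc := json.NewEncoder(out)
        for dec.More() {
            var ev map[string]interface{} // event shape deliberately left open
            if err := dec.Decode(&ev); err != nil {
                return err
            }
            // A real projection would reshape or aggregate here instead of echoing.
            if err := enc.Encode(ev); err != nil {
                return err
            }
        }
        return nil
    }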

**Ecosystem** The last issue is really around ecosystem. I have no doubt EventStore will stand the test of time, so I'm not worried about that. It's more the belief that if I can offload events as quickly as possible into a more robust ecosystem like AWS or Kafka, the tooling and support will rapidly grow to provide even more functionality when I'm ready for it. For example, Kafka recently announced support for exactly-once delivery and windowing via the Kafka Streams API.

My goal is to provide a minimal Golang library for event sourcing and then to forward those events into best of breed technologies.

delaneyj commented 7 years ago

I was using EventStore as an example of an API for querying events and using projections, not actually recommending it for this use case. For example, it wasn't clear from the talk how you play catch-up if the lambda or stream goes offline for some reason and misses events (or perhaps wasn't set up prior to events coming in). One other thing that was interesting about EventStore's approach was the use of ETags to make events indefinitely cacheable.

I really am interested in 'surrendering to the cloud', but with options to move providers. Maybe the Repository interface needs some query methods exposed, so in the original case it's pulling from S3/Kafka/etc., or directly from fs/raft in local situations. For my use case I have been considering forking and wrapping it in gRPC for the streaming aspect.

Thanks for the response and great talk!

savaki commented 7 years ago

Just so I understand what you're thinking: currently, the system loads events from a Store, which can be one of dynamodb, mysql, or postgresql. Is your thinking that you'd also like to be able to rebuild individual aggregates from kafka/s3/fs/etc? Also, how would you see gRPC fitting in? Trying to understand your use case(s).

delaneyj commented 7 years ago

brace for rambling...

I have a few use cases actually.

  1. A regulation-constrained application at work, where data has to be local to the regional office
  2. Personal stuff

So in most event sourcing libraries I've seen, the repository has the ability to serve events to anyone in a read-only fashion. It appears you are relying on the AWS/GCloud glue to be able to replay events at runtime. Stuff that seems to be missing in this implementation, roughly:

    // Field types below are my guesses; I'm just sketching the shape.
    type Event struct {
        AggregateID   string
        AggregateType string
        At            time.Time
        ID            uint64 // global event number
        Data          []byte
        Checksum      string
    }

    func allEvents() []*Event
    func allAggregateEvents(aggregateType string) []*Event

Normally you'd have to poll the server to get the latest events. However, with gRPC you can support server-side streams. That means a listener can make a single request (which carries info like aggregate type, lastKnownEventNumber, etc.) and the server is able to play catch-up for the client, then keep feeding new events on the same connection.
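A rough sketch of what I'm picturing, with all the names made up (none of this exists in eventsource); the generated gRPC stream's Send is stood in for by a small interface:

    package stream

    import "context"

    // Event is roughly the shape sketched above, trimmed to what matters here.
    type Event struct {
        AggregateType string
        Number        uint64 // global event number
        Data          []byte
    }

    // SubscribeRequest is the single request a client sends.
    type SubscribeRequest struct {
        AggregateType        string
        LastKnownEventNumber uint64
    }

    // sender stands in for the Send method of a generated gRPC server stream.
    type sender interface {
        Send(*Event) error
    }

    // subscribe replays history past the client's last known event number, then
    // keeps the same connection open and forwards live events as they arrive.
    func subscribe(ctx context.Context, req SubscribeRequest, history []Event, live <-chan Event, out sender) error {
        for i := range history {
            ev := history[i]
            if ev.AggregateType != req.AggregateType || ev.Number <= req.LastKnownEventNumber {
                continue
            }
            if err := out.Send(&ev); err != nil {
                return err
            }
        }
        for {
            select {
            case <-ctx.Done():
                return ctx.Err()
            case ev, ok := <-live:
                if !ok {
                    return nil
                }
                if ev.AggregateType != req.AggregateType {
                    continue
                }
                if err := out.Send(&ev); err != nil {
                    return err
                }
            }
        }
    }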

In my case I don't have the equivalent of Firehose, and the devops benefits of a single binary mean I want to enhance the Repository part to be able to stream events to anyone, since they are immutable. The way it currently approaches playback makes total sense in terms of 'giving up to the cloud', but makes it harder to use both in other cloud environments and locally.

delaneyj commented 7 years ago

I missed quite a bit of detail. Since it wasn't covered in the talk or the simple example, I didn't realize that there was a StreamReader interface that does what I was thinking. I just need to wrap it in some gRPC goodness to get 90% of the way there. I updated my pull request to include the same tests used by the MySQL backend, but modified to use Bolt.

andrzejsliwa commented 7 years ago

What about rebuilding read models based on multiple aggregates? How are you coordinating the "order" of events when you keep them in bulk form? Let's say I need to rebuild my read store: how would I read all events (from multiple aggregates) in the proper order (global apply order)?

delaneyj commented 7 years ago

Look at the StreamReader: you get back all events in global order across aggregates. However, if you have multiple aggregate types it breaks down. The stuff I'm considering doing in gRPC would allow for global/aggregate-type/uuid-level queries from the event bus side. The way eventsource is set up right now, it relies on cloud services like Firehose to provide that functionality.

andrzejsliwa commented 7 years ago

@delaneyj the problem is that the DynamoDB store does not implement this interface (and I forgot to mention that I'm asking about the DynamoDB store).

delaneyj commented 7 years ago

Right, from the talk they are relying on other parts of AWS to handle the event-bus-related functionality. It's not a complete CQRS-style framework/lib, just the basics of event sourcing.

savaki commented 7 years ago

Yes. For now, this is just a pure event sourcing core. Internally, we do CQRS on top of it, but the tools we're using are currently too tied in with our stuff and under heavy development so I didn't think it'd be appropriate to open source them. But in a few months after things have died down, the plan is to make them available.

One question for both of you, @andrzejsliwa and @delaneyj: what scenarios do you see requiring a global ordering guarantee rather than an aggregate-local ordering guarantee? Currently, the mysql and postgresql implementations provide a globally unique ordering guarantee, but the dynamodb implementation does not.

delaneyj commented 7 years ago

@savaki The global order matters mostly for stuff like long-running processes (sagas) in my case. That, and being able to reconstruct the state back to any point in time with accuracy. In my use case, a regulatory audit log is paramount.

savaki commented 7 years ago

Both the mysql and postgresql implementations provide globally unique ordering via the StreamReader. The offset in the StreamReader is global, and I think it's what you'd be looking for.

savaki commented 7 years ago

@andr

bundgaard commented 3 years ago

Hello @savaki, how are things looking in 2021 with the CQRS stuff you mentioned in this issue? Is it possible for you guys to release it to the world, or to describe how you went about it in our Golang world?