akka / akka-persistence-jdbc

Asynchronously writes journal and snapshot entries to configured JDBC databases so that Akka Actors can recover state
https://doc.akka.io/docs/akka-persistence-jdbc/

Support for querying by a set of tags #651

Open PerWiklander opened 2 years ago

PerWiklander commented 2 years ago

Short description

It would be nice to be able to query on more than one tag at a time.

Let's say I'm creating a projection that is interested in events tagged with "apple", "banana" and "orange". I want to see them together in the order they were persisted.

Details

The proposed API is

readJournal.eventsByTags(
  Set(
    "apple",
    "banana",
    "orange",
  ),
  0L
)

and

readJournal.currentEventsByTags(
  Set(
    "apple",
    "banana",
    "orange",
  ),
  0L
)

Alternatively, please inform me on how I am totally wrong and how a projection would never need to handle events from multiple sources :-)

octonato commented 2 years ago

I guess you want to consume events from different kinds of entities with a single projection in order to build an aggregated view. Is that correct?

I can also imagine that you want an OR, not an AND. Correct? If you need an AND, I would simply use a single distinct tag instead of three.
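The difference between the two readings can be modeled in plain Scala. This is only a sketch over an in-memory list: `TaggedEvent`, `byAnyTag` and `byAllTags` are hypothetical names, not part of any Akka API, and treating the offset as exclusive is an assumption of this model.

```scala
object TagQueryModel {
  // Hypothetical stand-in for a journal row: a global ordering number,
  // a payload, and the tags attached when the event was persisted.
  final case class TaggedEvent(ordering: Long, payload: String, tags: Set[String])

  // OR semantics: keep events carrying at least one of the requested tags.
  // Assumption: the offset is exclusive.
  def byAnyTag(journal: Seq[TaggedEvent], wanted: Set[String], offset: Long): Seq[TaggedEvent] =
    journal
      .filter(e => e.ordering > offset && e.tags.exists(wanted.contains))
      .sortBy(_.ordering)

  // AND semantics: keep only events carrying every requested tag.
  def byAllTags(journal: Seq[TaggedEvent], wanted: Set[String], offset: Long): Seq[TaggedEvent] =
    journal
      .filter(e => e.ordering > offset && wanted.subsetOf(e.tags))
      .sortBy(_.ordering)
}
```

With OR semantics, the result interleaves apple, banana and orange events in persisted order, which is what the proposed eventsByTags would have to return.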

But back to the idea of having eventsByTags and currentEventsByTags, I think that we don't have it for historical reasons. I may be wrong, but I guess the plugin API evolved from the Cassandra plugin, and such a query would mean that we could potentially have to query many Cassandra nodes.

Ultimately, that could be added to the API and we could decide that not all plugin implementations are required to implement it. Actually, that's the case already. We have the APIs and plugins implement them according to the capacity of the underlying database. However, that would represent a not so trivial effort, and it is probably not worth it if there are other means to achieve the same thing.

I'm not sure about your specific use case, but if I inferred it correctly, you have different kinds of entities all producing events with different tags and you want a projection capable of consuming all of them.

If that's the case, they all belong to the same application and I don't think it would be a bad design if they shared the same tag. At the end of the day, tags are here to allow us to build read-side views. If I needed to build a Fruit Basket View with different fruits, then I would simply tag all my fruits with a fruit-basket tag. For example:
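A minimal sketch of such tagging, written as a plain Scala function. The event types and the `tagsFor` name are hypothetical and chosen for illustration only.

```scala
object FruitTagging {
  // Hypothetical event types, one per kind of fruit entity.
  sealed trait FruitEvent
  final case class ApplePicked(id: String) extends FruitEvent
  final case class BananaPicked(id: String) extends FruitEvent
  final case class OrangePicked(id: String) extends FruitEvent

  val BasketTag = "fruit-basket"

  // Each event keeps its own fruit tag and additionally gets the shared tag.
  def tagsFor(event: FruitEvent): Set[String] = event match {
    case _: ApplePicked  => Set("apple", BasketTag)
    case _: BananaPicked => Set("banana", BasketTag)
    case _: OrangePicked => Set("orange", BasketTag)
  }
}
```

A function of this shape (event to set of tags) is what `EventSourcedBehavior.withTagger` accepts in Akka Persistence Typed.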

You could consume events from each fruit independently, but also consume all the fruit-basket events. Would that make sense for your use case?

PerWiklander commented 2 years ago

The FruitBasket is indeed my intended use case. It would work to do as in your example. The problem lies in the fact that projections come and go while the entities are more stable (at least in our application). Adding new tags for any potential combination of events that a new projection can be interested in is less stable. On top of that, old events would need to be retagged accordingly, for example when there are already 10 000 events before the FruitBasket concept is conceived.

Then there's the "need to know" factor. My fruits do not need to know they are part of an imagined basket.

octonato commented 2 years ago

I understand your concern about entities not needing to know about each other.

On the other hand, if they need to be consumed preserving the ordering, then it sounds to me that the transactional boundary is not in the right place. It looks like, if you were using CRUD, you would have some joins, and the different tables would have been updated in the same transaction. But because of event sourcing and the split transaction boundary, the updates now happen in different transactions, and we need to reconstruct the ordering afterwards.

Or maybe the ordering is not that important and the requirement could be relaxed? If you can relax it, you could have three projections writing to the same view. The view will be eventually consistent. It will have to accumulate data, but not showing until some point where the view would become 'presentable'.
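The relaxed variant can be sketched as a view row with one slot per projection. This is a plain Scala model under stated assumptions: `Basket`, `Update` and `applyUpdate` are hypothetical names, and "presentable" is taken to mean that every projection has contributed at least once.

```scala
object BasketView {
  // Hypothetical view row with one slot per source projection.
  final case class Basket(
      apple: Option[String] = None,
      banana: Option[String] = None,
      orange: Option[String] = None) {
    // Hide the row until every projection has contributed its part.
    def presentable: Boolean =
      apple.isDefined && banana.isDefined && orange.isDefined
  }

  sealed trait Update
  final case class AppleUpdate(value: String) extends Update
  final case class BananaUpdate(value: String) extends Update
  final case class OrangeUpdate(value: String) extends Update

  // Each projection only touches its own slot, so the arrival order
  // across projections does not change the final row.
  def applyUpdate(basket: Basket, update: Update): Basket = update match {
    case AppleUpdate(v)  => basket.copy(apple = Some(v))
    case BananaUpdate(v) => basket.copy(banana = Some(v))
    case OrangeUpdate(v) => basket.copy(orange = Some(v))
  }
}
```

Because the three projections write to disjoint slots, the ordering between them stops mattering; only the ordering within each tag is still needed.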

Back to your idea of introducing search-by-tags, I think this will be non-trivial and maybe it will introduce a feature that's more of an edge case.

If you want to have such a query, I think the first option would be to try to get something alongside the Akka APIs that query your journal in the way you need. Like an alternative Query plugin. Then of course, once you have a working solution, we could discuss if this is something to be integrated back into Akka or not. WDYT?

PerWiklander commented 2 years ago

Thanks for the clarification. Our actions going forwards will be:

  1. Try to do

     "Or maybe the ordering is not that important and the requirement could be relaxed? If you can relax it, you could have three projections writing to the same view. The view will be eventually consistent. It will have to accumulate data, but not showing until some point where the view would become 'presentable'."

     I think this will work if we just analyse the transactions in more detail and find out what ACTUALLY happens. That usually does the trick :-)

  2. If we really need to, we will

     "try to get something alongside the Akka APIs that query your journal in the way you need. Like an alternative Query plugin."

     And of course that would be communicated back here, or to Akka proper.

IMHO this issue can be closed.

patriknw commented 2 years ago

@PerWiklander I'm just curious, what database do you use?

PerWiklander commented 2 years ago

For the projections? Postgres at the moment, but we will probably use Neo4j for most future views. There are also some purely in-memory views, for cases where the number of events per entity is low and there is no need to see the data more than a few minutes after it happened.

patriknw commented 2 years ago

If you use Postgres for the write side as well, I would recommend that you take a look at https://doc.akka.io/docs/akka-persistence-r2dbc/current/index.html

There we take a different approach to how to read events for Projections. Instead of tags we use something we call slices. You can read more about that in the documentation. It doesn't solve the feature request you asked for here, but the design might become more like a firehose with a stream of all events and then in memory filtering and routing when consuming those to build different views.
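The firehose idea can be sketched in plain Scala as one pass over a stream with in-memory routing to several views. `Envelope`, `View` and `route` are hypothetical names for this model and are not the akka-persistence-r2dbc API.

```scala
object FirehoseRouting {
  // Hypothetical envelope: which kind of entity emitted the event, and the event itself.
  final case class Envelope(entityType: String, persistenceId: String, event: String)

  // A view is a name plus a predicate deciding which envelopes it cares about.
  final case class View(name: String, interestedIn: Envelope => Boolean)

  // Walk the stream once and hand each envelope to every interested view,
  // preserving stream order within each view.
  def route(stream: Seq[Envelope], views: Seq[View]): Map[String, Vector[Envelope]] = {
    val empty = views.map(v => v.name -> Vector.empty[Envelope]).toMap
    stream.foldLeft(empty) { (acc, env) =>
      views.filter(_.interestedIn(env)).foldLeft(acc) { (a, v) =>
        a.updated(v.name, a(v.name) :+ env)
      }
    }
  }
}
```

A new view is then just another predicate over the same stream, with no retagging of old events.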

PerWiklander commented 2 years ago

Oh! So more like getting an event stream and filtering in the client (the code I write) than to have n number of projections poll the database for "has anything happened on tag x since last time I checked?". That sounds... much better :-)

PerWiklander commented 2 years ago

But it looks like I misunderstood. currentEventsBySlices[MyEvent] is still per event type and takes an entityType: String (which looks like a tag), so the performance issue that is solved here is when compared to using tags for sharding, right?

Our use case is so small that we don't see a need for sharding at the moment. On the other hand we have many types of entities so we run many projections that mostly sit idle and poll for "has anything happened on this tag yet?".

PerWiklander commented 2 years ago

One more question, now that we are on the topic of akka-persistence-r2dbc:

"Publish events for lower latency of eventsBySlices ... The events must still be retrieved from the database, but at a lower polling frequency, because delivery of published messages are not guaranteed."

Does this imply that I risk missing events, or that the polling (that still happens but less often) would catch missed published events?

patriknw commented 2 years ago

You would typically have one entityType for each type of entity and that concept is indeed coupled to Cluster Sharding. That said, you can use one single entityType in your system. It's just the first part of the string persistenceId. See entityTypeHint in akka.persistence.typed.PersistenceId.apply.
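As a rough illustration of "the first part of the string persistenceId", here is a plain Scala model. It assumes the default `|` separator; the helper names are hypothetical, and real code would use akka.persistence.typed.PersistenceId rather than string handling.

```scala
object PersistenceIdModel {
  // Assumption: the default separator between entity type hint and entity id.
  val Separator = "|"

  // Build the string form from its two parts.
  def persistenceId(entityTypeHint: String, entityId: String): String =
    s"$entityTypeHint$Separator$entityId"

  // Recover the entity type, if the id has the expected shape.
  def entityTypeOf(pid: String): Option[String] = {
    val i = pid.indexOf(Separator)
    if (i > 0) Some(pid.substring(0, i)) else None
  }
}
```

Using one shared entity type for all fruits would make them all fall under the same eventsBySlices query, at the cost of coupling them in Cluster Sharding terms.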

"Does this imply that I risk missing events, or that the polling (that still happens but less often) would catch missed published events?"

The latter. It will not miss events. Duplicates are automatically filtered out before they are delivered.
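The combination of published events and polling can be modeled as a simple deduplication by persistence id and sequence number. This is only a sketch of the idea, not the plugin's actual implementation; `Env` and `deliverOnce` are hypothetical names.

```scala
object DedupModel {
  // Hypothetical envelope identified by persistence id and sequence number.
  final case class Env(persistenceId: String, seqNr: Long, event: String)

  // Deliver each (persistenceId, seqNr) at most once, in arrival order.
  def deliverOnce(arrivals: Seq[Env]): Vector[Env] = {
    val (_, out) = arrivals.foldLeft((Set.empty[(String, Long)], Vector.empty[Env])) {
      case ((seen, acc), e) =>
        val key = (e.persistenceId, e.seqNr)
        if (seen(key)) (seen, acc) else (seen + key, acc :+ e)
    }
    out
  }
}
```

If the published copy of the third event is lost, the later poll still returns events 1 to 3; events 1 and 2 are dropped as duplicates and event 3 is delivered from the database.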