gchq / stroom

Stroom is a highly scalable data storage, processing and analysis platform.
https://gchq.github.io/stroom-docs/
Apache License 2.0
431 stars 55 forks source link

Stream to stream joins #4319

Open at055612 opened 3 months ago

at055612 commented 3 months ago

From @p-kimberley :

we've got a use case where we have two separate feeds of data. We'd like to use one feed to enrich the other. Due to the 1:1 relationship between records in each feed, I figure a reference enrichment approach probably isn't the way to go about this, as it's a fairly high-volume feed.

We'd be looking at enriching events in feed A with several attributes from feed B, based on a unique identifier.

Enriching one stream with data from one or more streams in another feed will likely require a cluster wide state store so that Feed B can put the required attributes into a state store map keyed on the shared unique ID. If we can delay processing of Feed A (see #1306 ) to try to ensure the corresponding events in Feed B have been processed, then the Feed A pipeline can do a lookup on the state store to fetch the required values. The stored state could be purged after some time period, i.e. once the joins have been done.

It is possible we could achieve this with small modifications to the existing reference data functionality to fudge the strm ID, but this would mean loading large amounts of data into ref stores on every node. A global state store is preferable.