gchq / stroom

Stroom is a highly scalable data storage, processing and analysis platform.
https://gchq.github.io/stroom-docs/
Apache License 2.0

Per event processing #2124

Open stroomdev66 opened 3 years ago

stroomdev66 commented 3 years ago

This would be a big change to the way data is processed, switching from batches of events (aka streams) to single events. It may allow for more even scaling of processing, rather than having some very large streams inundate a single node. It would require a change to the way data splitter works, breaking it into event splitting and field splitting. It would also require thought as to how reprocessing would be achieved, i.e. how collections of individual events would be selected for processing. It may also present performance issues if there is overhead in the XML processing of each event.
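To make the proposed separation concrete, here is a minimal sketch (in Java, since Stroom is a Java codebase) of what splitting data splitter into two concerns might look like. The `EventSplitter` and `FieldSplitter` names are illustrative assumptions, not existing Stroom types:

```java
import java.util.Map;
import java.util.stream.Stream;

/** Illustrative only: breaks raw source data into individual events. */
interface EventSplitter {
    Stream<String> splitEvents(String rawSourceData);
}

/** Illustrative only: parses the fields of one already-isolated event. */
interface FieldSplitter {
    Map<String, String> splitFields(String event);
}
```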

at055612 commented 7 months ago

We have the option of initially implementing a half-way house: still have streams to batch up a set of events and to act as the unit of work for pipeline tasks, but once a stream is split into events, the pipeline processing of that stream happens entirely at the event level. This is a much smaller change than full per-event processing with all its reprocessing complications.
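A rough sketch of that half-way house, reusing the hypothetical `EventSplitter` above and assuming an equally hypothetical `Pipeline` interface (neither is Stroom's actual API):

```java
interface Pipeline {
    void process(String event);
}

class HalfWayHouseProcessor {
    /** A pipeline task still claims one whole stream as its unit of work... */
    static void processStream(String rawStreamData,
                              EventSplitter splitter,
                              Pipeline pipeline) {
        // ...but once the stream is split, processing is per event.
        splitter.splitEvents(rawStreamData).forEach(pipeline::process);
    }
}
```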

Cooked events could be stored in an event store to allow random access to events, rather than concatenating them into a single stream. They would be keyed on streamId|eventId so that events could be (re-)processed as a stream-like batch. This would need thought on how to compress them at the event level, e.g. FastInfoset with a dictionary. Each event could include metadata about its provenance, e.g. the source stream ID, the byte/char offset range in the un-split source stream, and the pipeline ID/version that produced it.
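One possible shape for the key and stored record, purely as an illustration of the streamId|eventId keying and provenance metadata described above (field names are assumptions):

```java
/** Illustrative key: streamId|eventId, as described above. */
record EventKey(long streamId, long eventId) {
    @Override
    public String toString() {
        return streamId + "|" + eventId;
    }
}

/** Illustrative stored form of a cooked event, including provenance metadata. */
record StoredEvent(EventKey key,
                   byte[] compressedPayload,   // e.g. FastInfoset with a shared dictionary
                   long sourceStreamId,        // the un-split source stream
                   long sourceOffsetStart,     // byte/char offset range in the source
                   long sourceOffsetEnd,
                   String producingPipelineIdAndVersion) {
}
```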

We would need something that is the opposite of the event splitter to combine events together for file appending or HTTP forwarding, i.e. for XML events, removing any wrapper elements from the individual events and adding them to the concatenated output.
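As a crude string-based illustration of such an "event combiner" (a real implementation would presumably operate on the SAX event stream instead, and the `<event>` wrapper element name is an assumption):

```java
import java.util.List;

/** Illustrative inverse of the event splitter: unwrap events and concatenate them. */
class EventCombiner {
    static String combine(List<String> events, String outerElement) {
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(outerElement).append('>');
        for (String event : events) {
            // Strip the assumed per-event <event>...</event> wrapper, keeping the body.
            String body = event
                    .replaceFirst("^\\s*<event[^>]*>", "")
                    .replaceFirst("</event>\\s*$", "");
            sb.append(body);
        }
        return sb.append("</").append(outerElement).append('>').toString();
    }
}
```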

Per-event processing within the pipeline would potentially make the stepper a lot simpler, as it would be dealing with clearly defined events.

Separating event splitting from field splitting would make data splitters simpler: the field splitter is only concerned with parsing a single 'event', and the event splitters would be much more generic and re-usable, e.g. one to split text on \n.
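For example, the generic \n splitter could be as small as this (implementing the hypothetical `EventSplitter` sketched earlier):

```java
import java.util.stream.Stream;

/** Illustrative generic splitter: one event per line of text. */
class NewlineEventSplitter implements EventSplitter {
    @Override
    public Stream<String> splitEvents(String rawSourceData) {
        return rawSourceData.lines()              // split on line terminators
                .filter(line -> !line.isBlank()); // skip blank lines between events
    }
}
```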