gchq / stroom

Stroom is a highly scalable data storage, processing and analysis platform.
https://gchq.github.io/stroom-docs/
Apache License 2.0

Per event processing #2124

Open stroomdev66 opened 3 years ago

stroomdev66 commented 3 years ago

This would be a big change to the way data is processed, switching from batches of events (aka streams) to single events. It may allow for more even scaling of processing, rather than having some very large streams inundate a single node. It would require a change to the way data splitter works, breaking it into event splitting and field splitting. It would also require thought as to how reprocessing would be achieved, i.e. how collections of individual events would be selected for processing. It may also present performance issues if there is overhead in the XML processing of each event.
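To make the proposed separation concrete, here is a minimal sketch (in Java, since Stroom is a Java codebase) of what splitting data splitter into two concerns might look like. The `EventSplitter` and `FieldSplitter` names are illustrative assumptions, not existing Stroom types:

```java
import java.util.Map;
import java.util.stream.Stream;

/** Illustrative only: breaks raw source data into individual events. */
interface EventSplitter {
    Stream<String> splitEvents(String rawSourceData);
}

/** Illustrative only: parses the fields of one already-isolated event. */
interface FieldSplitter {
    Map<String, String> splitFields(String event);
}
```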

at055612 commented 7 months ago

We have the option of initially implementing a half-way house: still have streams to batch up a set of events and to act as the unit of work for pipeline tasks, but once a stream is split into events, the pipeline processing of that stream happens entirely at the event level. This is a much smaller change than full per-event processing with all its reprocessing complications.
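A rough sketch of that half-way house, reusing the hypothetical `EventSplitter` above and assuming an equally hypothetical `Pipeline` interface (neither is Stroom's actual API):

```java
interface Pipeline {
    void process(String event);
}

class HalfWayHouseProcessor {
    /** A pipeline task still claims one whole stream as its unit of work... */
    static void processStream(String rawStreamData,
                              EventSplitter splitter,
                              Pipeline pipeline) {
        // ...but once the stream is split, processing is per event.
        splitter.splitEvents(rawStreamData).forEach(pipeline::process);
    }
}
```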

Cooked events could be stored in an event store to allow random access to events, rather than concatenating them into a single stream. They would be keyed on streamId|eventId so that events could be (re-)processed as a stream-like batch. This would need thought on how to compress them at the event level, e.g. FastInfoset with a dictionary. Each event could include metadata about its provenance, e.g. the source stream ID, the byte/char offset range in the un-split source stream, and the pipeline ID/version that produced it.
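One possible shape for the key and stored record, purely as an illustration of the streamId|eventId keying and provenance metadata described above (field names are assumptions):

```java
/** Illustrative key: streamId|eventId, as described above. */
record EventKey(long streamId, long eventId) {
    @Override
    public String toString() {
        return streamId + "|" + eventId;
    }
}

/** Illustrative stored form of a cooked event, including provenance metadata. */
record StoredEvent(EventKey key,
                   byte[] compressedPayload,   // e.g. FastInfoset with a shared dictionary
                   long sourceStreamId,        // the un-split source stream
                   long sourceOffsetStart,     // byte/char offset range in the source
                   long sourceOffsetEnd,
                   String producingPipelineIdAndVersion) {
}
```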

We would need something that is the opposite of the event splitter to combine events together for file appending or HTTP forwarding, i.e. for XML events, removing any wrapper elements from the individual events and adding them to the concatenated output.
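As a crude string-based illustration of such an "event combiner" (a real implementation would presumably operate on the SAX event stream instead, and the `<event>` wrapper element name is an assumption):

```java
import java.util.List;

/** Illustrative inverse of the event splitter: unwrap events and concatenate them. */
class EventCombiner {
    static String combine(List<String> events, String outerElement) {
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(outerElement).append('>');
        for (String event : events) {
            // Strip the assumed per-event <event>...</event> wrapper, keeping the body.
            String body = event
                    .replaceFirst("^\\s*<event[^>]*>", "")
                    .replaceFirst("</event>\\s*$", "");
            sb.append(body);
        }
        return sb.append("</").append(outerElement).append('>').toString();
    }
}
```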

Per-event processing within the pipeline would potentially make the stepper a lot simpler, as it would be dealing with clearly defined events.

Separating event splitting from field splitting would make data splitters simpler: the field splitter is only concerned with parsing a single 'event', and the event splitters would be much more generic and re-usable, e.g. one to split text on \n.
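For example, the generic \n splitter could be as small as this (implementing the hypothetical `EventSplitter` sketched earlier):

```java
import java.util.stream.Stream;

/** Illustrative generic splitter: one event per line of text. */
class NewlineEventSplitter implements EventSplitter {
    @Override
    public Stream<String> splitEvents(String rawSourceData) {
        return rawSourceData.lines()              // split on line terminators
                .filter(line -> !line.isBlank()); // skip blank lines between events
    }
}
```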