SolidLabResearch / Challenges


Replaying captured data streams #83

Open pbonte opened 1 year ago

pbonte commented 1 year ago

Pitch

As many applications consume data streams, an easy way to replay captured streams is needed for demos, scalability testing, and data recovery. Replaying mimics the real-time behaviour of a data stream, even though the data itself is historical. A replayer is a crucial component to showcase how our solutions process live data streams and how they handle different data rates. The DAHCC dataset will be used as an example dataset to replay.

Desired solution

The replayer should be able to read a number of files and stream out the events described in each file. To facilitate performance and scalability testing, the rates and the number of streams should be configurable.
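A minimal sketch of the core replay loop described above, in Rust (all names here are hypothetical, not taken from an existing implementation): events carry their original capture timestamps, and the replayer sleeps between events so that inter-event gaps are preserved, scaled by a configurable speed factor.

```rust
use std::time::Duration;

/// One captured event: original capture timestamp (milliseconds) and payload.
struct Event {
    timestamp_ms: u64,
    payload: String,
}

/// Replay events in capture order, scaling the inter-event gaps by `speedup`.
/// A `speedup` of 2.0 replays twice as fast as real time; 1.0 is real time.
/// Events are assumed to be sorted by `timestamp_ms`.
fn replay(events: &[Event], speedup: f64, mut emit: impl FnMut(&Event)) {
    let mut prev: Option<u64> = None;
    for ev in events {
        if let Some(p) = prev {
            // Wait the (scaled) time that elapsed between the two captures.
            let gap_ms = (ev.timestamp_ms - p) as f64 / speedup;
            std::thread::sleep(Duration::from_millis(gap_ms as u64));
        }
        emit(ev);
        prev = Some(ev.timestamp_ms);
    }
}
```

Multiple streams could then be obtained by running one such loop per file, each on its own thread, with its own `speedup`.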

Acceptance criteria

The replayer should be a library that allows one to:

Scenarios

This is part of a larger scenario

bjdmeest commented 1 year ago

@s-minoo this feels related to your rmlstreamer benchmark, can you give some pointers?

s-minoo commented 1 year ago

Yes! I wrote a data streamer/replayer in Rust that consumes historical data and replays it with configurable workload characteristics: periodic bursts, constant rate, etc.

If you want to apply it to this challenge, take a look at the two traits of datastreamer-rust:

1. Publisher: responsible for inducing the data stream characteristics (periodic burst, constant rate, etc.)
2. Processor: responsible for parsing the historical data and appending timestamps to the records
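The two-trait split could be sketched as follows. These signatures are illustrative assumptions only, not the actual datastreamer-rust API, which will differ in detail:

```rust
// Hypothetical signatures illustrating the Publisher/Processor split;
// the real datastreamer-rust traits are not reproduced here.

/// Parses a raw historical record and attaches a timestamp to it.
trait Processor {
    /// Returns (timestamp_ms, parsed record).
    fn process(&self, raw: &str) -> (u64, String);
}

/// Decides *when* records are pushed out
/// (constant rate, periodic burst, ...).
trait Publisher {
    fn publish(&mut self, record: String);
}

/// A trivial publisher that just collects records,
/// standing in for a real network or file sink.
struct CollectingPublisher {
    sent: Vec<String>,
}

impl Publisher for CollectingPublisher {
    fn publish(&mut self, record: String) {
        // A real implementation would pace or batch the sends;
        // here we only record them for inspection.
        self.sent.push(record);
    }
}
```

Separating the two concerns means a new workload shape only requires a new `Publisher` implementation, while parsing logic in the `Processor` stays untouched.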

Of course, with the current implementation there are quite a few limitations I can think of right now:

svrstich commented 1 year ago

Hi, I'll certainly have a deeper look into what you've done to see whether it can be aligned with the goals of Challenge 82/83.

svrstich commented 1 year ago

Loading from a pre-configured path is currently supported.

github-actions[bot] commented 1 year ago

Please provide a status update about this challenge. Every ongoing challenge needs at least one status update every 2 weeks. Thanks!

svrstich commented 1 year ago

Loading of large datasets, time-based sorting of measurements in large datasets, and one-step replay are now supported.

pheyvaer commented 1 year ago

@svrstich Great! What is still missing?

svrstich commented 1 year ago

What's still missing? ;-) Bulk replay, tunable bulk-size replay, tunable speed replay, etc.

pheyvaer commented 1 year ago

@svrstich Do you have a complete list? So that we can assess a bit better what still needs to happen?

svrstich commented 1 year ago

Not really, but we'll be limiting it for the time being to two/three features.

svrstich commented 1 year ago

We'll stick with some example parameters for now: step-wise replay and everything at once. Other use-case-specific replay options can be added later on.

svrstich commented 1 year ago

https://gitlab.ilabt.imec.be/svrstich/ldes-in-solid-semantic-observations-replay

pheyvaer commented 1 year ago

@svrstich Is this the solution for the challenge, a pointer, or work in progress?

svrstich commented 1 year ago

First alpha release :-)

pheyvaer commented 1 year ago

@RubenVerborgh Why did you remove "completion: pending" label? Stijn says that it's ready for review.

RubenVerborgh commented 1 year ago

My bad; I misunderstood what was written above!

pheyvaer commented 1 year ago

Ok, no problem! I assigned it to you now for review.
