SolidLabResearch / Challenges


Replaying captured data streams #83

Open pbonte opened 1 year ago

pbonte commented 1 year ago

Pitch

As many applications consume data streams, an easy way to replay captured streams is needed for demos, scalability testing, and data recovery. Replaying mimics the real-time behaviour of a data stream, even though the data itself is historical. A replayer is a crucial component to showcase how our solutions process live data streams and how they handle different data rates. The DAHCC dataset will be used as an example dataset to replay.

Desired solution

The replayer should be able to read a number of files and stream out the events described in each file. To facilitate performance and scalability testing, the rates and the number of streams should be configurable.
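A minimal sketch of the core replay loop described above, in Rust (all names here are hypothetical, not taken from an existing implementation): events carry their original capture timestamps, and the replayer sleeps between events so that inter-event gaps are preserved, scaled by a configurable speed factor.

```rust
use std::time::Duration;

/// One captured event: original capture timestamp (milliseconds) and payload.
struct Event {
    timestamp_ms: u64,
    payload: String,
}

/// Replay events in capture order, scaling the inter-event gaps by `speedup`.
/// A `speedup` of 2.0 replays twice as fast as real time; 1.0 is real time.
/// Events are assumed to be sorted by `timestamp_ms`.
fn replay(events: &[Event], speedup: f64, mut emit: impl FnMut(&Event)) {
    let mut prev: Option<u64> = None;
    for ev in events {
        if let Some(p) = prev {
            // Wait the (scaled) time that elapsed between the two captures.
            let gap_ms = (ev.timestamp_ms - p) as f64 / speedup;
            std::thread::sleep(Duration::from_millis(gap_ms as u64));
        }
        emit(ev);
        prev = Some(ev.timestamp_ms);
    }
}
```

Multiple streams could then be obtained by running one such loop per file, each on its own thread, with its own `speedup`.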

Acceptance criteria

The replayer should be a library that allows one to:

Scenarios

This is part of a larger scenario

bjdmeest commented 1 year ago

@s-minoo this feels related to your rmlstreamer benchmark, can you give some pointers?

s-minoo commented 1 year ago

Yes! I wrote a data streamer/replayer in Rust that consumes historical data and replays it with configurable workload characteristics: periodic bursts, constant rate, etc.

If you want to apply it to this challenge, take a look at the two traits of datastreamer-rust:

1. Publisher: responsible for inducing the data stream characteristics (periodic burst, constant rate, etc.)
2. Processor: responsible for parsing the historical data and appending timestamps to the records
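The two-trait split could be sketched as follows. These signatures are illustrative assumptions only, not the actual datastreamer-rust API, which will differ in detail:

```rust
// Hypothetical signatures illustrating the Publisher/Processor split;
// the real datastreamer-rust traits are not reproduced here.

/// Parses a raw historical record and attaches a timestamp to it.
trait Processor {
    /// Returns (timestamp_ms, parsed record).
    fn process(&self, raw: &str) -> (u64, String);
}

/// Decides *when* records are pushed out
/// (constant rate, periodic burst, ...).
trait Publisher {
    fn publish(&mut self, record: String);
}

/// A trivial publisher that just collects records,
/// standing in for a real network or file sink.
struct CollectingPublisher {
    sent: Vec<String>,
}

impl Publisher for CollectingPublisher {
    fn publish(&mut self, record: String) {
        // A real implementation would pace or batch the sends;
        // here we only record them for inspection.
        self.sent.push(record);
    }
}
```

Separating the two concerns means a new workload shape only requires a new `Publisher` implementation, while parsing logic in the `Processor` stays untouched.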

Of course, with the current implementation there are quite a few limitations I can think of right now:

svrstich commented 1 year ago

Hi, I'll certainly have a deeper look into what you've done to see whether it can be aligned with the goals of Challenge 82/83.

svrstich commented 1 year ago

Loading from a pre-configured path is currently supported.

github-actions[bot] commented 1 year ago

Please provide a status update about this challenge. Every ongoing challenge needs at least one status update every 2 weeks. Thanks!

svrstich commented 1 year ago

Loading of large datasets, time-based sorting of measurements in large datasets, and one-step replay are now supported.

pheyvaer commented 1 year ago

@svrstich Great! What is still missing?

svrstich commented 1 year ago

What's still missing? ;-) Bulk replay, tunable bulk-size replay, tunable speed replay, etc.

pheyvaer commented 1 year ago

@svrstich Do you have a complete list? So that we can assess a bit better what still needs to happen?

svrstich commented 1 year ago

Not really, but we'll be limiting it for the time being to two/three features.

svrstich commented 1 year ago

We'll stick with some example parameters for now: step-wise replay and everything at once. Other use-case-specific replay options can be added later on.

svrstich commented 1 year ago

https://gitlab.ilabt.imec.be/svrstich/ldes-in-solid-semantic-observations-replay

pheyvaer commented 1 year ago

@svrstich Is this the solution for the challenge, a pointer, or work in progress?

svrstich commented 1 year ago

First alpha release :-)

pheyvaer commented 1 year ago

@RubenVerborgh Why did you remove "completion: pending" label? Stijn says that it's ready for review.

RubenVerborgh commented 1 year ago

My bad; I misunderstood what was written above!

pheyvaer commented 1 year ago

Ok, no problem! I assigned it to you now for review.
