feldera / dist-design

Design documents for distributed DBSP

Are there classes of pipelines where there is no need to store state #9

Open gz opened 1 year ago

gz commented 1 year ago

https://github.com/feldera/dist-design/blob/85291465e4b4b6c6b2e528df89ef848de3f9199e/README.md?plain=1#L109

I was wondering whether there are classes of pipeline applications where the "State inside the circuit across all the workers" is cheaper to reproduce from the "Output produced by the circuit in a previous step but not yet acknowledged by its destination" than to store as persistent state (what I have in mind is something like group-by aggregates for dashboards), and whether that's something to consider (maybe/probably not).

blp commented 1 year ago

Oh, that's interesting because it's going the wrong direction. Instead of input->state->output, you're talking about output->state. It is not possible in general, since information is usually lost when producing output. I don't know whether there is an interesting class of applications where the state can be derived from the output.
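To make the information-loss point concrete, here is a minimal Python sketch, not DBSP code. It assumes a simplified model in which a per-key SUM emits numeric changes to each key's total, and a per-key AVG keeps `(sum, count)` as state; the names `integrate`, `avg`, and `step` are hypothetical helpers invented for illustration. Under those assumptions, the SUM state can be rebuilt by summing the output changes, while two different AVG states can produce the same output and then diverge on the next input.

```python
from collections import defaultdict


def integrate(deltas):
    """Sum a stream of (key, change) output deltas into the current view."""
    view = defaultdict(int)
    for key, change in deltas:
        view[key] += change
    # Drop keys whose total cancelled out to zero.
    return {k: v for k, v in view.items() if v != 0}


# Case 1: per-key SUM (assumed model). The operator state is the running
# total per key, which is exactly the integrated output.
sum_output_deltas = [("a", 3), ("b", 5), ("a", 4)]
recovered_sum_state = integrate(sum_output_deltas)


# Case 2: per-key AVG (assumed model). The state is (sum, count), but the
# output only carries the ratio, so the state cannot be recovered from it.
def avg(state):
    total, count = state
    return total / count


def step(state, value):
    """Fold one new input value into an AVG state."""
    total, count = state
    return (total + value, count + 1)


state_x = (10, 2)  # average 5.0
state_y = (20, 4)  # average 5.0 as well
same_output = avg(state_x) == avg(state_y)

# The same new input leads to different outputs, so guessing the wrong
# state from the old output would produce wrong results later.
diverges = avg(step(state_x, 6)) != avg(step(state_y, 6))
```

In this toy model, group-by sums or counts feeding a dashboard fall into the first case, which is roughly the class the question has in mind; anything whose output is a lossy function of its state falls into the second.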

Another possibility that has occurred to me is that there are applications where the output only depends on fairly recent input, so that it could be cheaper to reproduce the state by replaying the input, starting some fixed amount of time back, than to save the state. This risks incorrect output, though, if we are wrong about the output only depending on recent input. (Replaying recent input seems like a disaster-recovery response to me, if the state database is lost.)
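A small Python sketch of that replay trade-off, again not DBSP code: it assumes a hypothetical input log of `(timestamp, value)` pairs and a time-windowed sum as the query, and the names `windowed_sum`, `rebuild_by_replay`, and `horizon` are invented for illustration. Replay gives the right answer when the query's window fits inside the replay horizon and silently gives a wrong one when it does not.

```python
def windowed_sum(events, now, window):
    """Sum of values whose timestamp lies in (now - window, now]."""
    return sum(v for t, v in events if now - window < t <= now)


def rebuild_by_replay(input_log, now, horizon):
    """Keep only the input from the last `horizon` time units, as a replay would."""
    return [(t, v) for t, v in input_log if t > now - horizon]


# Hypothetical input log: (timestamp, value) pairs.
input_log = [(1, 10), (5, 20), (8, 30), (9, 40)]
now = 10
horizon = 3  # replay only input with timestamp > 7

replayed = rebuild_by_replay(input_log, now, horizon)

# Window of 3 fits inside the replay horizon: the replayed result is exact.
exact = windowed_sum(input_log, now, 3)
from_replay = windowed_sum(replayed, now, 3)

# Window of 6 reaches back past the horizon: the replayed result misses
# the event at timestamp 5 and is wrong without any visible error.
exact_wide = windowed_sum(input_log, now, 6)
from_replay_wide = windowed_sum(replayed, now, 6)
```

The second case is the risk described above: nothing in the replayed result signals that older input was needed.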