TimelyDataflow / timely-dataflow

A modular implementation of timely dataflow in Rust
MIT License

Single element stream #549

Closed. LennardEc closed this issue 8 months ago

LennardEc commented 8 months ago

Hey, I encountered a situation in which a stream is guaranteed to have exactly one element per timepoint. I was wondering whether there would be a performance benefit to having a dedicated stream type that is fixed to one element at the framework level. Best regards

frankmcsherry commented 8 months ago

Hello! I'm sure that could result in improved performance, but it probably would clash with this framework (which is aimed at supporting parallel workers who are uncertain about the amount of remaining work associated with each timestamp).

There are moments in differential dataflow where we use a similar property: each worker receives at most one batch for an interval of time. I'm not sure if the same caveats apply for you, but there (1) the property holds per worker, so across all workers there are multiple records (as many as there are workers), and (2) there is at most one batch for each worker, not exactly one. In these cases, we just update the operator logic to move forward in response to both progress tracking information (e.g. that there will be zero records for some interval of time) and to receiving records.

There is some ongoing work on "containers" which might be relevant here, in that rather than a Vec<D> you could use an Option<D> to send data around. However, I suspect you wouldn't notice the cost amidst the background costs of general-purpose progress tracking, which is meant to accommodate arbitrary counts for each time.
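To make the Vec<D> versus Option<D> idea concrete, here is a small sketch of downstream code written against an abstract payload rather than a concrete Vec. The `Payload` trait and `sum_payload` function are hypothetical names invented for this illustration; they are not timely's container abstraction, whose actual shape may differ.

```rust
/// Hypothetical stand-in for a message container (not timely's trait).
pub trait Payload {
    type Item;
    /// Number of records currently held.
    fn len(&self) -> usize;
    /// Hands each record to `f`, leaving the payload empty.
    fn for_each_drained<F: FnMut(Self::Item)>(&mut self, f: F);
}

/// The general case: any number of records per message.
impl<D> Payload for Vec<D> {
    type Item = D;
    fn len(&self) -> usize {
        Vec::len(self)
    }
    fn for_each_drained<F: FnMut(D)>(&mut self, f: F) {
        self.drain(..).for_each(f);
    }
}

/// The single-element case: zero or one record, with no heap allocation.
impl<D> Payload for Option<D> {
    type Item = D;
    fn len(&self) -> usize {
        usize::from(self.is_some())
    }
    fn for_each_drained<F: FnMut(D)>(&mut self, mut f: F) {
        if let Some(item) = self.take() {
            f(item);
        }
    }
}

/// Example consumer that works unchanged with either payload type.
pub fn sum_payload<P: Payload<Item = u64>>(payload: &mut P) -> u64 {
    let mut total = 0;
    payload.for_each_drained(|x| total += x);
    total
}
```

Calling `sum_payload` on `vec![1, 2, 3]` and on `Some(7)` goes through the same code path; only the storage behind the payload changes.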

I hope this helps! If you can see that the variable number of records is at the heart of a performance problem, let me know!