brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

Don't parallelize flowgraph in ways that rely on stable sorting #1204

Closed by henridf 4 years ago

henridf commented 4 years ago

An ordered merge does not have sufficient information to sort stably: when multiple records with the same sort field value arrive from different upstreams, it cannot know what their original order (*) was.
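To make the problem concrete, here is a minimal sketch (not the project's actual merge code). The `ordered_merge` helper, the `(ts, name)` record layout, and the two ways of splitting the input are all assumptions made for illustration. Ties on the sort field go to the lower-numbered upstream, which is the only information the merge has.

```python
def ordered_merge(upstreams, key):
    """Merge already-sorted upstreams by `key`.

    Hypothetical helper: on equal keys the lower-numbered upstream wins,
    because the merge has no other information to go on.
    """
    heads = [0] * len(upstreams)  # next unread position in each upstream
    out = []
    while True:
        best = None
        for i, up in enumerate(upstreams):
            if heads[i] >= len(up):
                continue  # this upstream is exhausted
            # Strict "<" means ties keep the earlier (lower-index) upstream.
            if best is None or key(up[heads[i]]) < key(upstreams[best][heads[best]]):
                best = i
        if best is None:
            return out
        out.append(upstreams[best][heads[best]])
        heads[best] += 1


# Original input order: a, b, c share ts=1; d has ts=2.
original = [(1, "a"), (1, "b"), (1, "c"), (2, "d")]

# Two equally valid ways a parallelized flowgraph might split the input.
round_robin = [original[0::2], original[1::2]]  # [a, c] and [b, d]
contiguous = [original[:2], original[2:]]       # [a, b] and [c, d]

by_ts = lambda rec: rec[0]
names_round_robin = [name for _, name in ordered_merge(round_robin, by_ts)]
names_contiguous = [name for _, name in ordered_merge(contiguous, by_ts)]

print(names_round_robin)
print(names_contiguous)
```

Both outputs are correctly sorted by `ts`, but the relative order of the records that tie on `ts` depends on how the input was split, so only the contiguous split happens to reproduce the original order.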

This issue consists of removing flowgraph parallelizations that rely on stable ordered merge for deterministic output.

(*) And in some possible future situations, such as with overlapping chunks, there simply isn't any "original order"... but that will come later.

henridf commented 4 years ago

> And in some possible future situations, such as with overlapping chunks, there simply isn't any "original order"... but that will come later.

Given that the notion of a stable order of data doesn't exist in a model where data can be imported in multiple batches that overlap in the sort field, a stable ordering will have to be defined by the system. A straightforward solution would be to use byte comparison (as we do for set ordering in zng).
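A rough sketch of what a system-defined ordering could look like follows. The `encode` function is a stand-in that uses canonical JSON bytes, not the real zng encoding, and `total_key` and `merge_deterministic` are hypothetical names. The idea is only that ties on the sort field are broken by comparing the serialized bytes of each record, so the result no longer depends on how records were batched.

```python
import heapq
import json


def encode(rec):
    # Stand-in serialization (assumption): canonical JSON bytes.
    # A real system would compare its own encoded record bytes instead.
    return json.dumps(rec, sort_keys=True, separators=(",", ":")).encode("utf-8")


def total_key(rec, sort_field="ts"):
    # Sort field first, then the record's bytes as a tie-breaker.
    return (rec[sort_field], encode(rec))


def merge_deterministic(upstreams):
    # Each upstream must itself be ordered by the total key before merging.
    ordered = [sorted(up, key=total_key) for up in upstreams]
    return list(heapq.merge(*ordered, key=total_key))


recs = [
    {"ts": 1, "msg": "b"},
    {"ts": 1, "msg": "a"},
    {"ts": 2, "msg": "c"},
    {"ts": 1, "msg": "c"},
]

# Two different batchings of the same records, overlapping in ts.
split_1 = [[recs[0], recs[2]], [recs[1], recs[3]]]
split_2 = [[recs[3]], [recs[2], recs[1], recs[0]]]

result_1 = merge_deterministic(split_1)
result_2 = merge_deterministic(split_2)
print(result_1 == result_2)
```

With this key, two records compare equal only when they are byte-identical, in which case their relative order is unobservable anyway.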