apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] Allow scanner to assert an ordering and/or support implicit ordering #34698

Open westonpace opened 1 year ago

westonpace commented 1 year ago

Describe the enhancement requested

The scan node currently assigns "unordered" to its output. This is the correct ordering when the scan is done in parallel and there is no attempt at sequencing the output.

However, the scanner is capable of sequencing its output. When it does so, the scan node should assign the implicit ordering to its output.
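To illustrate what "sequencing" means here, the following is a minimal, single-threaded sketch that uses only the C++ standard library. It is not Arrow's implementation: `Sequencer`, `Push`, and the use of `std::string` as a stand-in for a batch are all hypothetical names chosen for illustration. It assumes each batch carries its position in the scan (its index), so batches that finish out of order can be held back until every earlier batch has been emitted.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: re-emits batches in index order even when they
// arrive out of order (as they would from parallel readers).
// A std::string stands in for a real record batch.
class Sequencer {
 public:
  explicit Sequencer(std::function<void(const std::string&)> emit)
      : emit_(std::move(emit)) {}

  // Accepts batch number `index` and emits every batch that is now
  // contiguous with what has already been emitted.
  void Push(std::size_t index, std::string batch) {
    pending_.emplace(index, std::move(batch));
    auto it = pending_.begin();
    while (it != pending_.end() && it->first == next_) {
      emit_(it->second);
      it = pending_.erase(it);
      ++next_;
    }
  }

 private:
  std::function<void(const std::string&)> emit_;
  std::map<std::size_t, std::string> pending_;  // batches waiting for a gap to fill
  std::size_t next_ = 0;                        // index of the next batch to emit
};
```

A real scanner would also need synchronization around `Push` and some bound on how many batches may wait in `pending_`. Only output produced this way has a meaningful implicit ordering.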

In addition, users should be able to assert that their data is ordered according to some ordering. For example, if a user knows their data is ordered on disk by column "x" then they should be able to assert that in the scan options. The scan node should verify this as it is scanning and report an error if it encounters unsorted data.
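As a rough sketch of the verification step, the check can be done incrementally by remembering the last value seen and comparing it against each new row, including across batch boundaries. The code below uses only the C++ standard library. `OrderingVerifier`, `Check`, and the `Batch` alias are hypothetical names, not part of the Arrow API, and the sketch assumes a single non-null `int64` sort column in ascending order with batches fed in their sequenced order.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a record batch: only the values of column "x",
// the column the user asserts the data is sorted by.
using Batch = std::vector<int64_t>;

// Verifies an asserted ascending ordering one batch at a time.
class OrderingVerifier {
 public:
  // Returns an empty string if the batch is consistent with the asserted
  // ordering, otherwise a message describing the first violation.
  std::string Check(const Batch& batch) {
    for (std::size_t i = 0; i < batch.size(); ++i) {
      if (last_ && batch[i] < *last_) {
        return "unsorted data at row " + std::to_string(rows_seen_ + i) +
               ": " + std::to_string(batch[i]) + " follows " +
               std::to_string(*last_);
      }
      last_ = batch[i];
    }
    rows_seen_ += batch.size();
    return "";
  }

 private:
  std::optional<int64_t> last_;  // last value seen, carried across batches
  std::size_t rows_seen_ = 0;    // rows in fully checked batches
};
```

For example, feeding `{1, 2, 2}` and then `{2, 5}` passes, while a following `{4, 6}` is rejected because 4 follows 5. The scan would be expected to stop and report the error at that point. Multi-column sort keys, descending order, and null placement are the sort of details that make the full check more involved.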

This latter task (asserting an ordering) could also be implemented as a follow-up if that makes things easier, since verifying that the data is ordered as the user states adds some complexity.

Component(s)

C++

EnricoMi commented 3 days ago

@westonpace please see #44738 for an attempt to implement this.