apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.48k stars 1.02k forks source link

[Question] why need this sort in Sort.rs #2557

Open Ted-Jiang opened 2 years ago

Ted-Jiang commented 2 years ago

@yjshen , Sorry for bothering you, Why need positions.sort_unstable() below https://github.com/apache/arrow-datafusion/blob/807b7a5f7eb858e9f7162e1f00ffeeedd0bf2050/datafusion/core/src/physical_plan/sorts/sort.rs#L460-L469

I think we should keep the order of the adding index in positions, so we can keep the order from indices, the result has correctly order. https://github.com/apache/arrow-datafusion/blob/807b7a5f7eb858e9f7162e1f00ffeeedd0bf2050/datafusion/core/src/physical_plan/sorts/sort.rs#L433-L435

yjshen commented 2 years ago

group_indices takes adjacent records in the same batch as input, and does best-effort grouping to make slices instead of individual positions for better extend performance (extend a range of records rather than individual record at a time).

sort_unstable is useful since the positions are generated from lexsort which is unstable itself. there would be a possibility that records with the same sort key appear randomly after lexsort. but extend takes start pos and length as input, so a sort to make records with the same sort key appears sequentially is needed.

Ted-Jiang commented 2 years ago

i mean in this situation image positions.sort_unstable(); will get wrong order

Ted-Jiang commented 2 years ago

@yjshen But i found you pre-sort all batches during insert, https://github.com/apache/arrow-datafusion/blob/4d6428c19dd6333f31ccc12398f6720066bab3f7/datafusion/core/src/physical_plan/sorts/sort.rs#L123-L124 so this situation not exist. If i miss something plz tell me.