Open Ted-Jiang opened 2 years ago
group_indices
takes adjacent records in the same batch as input, and does best-effort grouping to make slices instead of individual positions for better extend
performance (extend a range of records rather than individual record at a time).
sort_unstable
is useful since the positions are generated from lexsort
which is unstable itself. there would be a possibility that records with the same sort key appear randomly after lexsort
. but extend
takes start pos and length as input, so a sort to make records with the same sort key appears sequentially is needed.
i mean in this situation
positions.sort_unstable()
; will get wrong order
@yjshen But i found you pre-sort all batches during insert
,
https://github.com/apache/arrow-datafusion/blob/4d6428c19dd6333f31ccc12398f6720066bab3f7/datafusion/core/src/physical_plan/sorts/sort.rs#L123-L124
so this situation not exist.
If i miss something plz tell me.
@yjshen , Sorry for bothering you, Why need
positions.sort_unstable()
below https://github.com/apache/arrow-datafusion/blob/807b7a5f7eb858e9f7162e1f00ffeeedd0bf2050/datafusion/core/src/physical_plan/sorts/sort.rs#L460-L469I think we should keep the order of the adding index in
positions
, so we can keep the order fromindices
, the result has correctly order. https://github.com/apache/arrow-datafusion/blob/807b7a5f7eb858e9f7162e1f00ffeeedd0bf2050/datafusion/core/src/physical_plan/sorts/sort.rs#L433-L435