apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Implement spilling for PartialSortExec #9170

Open alamb opened 8 months ago

alamb commented 8 months ago

Is your feature request related to a problem or challenge?

PartialSortExec was added in https://github.com/apache/arrow-datafusion/issues/7456 / https://github.com/apache/arrow-datafusion/pull/9125

While one of the major benefits of this operator is that it reduces the memory required when sorting data (as it can emit early), we should also handle the case where it still cannot fit everything in memory.
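
To make the "emit early" point concrete, here is a minimal sketch in plain Rust. It is not DataFusion code: the function name `partial_sort`, the `(a, b)` tuple rows, and the `emit` callback are all hypothetical stand-ins. It assumes the input is already sorted by `a` and the desired order is `(a, b)`, so only the rows of the current `a` group need to be buffered at any time.

```rust
/// Hypothetical partial sort over `(a, b)` rows.
/// Assumption: the input is already sorted by `a`; the output order is `(a, b)`.
/// Returns the peak number of rows that were buffered at once.
pub fn partial_sort(
    input: impl IntoIterator<Item = (i32, i32)>,
    mut emit: impl FnMut(Vec<(i32, i32)>),
) -> usize {
    let mut buffer: Vec<(i32, i32)> = Vec::new();
    let mut peak = 0;
    for row in input {
        // A change in the prefix column `a` means the buffered group is complete.
        let prefix_changed = buffer.last().map_or(false, |&(a, _)| a != row.0);
        if prefix_changed {
            buffer.sort_by_key(|&(_, b)| b);
            // Emit the finished group and start over with an empty buffer.
            emit(std::mem::take(&mut buffer));
        }
        buffer.push(row);
        peak = peak.max(buffer.len());
    }
    // Flush the final group.
    if !buffer.is_empty() {
        buffer.sort_by_key(|&(_, b)| b);
        emit(buffer);
    }
    peak
}
```

The memory needed is bounded by the largest group of rows sharing the same `a` value rather than by the whole input. A single very large group can still exceed the available memory, which is the case this issue is about.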

Describe the solution you'd like

Add spilling support to PartialSortExec so that if it runs out of memory it spills to disk rather than returning an error.
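
As a rough illustration of the general spill-then-merge idea (not the DataFusion implementation, which works on Arrow record batches and a memory pool), here is a toy sketch using only the Rust standard library. `SpillingSorter`, its row-count `budget`, and the text-file run format are all assumptions made for this example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::fs::{self, File};
use std::io::{self, BufRead, BufReader, BufWriter, Write};
use std::path::PathBuf;

/// Hypothetical sorter: buffers values until `budget` rows are held, then
/// writes a sorted run to a file in `dir` instead of failing.
pub struct SpillingSorter {
    budget: usize,
    buffer: Vec<i64>,
    dir: PathBuf,
    runs: Vec<PathBuf>,
}

impl SpillingSorter {
    pub fn new(budget: usize, dir: PathBuf) -> Self {
        assert!(budget > 0, "budget must allow at least one row");
        Self { budget, buffer: Vec::new(), dir, runs: Vec::new() }
    }

    /// Number of sorted runs written to disk so far.
    pub fn spill_count(&self) -> usize {
        self.runs.len()
    }

    pub fn push(&mut self, value: i64) -> io::Result<()> {
        if self.buffer.len() >= self.budget {
            self.spill()?; // over budget: spill rather than return an error
        }
        self.buffer.push(value);
        Ok(())
    }

    fn spill(&mut self) -> io::Result<()> {
        self.buffer.sort_unstable();
        let path = self.dir.join(format!("run-{}.txt", self.runs.len()));
        let mut writer = BufWriter::new(File::create(&path)?);
        for value in self.buffer.drain(..) {
            writeln!(writer, "{}", value)?;
        }
        writer.flush()?;
        self.runs.push(path);
        Ok(())
    }

    /// K-way merge of the in-memory buffer and every spilled run.
    pub fn finish(mut self) -> io::Result<Vec<i64>> {
        self.buffer.sort_unstable();
        let mut sources: Vec<Box<dyn Iterator<Item = io::Result<i64>>>> = Vec::new();
        let in_memory = std::mem::take(&mut self.buffer);
        sources.push(Box::new(in_memory.into_iter().map(Ok::<i64, io::Error>)));
        for path in &self.runs {
            let lines = BufReader::new(File::open(path)?).lines();
            sources.push(Box::new(lines.map(|line| -> io::Result<i64> {
                line?
                    .trim()
                    .parse::<i64>()
                    .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
            })));
        }
        // Min-heap of (next value, source index).
        let mut heap = BinaryHeap::new();
        for (i, source) in sources.iter_mut().enumerate() {
            if let Some(value) = source.next() {
                heap.push(Reverse((value?, i)));
            }
        }
        let mut out = Vec::new();
        while let Some(Reverse((value, i))) = heap.pop() {
            out.push(value);
            if let Some(next) = sources[i].next() {
                heap.push(Reverse((next?, i)));
            }
        }
        drop(sources); // close the run files before deleting them
        for path in &self.runs {
            let _ = fs::remove_file(path);
        }
        Ok(out)
    }
}
```

For simplicity the final merge here collects everything into one `Vec`. A real operator would stream the merged output in batches so that the merge itself stays within the memory limit.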

Describe alternatives you've considered

No response

Additional context

https://github.com/apache/arrow-datafusion/issues/9153 tracks enabling PartialSort for more queries

yyy1000 commented 8 months ago

I would like to help with this. Though it does not seem like a small project, there is a spilling implementation in SortExec that I can learn from.

alamb commented 8 months ago

Thanks @yyy1000 -- I would definitely recommend:

  1. Studying the existing spilling implementation in SortExec
  2. Creating a test case that shows the partial sort being invoked (i.e. set the memory manager limit low and create a partial sort plan)
  3. Refactoring / adapting the parts used in SortExec so they can also be used in PartialSortExec