apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.48k stars 1.01k forks source link

Avoiding spilling in TopK queries by reinserting the to-spill data to memory buffer #3579

Open Dandandan opened 1 year ago

Dandandan commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. We recently added optimizations for ORDER BY expr LIMIT by pushing limits to individual operations (saving memory, CPU time + limiting output rows) and executing sorts in parallel.

The disk spill operation in SortExec currently still assumes the to-spill disk doesn't fit in memory. However after sorting we only have to keep the batch(es) with top fetch rows and store those, which probably avoids spilling to disk.

Describe the solution you'd like We can identify that the to-spill data fits in memory after being merged / sorted and avoid spilling to disk.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

isidentical commented 1 year ago

I've been looking into this and noticed that there are currently two places where this needs to be handled (and from what I've understood, they are actually separate optimizations).

The first place is what you mentioned in this issue, which is during the spill process we sort everything we have and limit the actual output (which would mean that if we have 10 in-memory batches of limit 50, and if we can easily handle 100 records, we won't need to spill after sorting everything in place since the resulting batch would only have 50 items at most): https://github.com/apache/arrow-datafusion/blob/add10a67c8e16aca0a683957ddbea29a2a3a4156/datafusion/core/src/physical_plan/sorts/sort.rs#L281-L282

The second place is actually where we make the first allocation. We calculate the size of the given batch, and then we assume down there that the size of the batch we got from the sort is actually the same as the size of the input batch. https://github.com/apache/arrow-datafusion/blob/add10a67c8e16aca0a683957ddbea29a2a3a4156/datafusion/core/src/physical_plan/sorts/sort.rs#L119-L121

But thanks to the TopK PR, this is no longer the case. The sorting might actually reduce the size while we are still inserting it, so we could in theory free all the redundant memory first. This would actually help a lot, since the next insert_batch will have a lot of free-space to operate on. I've had a very rough benchmark (of a very small subset of the data from ClickHouse benchmark) and it seems like we are saving around ~320 spills (it goes from 320 spills to no spills) just by properly adjusting the used memory size.

Dandandan commented 1 year ago

Great find and analysis @isidentical!

I assumed it was already tracking the correct size of the reduced batch. Could you create a new issue for this "memory tracking bug"?

isidentical commented 1 year ago

@Dandandan definitely! I've created #3596 for it.