[EPIC] Improved Externalized / Spilling / Out of core Hash Aggregation

alamb commented 1 month ago

This is a collection of items to improve external (spilling) aggregation

Background

Abstract—Analytical database systems offer high-performance in-memory aggregation. If there are many unique groups, temporary query intermediates may not fit RAM, requiring the use of external storage. However, switching from an in-memory to an external algorithm can degrade performance sharply

Robust External Hash Aggregation in the Solid State Age (DuckDB external aggregation paper))

DataFusion has supported memory limited / spilling hash aggregation since @kazuyukitanimura added it last year in https://github.com/apache/datafusion/pull/7400.

We can likely improve this feature and @2010YOUY01 is considering working on it

Tasks the solution you'd like

[ ] https://github.com/apache/datafusion/pull/13090

alamb commented 1 month ago

@2010YOUY01 says in https://github.com/apache/datafusion/pull/13090#issuecomment-2437436375

Really nice paper, we can implement the same benchmark and compare in the future 😄 They implemented a unified buffer pool for both table data cache and operator (like aggregation) intermediate results, to easily support spilling in various operators. I think they didn't mention any optimization specific to the spilling part of aggregation, and just use simple LRU policy in the buffer pool. Maybe there are some spilling and merging specific optimizations we can explore (all of memory-limited aggregate/SortMergeJoin/Sort can benefit from)

DF doesn't have a buffer pool in the traditional sense, and the way arrow-rs allocates memory directly from the system allocator makes it quite hard to implement. However, I think the fact that we have arrow-rs and the arrow IPC offers lots of opportunity.

Also, are you interested in improving DataFusion's external aggregation capabilities? I think it is a non trivial gap at the moment and would be great to improve (and I would be interested in helping do so). if you are, I can start organizing the work into some tickets to see if we can get some others to check it out too

Yes, I'm start to look at related components now. Perhaps we can start with making memory-limited SQL queries more stable (e.g. more tests, make sure TPCH-SF1000 is able to run on laptop correctly), and later optimize.

I think starting with stability and then optimizing is a great idea 💯

Note that one challenge of TPCH specifically is that it contains many joins and is largely focused on that, so in order to run TPCH-SF1000 we would also need to implement spilling joins

https://github.com/apache/datafusion/issues/1599

Another potential option would be to work on running clickbench with a very small memory (100MB)?

Or maybe we could figure out another large dataset 🤔

alamb commented 1 month ago

Note that one challenge of TPCH specifically is that it contains many joins and is largely focused on that, so in order to run TPCH-SF1000 we would also need to implement spilling joins

Maybe @comphead 's work to get SMJ working in https://github.com/apache/datafusion/pull/13111 will help this (e.g. we could always use SMJ for the large TPCH queries 🤔 )

2010YOUY01 commented 4 weeks ago

Another potential option would be to work on running clickbench with a very small memory (100MB)?

This is a good idea, we should get clickbench work under memory constraints before TPCH

apache / datafusion

[EPIC] Improved Externalized / Spilling / Out of core Hash Aggregation #13123

Background

Tasks the solution you'd like