Open alamb opened 1 month ago
@2010YOUY01 says in https://github.com/apache/datafusion/pull/13090#issuecomment-2437436375
Really nice paper, we can implement the same benchmark and compare in the future 😄 They implemented a unified buffer pool for both table data cache and operator (like aggregation) intermediate results, to easily support spilling in various operators. I think they didn't mention any optimization specific to the spilling part of aggregation, and just use simple LRU policy in the buffer pool. Maybe there are some spilling and merging specific optimizations we can explore (all of memory-limited aggregate/SortMergeJoin/Sort can benefit from)
DF doesn't have a buffer pool in the traditional sense, and the way arrow-rs allocates memory directly from the system allocator makes it quite hard to implement. However, I think the fact that we have arrow-rs and the arrow IPC offers lots of opportunity.
Also, are you interested in improving DataFusion's external aggregation capabilities? I think it is a non trivial gap at the moment and would be great to improve (and I would be interested in helping do so). if you are, I can start organizing the work into some tickets to see if we can get some others to check it out too
Yes, I'm start to look at related components now. Perhaps we can start with making memory-limited SQL queries more stable (e.g. more tests, make sure TPCH-SF1000 is able to run on laptop correctly), and later optimize.
I think starting with stability and then optimizing is a great idea 💯
Note that one challenge of TPCH specifically is that it contains many joins and is largely focused on that, so in order to run TPCH-SF1000 we would also need to implement spilling joins
Another potential option would be to work on running clickbench with a very small memory (100MB)?
Or maybe we could figure out another large dataset 🤔
Note that one challenge of TPCH specifically is that it contains many joins and is largely focused on that, so in order to run TPCH-SF1000 we would also need to implement spilling joins
Maybe @comphead 's work to get SMJ working in https://github.com/apache/datafusion/pull/13111 will help this (e.g. we could always use SMJ for the large TPCH queries 🤔 )
Another potential option would be to work on running clickbench with a very small memory (100MB)?
This is a good idea, we should get clickbench work under memory constraints before TPCH
This is a collection of items to improve external (spilling) aggregation
Background
DataFusion has supported memory limited / spilling hash aggregation since @kazuyukitanimura added it last year in https://github.com/apache/datafusion/pull/7400.
We can likely improve this feature and @2010YOUY01 is considering working on it
Tasks the solution you'd like