Open murphyatwork opened 2 weeks ago
Hi @murphyatwork, I'm new to this project and would love guidance on how to start. Could you assign me this issue to get going? Thank you!
Hi @murphyatwork, The outputs of the two queries are different, and the count for each item differs as well. Both queries ran on the same table, yet the results don't match. If APPROX_TOP_K were just slow, the results should still match, right? Am I misunderstanding something here?
Hi @murphyatwork ,
Your SQL query should be
select lo_orderkey, count(*) from lineorder group by lo_orderkey order by count(*) desc limit 5;
I would love to know the difference in the time taken if you execute this. Could you please run it and share the results?
hey dude, thanks for noticing this issue.
You're right that my query was missing the order by count(*) desc, but anyway it doesn't affect the performance comparison.
APPROX_TOP_K is merely approximate, so it's understandable that the result can be a little bit different, but such a big difference is beyond my expectation. Why? APPROX_TOP_K is using the space-saving algorithm: when an untracked item arrives and all counters are occupied, it evicts the minimum counter and assigns the new item that counter's frequency+1. As a result, the estimated frequency can be much larger than the real value if the dataset has items with similar frequency. For ssb.lineorder.lo_orderkey, it's exactly the case.
The idea to optimize the performance:
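To make the frequency+1 eviction concrete, here is a minimal Python sketch of the space-saving update rule (illustrative only; the actual implementation is the C++ code in be/src/exprs/agg/approx_top_k.h):

```python
# Minimal sketch of the space-saving update rule (illustrative only,
# not the actual C++ implementation in approx_top_k.h).
def space_saving(stream, num_counters):
    counters = {}  # item -> estimated frequency
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < num_counters:
            counters[item] = 1
        else:
            # All counters occupied: evict the minimum counter and give
            # the new item its frequency + 1. This "+ 1" inheritance is
            # the source of overestimation when frequencies are similar.
            min_item = min(counters, key=counters.get)
            counters[item] = counters.pop(min_item) + 1
    return counters

# Skewed data: the heavy hitters "a" and "b" are tracked exactly,
# only the low-frequency tail items get overestimated.
print(space_saving(["a"] * 100 + ["b"] * 50 + list("cdefg"), 4))
```

On skewed data the heavy hitters survive with exact counts; the pathological overestimation shows up when many items share similar frequencies, as described above.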
And also, if this algorithm (space-saving) is fundamentally inaccurate, we can consider other algorithms. But before that, we can focus on the performance optimization to make it practical.
If you want to contribute, I would suggest looking at the file be/src/exprs/agg/approx_top_k.h
Got it. Thanks for explaining. I will have a look into the file.
Hi @murphyatwork I would like to proceed with the solution using the approach you suggested. Please find my attached proposal and assign me the issue.
Proposal: Optimizing Approximate Top-K Algorithm (Space Saving)
Objective:
To improve the performance and accuracy of the approximate top-K algorithm by optimizing the current sorting mechanism and increasing the number of counters tracked. These changes aim to optimize the execution time and improve the approximation accuracy without altering the core behavior of the algorithm.
Changes:
Optimize Sorting with a Heap-Based Approach:
The current _maintain_ordering() function relies on a linear bubble sort to keep the counters sorted by frequency. I propose replacing this with a min-heap data structure, which will allow for more efficient insertion and deletion operations, reducing the time complexity from O(n) to O(log n) when maintaining the top-K counters. This will improve the performance of updating and managing the counters during the process() method, especially when dealing with large datasets and frequent updates.
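As a sketch of this proposal (hypothetical Python with made-up names; the real code is the C++ in approx_top_k.h), an indexed min-heap keyed by count keeps the eviction victim at the root, at the cost of an O(log n) sift after every increment:

```python
# Hypothetical sketch of the proposed heap-based counter maintenance
# (names are illustrative and do not match the actual approx_top_k.h code).
class HeapCounters:
    """Space-saving counters stored in an indexed min-heap keyed by count,
    so the eviction victim (the minimum counter) is always at the root."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []   # entries are [count, item]; min count at heap[0]
        self.pos = {}    # item -> index into self.heap

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.heap[parent][0] <= self.heap[i][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        # O(log n) after every count increase -- this per-update cost is
        # the trade-off against the existing localized bubbling.
        n = len(self.heap)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.heap[child][0] < self.heap[smallest][0]:
                    smallest = child
            if smallest == i:
                break
            self._swap(i, smallest)
            i = smallest

    def add(self, item):
        if item in self.pos:                  # tracked: increment and sift
            i = self.pos[item]
            self.heap[i][0] += 1
            self._sift_down(i)
        elif len(self.heap) < self.capacity:  # room left: insert with count 1
            self.heap.append([1, item])
            self.pos[item] = len(self.heap) - 1
            self._sift_up(len(self.heap) - 1)
        else:                                 # full: evict the root (min)
            count, old = self.heap[0]
            del self.pos[old]
            self.heap[0] = [count + 1, item]
            self.pos[item] = 0
            self._sift_down(0)

hc = HeapCounters(3)
for x in "aabbbcdd":
    hc.add(x)
print({item: c for c, item in hc.heap})  # {'a': 2, 'b': 3, 'd': 3}
```

Note that because every increment can trigger a sift, the heap trades the common cheap case for a better worst case; whether that wins in practice is exactly what the discussion below examines.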
Increase Counter Size for Better Accuracy:
Currently, the algorithm uses approximately 2*K counters, where K is the number of top elements to track. I propose increasing this number to 4*K counters (or a configurable multiple of K) to improve the accuracy of the approximation. While this introduces a slight memory overhead, it will allow the algorithm to track more frequent elements, reducing the chances of incorrectly replacing important elements in memory-limited situations.
This adjustment will be made within the get_k_and_counter_num() function, keeping the allocation within reasonable limits while boosting accuracy.
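The allocation change might look roughly like this (a sketch only; the multiplier, the cap, and the simplified get_counter_num name are assumptions for illustration, not the real get_k_and_counter_num signature):

```python
# Hypothetical sketch of the proposed counter allocation; the real logic
# lives in get_k_and_counter_num() in approx_top_k.h. The 4x multiplier
# and the cap below are illustrative assumptions.
COUNTER_FACTOR = 4      # proposed: 4 * K instead of roughly 2 * K
MAX_COUNTERS = 100_000  # assumed upper bound to keep memory reasonable

def get_counter_num(k, factor=COUNTER_FACTOR):
    # More counters than K improves accuracy; the cap bounds memory use.
    return min(max(k * factor, k), MAX_COUNTERS)

print(get_counter_num(5))       # 20
print(get_counter_num(50_000))  # 100000 (clamped by the cap)
```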
Expected Impact:
Performance: The change from a linear sorting algorithm to a heap-based approach should significantly reduce the time complexity when updating counters, especially in high-frequency scenarios, without affecting the core logic.
Accuracy: Increasing the number of counters will allow the algorithm to handle more frequent items, thereby improving the precision of the top-K approximation with minimal memory overhead.
Compatibility: These optimisations will not alter the core functionality of the approximate top-K algorithm. The same input/output behaviour will be maintained, and the existing interface for the algorithm remains unchanged. All serialisation, merging, and update operations should continue to work as expected.
Next Steps:
- Implement the heap-based sorting within the _maintain_ordering() function.
- Modify the counter allocation logic to allow for a configurable and slightly larger number of counters (e.g., 4 * K counters).
- Ensure that these changes do not impact the correctness by running all relevant unit tests and adding additional tests for performance and edge cases if required.
Any feedback and suggestions are welcome before I proceed with the implementation!
cool! can't wait for it
Hi @murphyatwork I started working on the proposed solution. However, I noticed something:
The current implementation of the space-saving algorithm already employs localised bubbling (and not bubble sort) to maintain the ordering of counters efficiently. This means that when a counter's count is updated (typically incremented by 1), it swaps positions with its immediate neighbours only if necessary. Because the changes in counts are usually small, the counter moves only a few positions, if at all. This localised swapping ensures that the per-update operation is effectively O(1) in most cases.
Key Points:
- Minimises the number of swaps needed to maintain the sorted order of counters based on counts.
- Swapping is limited to immediate neighbours, reducing overhead.
- Efficiently handles frequent updates typical in data streams.
- Average case: O(1) time complexity due to minimal movement of counters.
- Worst case: O(k) time complexity (where k is the number of counters), but by the construction of the algorithm this may rarely, if ever, happen.
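The localized bubbling described above can be sketched as follows (illustrative Python, not the actual _maintain_ordering code): after a +1 increment, a counter only needs to move past the run of neighbours that shared its old count.

```python
# Sketch of localized bubbling (illustrative; not the actual
# _maintain_ordering implementation in approx_top_k.h).
def bubble_up(counters, i):
    """counters is a list of (item, count) kept sorted by count, descending.
    After counters[i] was incremented by 1, restore the order by swapping
    with immediate left neighbours. Because the count changed by exactly 1,
    the counter moves past at most the run of neighbours that shared its
    old count, which is O(1) per update on average."""
    while i > 0 and counters[i - 1][1] < counters[i][1]:
        counters[i - 1], counters[i] = counters[i], counters[i - 1]
        i -= 1
    return i

counters = [("a", 5), ("b", 3), ("c", 3), ("d", 3), ("e", 1)]
counters[3] = ("d", 4)          # increment "d" from 3 to 4
new_pos = bubble_up(counters, 3)
print(new_pos)   # 1: "d" bubbled past the run of counters with count 3
print(counters)
```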
The algorithm is already optimised for both time and space complexity within the constraints of the space-saving approach. Further optimisations at the algorithmic level (like replacing localised bubbling with a heap) are unlikely to yield significant performance gains. Using a heap could introduce additional overhead from maintaining the heap property during frequent count updates (which are inevitable, given how space saving updates counts).
If you think this is incorrect, we can do a quick catch-up and discuss it further. Would love to have a quick chat!
According to my previous profiling, I can still see the bottleneck is the counters' _maintain_ordering:
- 44.99% 1.72% pip_exec_com starrocks_be [.] starrocks::AggregateFunctionBatchHelper<starrocks::ApproxTopKState<(starrocks::LogicalType)5>, starrocks::ApproxTopKAggregateFunction<(starrocks::LogicalType)5, int> >::upd▒
- 43.26% starrocks::AggregateFunctionBatchHelper<starrocks::ApproxTopKState<(starrocks::LogicalType)5>, starrocks::ApproxTopKAggregateFunction<(starrocks::LogicalType)5, int> >::update_batch_single_state ▒
- 35.39% starrocks::ApproxTopKState<(starrocks::LogicalType)5>::_maintain_ordering ▒
7.21% phmap::priv::raw_hash_set<phmap::priv::FlatHashMapPolicy<int, starrocks::ApproxTopKState<(starrocks::LogicalType)5>::Counter*>, starrocks::PhmapDefaultHashFunc<(starrocks::LogicalType)5, (starrocks::PhmapSeed)0>, phmap::▒
- 6.72% starrocks::ApproxTopKState<(starrocks::LogicalType)5>::process<true> ▒
+ 1.61% starrocks::ApproxTopKState<(starrocks::LogicalType)5>::_min_index ▒
0.85% phmap::priv::raw_hash_set<phmap::priv::FlatHashMapPolicy<int, starrocks::ApproxTopKState<(starrocks::LogicalType)5>::Counter*>, starrocks::PhmapDefaultHashFunc<(starrocks::LogicalType)5, (starrocks::PhmapSeed)0>, phmap::Equ▒
+ 1.72% start_thread
I think your analysis is mostly correct: the algorithm complexity is not a problem in the average case, but it can be in the worst case. That means if all items in the dataset have similar frequency, the counters need to be removed and added very frequently, and that can be a bottleneck.
For instance, suppose the counters are [{a:1},{b:1},{c:1},{d:1},{e:1}]. When a new item f arrives, it evicts the min counter and increases the count, so the counters become [{b:1},{c:1},{d:1},{e:1},{f:2}]. The real frequency of f is only 1, but as the stream goes on the estimate can grow to the order of NumItems.
I have no quick idea about how to handle this case, looks like we need a new strategy for it.
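A small Python sketch of this worst case (illustrative, not the production code): with a near-uniform stream, almost every update takes the eviction path, and the surviving estimates end up near NumItems / NumCounters instead of the true frequency of 1.

```python
# Sketch of the worst case described above (illustrative only, not the
# StarRocks implementation): a near-uniform stream forces an eviction on
# almost every update and inflates the surviving estimates.
def space_saving_with_stats(stream, num_counters):
    counters, evictions = {}, 0
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < num_counters:
            counters[item] = 1
        else:
            # Eviction path: replace the min counter, inherit frequency + 1.
            evictions += 1
            min_item = min(counters, key=counters.get)
            counters[item] = counters.pop(min_item) + 1
    return counters, evictions

# 10_000 distinct items, 5 counters: every item after the first 5 evicts,
# and the estimates reach NumItems / NumCounters instead of the true 1.
counters, evictions = space_saving_with_stats(
    (f"id{i}" for i in range(10_000)), 5)
print(evictions)               # 9995
print(max(counters.values()))  # 2000 (true frequency of every item is 1)
```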
Hi @murphyatwork Are we looking to optimize this by writing a new algorithm other than Space Saving, or are we going to leave it as-is, considering the worst case is less frequent?
My understanding is this worst case is basically unacceptable. For instance, last week I wanted to use this top-n function to calculate the common-value statistics, which is currently done with a regular GROUP-BY & ORDER-BY. But with some tests I found that the accuracy of this top-n implementation is too bad to use.
And I didn't find anyone (sampled from the community) using this function. So, to conclude, my idea is to refactor this implementation.
Enhancement
It can be used to calculate the top-k from a large dataset quickly, which is expected to be much faster than a plain TOP-K query.
But actually it's not, and it's even slower than the GROUP-BY. Calculating top-k from ssb.lineorder: