Open murphyatwork opened 2 weeks ago
Hi @murphyatwork, I'm new to this project and would love guidance on how to start. Could you assign me this issue to get going? Thank you!
Hi @murphyatwork, The outputs of the two queries are different, and the count for each item differs as well. Both queries ran on the same table, yet the results don't match. If APPROX_TOP_K were just slow, the results should still match, right? Am I misunderstanding something here?
Hi @murphyatwork ,
Your SQL query should be
select lo_orderkey, count(*) from lineorder group by lo_orderkey order by count(*) desc limit 5;
I would love to know the difference in the time taken if you execute this. Could you please run it and share the results?
hey dude, thanks for noticing this issue.
You're right that my query was missing the order by count(*) desc, but anyway it doesn't affect the performance comparison.
APPROX_TOP_K is merely approximate, so it's understandable that the result can be a little bit different, but such a big difference is beyond my expectation. Why? APPROX_TOP_K is using the space-saving algorithm: when an untracked item arrives and all counters are occupied, it evicts the minimum counter and assigns the new item that counter's frequency+1. As a result, the estimated frequency can be much larger than the real value if the dataset has items with similar frequency. For ssb.lineorder.lo_orderkey, it's exactly the case.
The idea to optimize the performance:
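To make the frequency+1 eviction concrete, here is a minimal Python sketch of the space-saving update rule (illustrative only; the actual implementation is the C++ code in be/src/exprs/agg/approx_top_k.h):

```python
# Minimal sketch of the space-saving update rule (illustrative only,
# not the actual C++ implementation in approx_top_k.h).
def space_saving(stream, num_counters):
    counters = {}  # item -> estimated frequency
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < num_counters:
            counters[item] = 1
        else:
            # All counters occupied: evict the minimum counter and give
            # the new item its frequency + 1. This "+ 1" inheritance is
            # the source of overestimation when frequencies are similar.
            min_item = min(counters, key=counters.get)
            counters[item] = counters.pop(min_item) + 1
    return counters

# Skewed data: the heavy hitters "a" and "b" are tracked exactly,
# only the low-frequency tail items get overestimated.
print(space_saving(["a"] * 100 + ["b"] * 50 + list("cdefg"), 4))
```

On skewed data the heavy hitters survive with exact counts; the pathological overestimation shows up when many items share similar frequencies, as described above.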
And also, if this algorithm (space-saving) is fundamentally inaccurate, we can consider other algorithms. But before that, we can focus on the performance optimization to make it practical.
If you want to contribute, I would suggest looking at the file be/src/exprs/agg/approx_top_k.h
Got it. Thanks for explaining. I will have a look into the file.
Hi @murphyatwork I would like to proceed with the solution using the approach you suggested. Please find my attached proposal and assign me the issue.
Proposal: Optimizing Approximate Top-K Algorithm (Space Saving)
Objective:
To improve the performance and accuracy of the approximate top-K algorithm by optimizing the current sorting mechanism and increasing the number of counters tracked. These changes aim to optimize the execution time and improve the approximation accuracy without altering the core behavior of the algorithm.
Changes:
Optimize Sorting with a Heap-Based Approach:
The current _maintain_ordering() function relies on a linear bubble sort to keep the counters sorted by frequency. I propose replacing this with a min-heap data structure, which will allow for more efficient insertion and deletion operations, reducing the time complexity from O(n) to O(log n) when maintaining the top-K counters. This will improve the performance of updating and managing the counters during the process() method, especially when dealing with large datasets and frequent updates.
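As a sketch of this proposal (hypothetical Python with made-up names; the real code is the C++ in approx_top_k.h), an indexed min-heap keyed by count keeps the eviction victim at the root, at the cost of an O(log n) sift after every increment:

```python
# Hypothetical sketch of the proposed heap-based counter maintenance
# (names are illustrative and do not match the actual approx_top_k.h code).
class HeapCounters:
    """Space-saving counters stored in an indexed min-heap keyed by count,
    so the eviction victim (the minimum counter) is always at the root."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []   # entries are [count, item]; min count at heap[0]
        self.pos = {}    # item -> index into self.heap

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.heap[parent][0] <= self.heap[i][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        # O(log n) after every count increase -- this per-update cost is
        # the trade-off against the existing localized bubbling.
        n = len(self.heap)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.heap[child][0] < self.heap[smallest][0]:
                    smallest = child
            if smallest == i:
                break
            self._swap(i, smallest)
            i = smallest

    def add(self, item):
        if item in self.pos:                  # tracked: increment and sift
            i = self.pos[item]
            self.heap[i][0] += 1
            self._sift_down(i)
        elif len(self.heap) < self.capacity:  # room left: insert with count 1
            self.heap.append([1, item])
            self.pos[item] = len(self.heap) - 1
            self._sift_up(len(self.heap) - 1)
        else:                                 # full: evict the root (min)
            count, old = self.heap[0]
            del self.pos[old]
            self.heap[0] = [count + 1, item]
            self.pos[item] = 0
            self._sift_down(0)

hc = HeapCounters(3)
for x in "aabbbcdd":
    hc.add(x)
print({item: c for c, item in hc.heap})  # {'a': 2, 'b': 3, 'd': 3}
```

Note that because every increment can trigger a sift, the heap trades the common cheap case for a better worst case; whether that wins in practice is exactly what the discussion below examines.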
Increase Counter Size for Better Accuracy:
Currently, the algorithm uses approximately 2*K counters, where K is the number of top elements to track. I propose increasing this number to 4*K counters (or a configurable multiple of K) to improve the accuracy of the approximation. While this introduces a slight memory overhead, it will allow the algorithm to track more frequent elements, reducing the chances of incorrectly replacing important elements in memory-limited situations.
This adjustment will be made within the get_k_and_counter_num() function, keeping the allocation within reasonable limits while boosting accuracy.
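The allocation change might look roughly like this (a sketch only; the multiplier, the cap, and the simplified get_counter_num name are assumptions for illustration, not the real get_k_and_counter_num signature):

```python
# Hypothetical sketch of the proposed counter allocation; the real logic
# lives in get_k_and_counter_num() in approx_top_k.h. The 4x multiplier
# and the cap below are illustrative assumptions.
COUNTER_FACTOR = 4      # proposed: 4 * K instead of roughly 2 * K
MAX_COUNTERS = 100_000  # assumed upper bound to keep memory reasonable

def get_counter_num(k, factor=COUNTER_FACTOR):
    # More counters than K improves accuracy; the cap bounds memory use.
    return min(max(k * factor, k), MAX_COUNTERS)

print(get_counter_num(5))       # 20
print(get_counter_num(50_000))  # 100000 (clamped by the cap)
```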
Expected Impact:
Performance: The change from a linear sorting algorithm to a heap-based approach should significantly reduce the time complexity when updating counters, especially in high-frequency scenarios, without affecting the core logic.
Accuracy: Increasing the number of counters will allow the algorithm to handle more frequent items, thereby improving the precision of the top-K approximation with minimal memory overhead.
Compatibility: These optimisations will not alter the core functionality of the approximate top-K algorithm. The same input/output behaviour will be maintained, and the existing interface for the algorithm remains unchanged. All serialisation, merging, and update operations should continue to work as expected.
Next Steps:
- Implement the heap-based sorting within the _maintain_ordering() function.
- Modify the counter allocation logic to allow for a configurable and slightly larger number of counters (e.g., 4 * K counters).
- Ensure that these changes do not impact the correctness by running all relevant unit tests and adding additional tests for performance and edge cases if required.
Any feedback and suggestions are welcome before I proceed with the implementation!
cool! can't wait for it
Hi @murphyatwork I started working on the proposed solution. However, I noticed something:
The current implementation of the space-saving algorithm already employs localised bubbling (and not bubble sort) to maintain the ordering of counters efficiently. This means that when a counter's count is updated (typically incremented by 1), it swaps positions with its immediate neighbours only if necessary. Because the changes in counts are usually small, the counter moves only a few positions, if at all. This localised swapping ensures that the per-update operation is effectively O(1) in most cases.
Key Points:
- Minimises the number of swaps needed to maintain the sorted order of counters based on counts.
- Swapping is limited to immediate neighbours, reducing overhead.
- Efficiently handles frequent updates typical in data streams.
- Average case: O(1) time complexity due to minimal movement of counters.
- Worst case: O(k) time complexity (where k is the number of counters), but by the construction of the algorithm this may rarely, if ever, happen.
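The localized bubbling described above can be sketched as follows (illustrative Python, not the actual _maintain_ordering code): after a +1 increment, a counter only needs to move past the run of neighbours that shared its old count.

```python
# Sketch of localized bubbling (illustrative; not the actual
# _maintain_ordering implementation in approx_top_k.h).
def bubble_up(counters, i):
    """counters is a list of (item, count) kept sorted by count, descending.
    After counters[i] was incremented by 1, restore the order by swapping
    with immediate left neighbours. Because the count changed by exactly 1,
    the counter moves past at most the run of neighbours that shared its
    old count, which is O(1) per update on average."""
    while i > 0 and counters[i - 1][1] < counters[i][1]:
        counters[i - 1], counters[i] = counters[i], counters[i - 1]
        i -= 1
    return i

counters = [("a", 5), ("b", 3), ("c", 3), ("d", 3), ("e", 1)]
counters[3] = ("d", 4)          # increment "d" from 3 to 4
new_pos = bubble_up(counters, 3)
print(new_pos)   # 1: "d" bubbled past the run of counters with count 3
print(counters)
```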
The algorithm is already optimised for both time and space complexity within the constraints of the space-saving approach. Further optimisations at the algorithmic level (like replacing localised bubbling with a heap) are unlikely to yield significant performance gains. Using a heap could introduce additional overhead from maintaining the heap property during frequent count updates (which are inevitable, given how space saving updates counts).
If you think this is incorrect, we can do a quick catch-up and discuss it further. Would love to have a quick chat!
According to my previous profiling, I can still see the bottleneck is the counters' _maintain_ordering:
- 44.99% 1.72% pip_exec_com starrocks_be [.] starrocks::AggregateFunctionBatchHelper<starrocks::ApproxTopKState<(starrocks::LogicalType)5>, starrocks::ApproxTopKAggregateFunction<(starrocks::LogicalType)5, int> >::upd▒
- 43.26% starrocks::AggregateFunctionBatchHelper<starrocks::ApproxTopKState<(starrocks::LogicalType)5>, starrocks::ApproxTopKAggregateFunction<(starrocks::LogicalType)5, int> >::update_batch_single_state ▒
- 35.39% starrocks::ApproxTopKState<(starrocks::LogicalType)5>::_maintain_ordering ▒
7.21% phmap::priv::raw_hash_set<phmap::priv::FlatHashMapPolicy<int, starrocks::ApproxTopKState<(starrocks::LogicalType)5>::Counter*>, starrocks::PhmapDefaultHashFunc<(starrocks::LogicalType)5, (starrocks::PhmapSeed)0>, phmap::▒
- 6.72% starrocks::ApproxTopKState<(starrocks::LogicalType)5>::process<true> ▒
+ 1.61% starrocks::ApproxTopKState<(starrocks::LogicalType)5>::_min_index ▒
0.85% phmap::priv::raw_hash_set<phmap::priv::FlatHashMapPolicy<int, starrocks::ApproxTopKState<(starrocks::LogicalType)5>::Counter*>, starrocks::PhmapDefaultHashFunc<(starrocks::LogicalType)5, (starrocks::PhmapSeed)0>, phmap::Equ▒
+ 1.72% start_thread
I think your analysis is mostly correct: the algorithm complexity is not a problem in the average case, but it can be in the worst case. That means if all items in the dataset have similar frequency, the counters need to be removed and added very frequently, and that can be a bottleneck.
For instance, suppose the counters are [{a:1},{b:1},{c:1},{d:1},{e:1}]. When a new item f arrives, it evicts the min counter and increases the count, so the counters become [{b:1},{c:1},{d:1},{e:1},{f:2}]. The real frequency of f is only 1, but as the stream goes on the estimate can grow to the order of NumItems.
I have no quick idea about how to handle this case, looks like we need a new strategy for it.
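A small Python sketch of this worst case (illustrative, not the production code): with a near-uniform stream, almost every update takes the eviction path, and the surviving estimates end up near NumItems / NumCounters instead of the true frequency of 1.

```python
# Sketch of the worst case described above (illustrative only, not the
# StarRocks implementation): a near-uniform stream forces an eviction on
# almost every update and inflates the surviving estimates.
def space_saving_with_stats(stream, num_counters):
    counters, evictions = {}, 0
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < num_counters:
            counters[item] = 1
        else:
            # Eviction path: replace the min counter, inherit frequency + 1.
            evictions += 1
            min_item = min(counters, key=counters.get)
            counters[item] = counters.pop(min_item) + 1
    return counters, evictions

# 10_000 distinct items, 5 counters: every item after the first 5 evicts,
# and the estimates reach NumItems / NumCounters instead of the true 1.
counters, evictions = space_saving_with_stats(
    (f"id{i}" for i in range(10_000)), 5)
print(evictions)               # 9995
print(max(counters.values()))  # 2000 (true frequency of every item is 1)
```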
Hi @murphyatwork Are we looking to optimize this by writing a new algorithm other than Space Saving, or are we going to leave it as-is, considering the worst case is less frequent?
My understanding is this worst case is basically unacceptable. For instance, last week I wanted to use this top-n function to calculate the common-value statistics, which is currently done with a regular GROUP-BY & ORDER-BY. But with some tests I found that the accuracy of this top-n implementation is too bad to use.
And I didn't find anyone (sampled from the community) using this function. So, to conclude, my idea is to refactor this implementation.
Enhancement
It can be used to calculate the top-k from a large dataset quickly, which is expected to be much faster than a plain TOP-K query.
But actually it's not, and it's even slower than the GROUP-BY. Calculating top-k from ssb.lineorder: