facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.54k stars 1.17k forks

Optimize sort #6766

Open skadilover opened 1 year ago

skadilover commented 1 year ago

Description

We are replacing Presto with Velox in our system, but we had already done a lot of in-house performance research and optimization on top of Presto. We found a performance gap between directly replacing Presto with Velox and using our optimized Presto. For the sort scenario specifically, our analysis suggests the gap is caused by differences in the following areas:

  1. Null value check. When a null value in Velox is stored in a RowContainer, it is stored as a null flag plus a default-constructed T(). A comparison first reads the null flag and then the data. Our improvement: when storing, we use the min/max of T (chosen according to the compareFlags), so a comparison reads the data directly and the null-flag read is eliminated.
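A minimal sketch of this null-encoding idea (hypothetical helper, not Velox's actual API): nulls are stored as the min or max of T, chosen from the sort's null-ordering flag, so the comparator never consults a separate null flag. One caveat, raised later in this thread, is what happens when the data itself contains the min/max value.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Encode a nullable value so the comparator can read the stored value
// directly, with no separate null-flag lookup. Illustrative only.
template <typename T>
T encodeNullable(const std::optional<T>& value, bool nullsFirst) {
  if (value.has_value()) {
    return *value;
  }
  // Hypothetical convention: nulls-first sorts need the smallest key,
  // nulls-last sorts need the largest.
  return nullsFirst ? std::numeric_limits<T>::min()
                    : std::numeric_limits<T>::max();
}
```

With this encoding, a plain `a < b` on the stored values already places nulls at the desired end of the output.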

  2. Prefix sort. In compare(), data is read directly from the underlying data structure, which incurs extra memory-access overhead in both Presto and Velox. In Presto, one compare reads data from several columns; in Velox, data is stored row-wise, and besides the sort keys the RowContainer also holds null flags, normalized data, and the payload. The data actually needed for sorting (the sort keys) therefore lives in a larger memory block than strictly required, causing extra cache misses. Our optimization introduces a prefix column in PagesIndex/RowContainer, a vector<long> or vector<byte[]>: 1) Byte/Integer/Bigint values are stored directly in the prefix; 2) for Varchar, the first 8 bytes of the string are stored in the prefix; 3) for multiple columns, a fixed-length byte[] holds the prefix: Byte/Integer/Bigint keys are written in, and when the first Varchar is encountered it fills the remaining bytes of the prefix, with any remaining sort keys ignored. In compare(), the prefix is compared first; only if the prefixes are equal is data read from the underlying data structure. If the prefix has high selectivity, sort performance improves.
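The prefix encoding in point 2 can be sketched as follows; the helper names are illustrative, not Velox's API. The key property is that memcmp on the encoded bytes must agree with the column's sort order, which for BIGINT means flipping the sign bit and storing the bytes big-endian:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string_view>

using Prefix = std::array<uint8_t, 8>;

// BIGINT: flip the sign bit so negatives order below positives, then
// store most-significant byte first so byte-wise compare matches <.
Prefix encodeBigint(int64_t value) {
  uint64_t bits = static_cast<uint64_t>(value) ^ (1ull << 63);
  Prefix out{};
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>(bits >> (56 - 8 * i));
  }
  return out;
}

// VARCHAR: take the first 8 bytes, zero-padded, so shorter strings
// sort before longer ones sharing the same prefix.
Prefix encodeVarchar(std::string_view s) {
  Prefix out{};
  std::memcpy(out.data(), s.data(), std::min<size_t>(s.size(), out.size()));
  return out;
}
```

`std::array`'s lexicographic `operator<` on the encoded bytes then gives the same ordering as comparing the original values.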

  3. Range sort. In our system, we previously implemented a "range sort" on top of the column-based data structure in PagesIndex (Presto). The general logic: given ORDER BY a, b, c, first sort column a. Runs of duplicate values in column a define ranges; column b then only needs to be sorted within each range, and column c is processed the same way. However, this algorithm relies on a column-based data structure, while RowContainer currently stores data row-wise. Could RowContainer support column-based storage?
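A minimal sketch of the range-sort idea, assuming integer key columns and sorting a vector of row indices rather than Presto's PagesIndex:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Sort row indices by cols[key]; within each run of duplicates, sort
// recursively by the next key column. Illustrative only.
void rangeSort(
    std::vector<size_t>& idx,
    const std::vector<std::vector<int>>& cols,
    size_t key = 0) {
  if (key >= cols.size()) {
    return;
  }
  const auto& col = cols[key];
  std::sort(idx.begin(), idx.end(),
            [&](size_t a, size_t b) { return col[a] < col[b]; });
  // Find runs of equal values and sort each run by the next key.
  for (size_t lo = 0; lo < idx.size();) {
    size_t hi = lo + 1;
    while (hi < idx.size() && col[idx[hi]] == col[idx[lo]]) {
      ++hi;
    }
    if (hi - lo > 1 && key + 1 < cols.size()) {
      std::vector<size_t> run(idx.begin() + lo, idx.begin() + hi);
      rangeSort(run, cols, key + 1);
      std::copy(run.begin(), run.end(), idx.begin() + lo);
    }
    lo = hi;
  }
}
```

Only the duplicate ranges ever touch the later key columns, which is the source of the speedup when the leading key is selective.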

To summarize the questions:

  1. Are the above improvements to sort reasonable?
  2. Currently, Velox's sort implementation is simply std::sort. Does Velox plan to optimize sort performance in the future?
  3. In practice, how should we quickly incorporate these optimization suggestions into Velox?

Design doc: https://docs.google.com/document/d/1wa1lbbR-bhf0eg1mSaH7JUzeG7vhwz94a6ElUTK0J8k/edit?usp=sharing

mbasmanova commented 1 year ago

CC: @oerling @pedroerp

mbasmanova commented 1 year ago

Hi @skadilover, welcome to the community. Would you introduce yourself? Where are you from?

The optimizations you described above sound interesting. Are you interested in trying them out, iterating and eventually incorporating them into Velox?

skadilover commented 1 year ago

@mbasmanova I am an engineer from Alibaba; we are trying to use Velox to speed up Presto jobs. Very glad that the community is interested in these ideas. To add a couple of things not mentioned in the description above:

  1. The core idea of 'prefix sort' is similar to the 'Binary String Comparison' described in the DuckDB article (https://duckdb.org/2021/08/27/external-sorting.html#binary-string-comparison): 'It encodes all columns in the ORDER BY clause into a single binary sequence'. In our case we call it the 'prefix'.
  2. Based on "remove null check" and "prefix sort", I have completed a prototype. In a single-column BIGINT scenario, an E2E test of the sort operator shows a 40% performance improvement. I am organizing the code in the hope of integrating it into the community.

Finally, I currently need some help and guidance on how to benchmark the sort operator. I have not found any examples of benchmarking an operator. Can anyone give me some advice?

mbasmanova commented 1 year ago

I am an engineer from Alibaba; we are trying to use Velox to speed up Presto jobs.

Welcome. What's your name? I only see the GitHub handle skadilover.

I'm curious whether the optimizations to Presto you mentioned earlier are available in the prestodb repo or if these are internal enhancements.

mbasmanova commented 1 year ago

Based on "remove null check" and "prefix sort", I have completed a prototype. In a single-column BIGINT scenario, an E2E test of the sort operator shows a 40% performance improvement. I am organizing the code in the hope of integrating it into the community.

Fantastic. Looking forward to the PR.

mbasmanova commented 1 year ago

Finally, I currently need some help and guidance on how to benchmark the sort operator. I have not found any examples of benchmarking an operator. Can anyone give me some advice?

Check out velox/functions/prestosql/aggregates/benchmarks/ReduceAgg.cpp

This is a benchmark for the reduce_agg aggregate function; it benchmarks a mini query (table scan + aggregation). I suggest you write a similar one to benchmark table scan + order-by. One may object that this doesn't benchmark order-by in isolation, but that's OK. After all, there is no point in speeding up a portion of a query that isn't using a significant amount of CPU time to begin with. E2E query speedup is what we are after.

Hope this helps.

mbasmanova commented 1 year ago

The core idea of 'prefix sort' is similar to the 'Binary String Comparison' described in the DuckDB article (https://duckdb.org/2021/08/27/external-sorting.html#binary-string-comparison): 'It encodes all columns in the ORDER BY clause into a single binary sequence'. In our case we call it the 'prefix'.

This makes a lot of sense. Excited to see this optimization coming to Velox.

CC: @amitkdutta @aditi-pandit

skadilover commented 1 year ago

I am an engineer from Alibaba; we are trying to use Velox to speed up Presto jobs.

Welcome. What's your name? I only see the GitHub handle skadilover.

My name is Heng Jiang.

I'm curious whether the optimizations to Presto you mentioned earlier are available in the prestodb repo or if these are internal enhancements.

Not available in the prestodb repo, these are internal enhancements.

skadilover commented 1 year ago

Fantastic. Looking forward to the PR.

Next week is our National Day holiday (10 days). After the holiday, I will continue to organize the code and submit the PR as soon as possible.

mbasmanova commented 1 year ago

@skadilover Nice to meet you, Heng. Enjoy the holiday. Looking forward to hearing again from you in a couple of weeks.

zhouyuan commented 1 year ago

@mbasmanova @skadilover here's one related patch replacing std::sort with TimSort: https://github.com/facebookincubator/velox/pull/6745 TimSort behaves very well on real-world data; here's a good intro: https://hackernoon.com/timsort-the-fastest-sorting-algorithm-youve-never-heard-of-36b28417f399 We got up to ~2x performance on customer workloads.

thanks, -yuan

pedroerp commented 1 year ago

Hi guys, sorry about the delay following up on this thread. I recently helped review this great paper about sorting optimization (which eventually got published in TODS); linking it here for folks who are interested in the subject:

https://arxiv.org/pdf/2209.08420.pdf

This all sounds very interesting. It would be super helpful if, as a first step, we established a microbenchmark so we can tweak and reproduce these optimizations.

A few specific questions:

  1. Have you guys also looked at offset-value coding? From reading the paper it seems like a great idea, but I have never seen a real system implement it - the Google folks on the paper claim they use it internally, though.

  2. "The improvement we made is that we use the min/max of T according to compareFlags when storing." How does this work if you actually have the min/max values as part of the data?

  3. "Prefix sort": this sounds like the same idea as what we have implemented for StringViews. Does that not help in this situation, or do we only need a similar optimization for other variable-length types?

skadilover commented 1 year ago

Hi guys, sorry about the delay following up on this thread. I recently helped review this great paper about sorting optimization (which eventually got published in TODS); linking it here for folks who are interested in the subject:

https://arxiv.org/pdf/2209.08420.pdf

@pedroerp Glad to hear from you on this issue! The paper mentioned above is very useful.

PrefixSort is not just a normalized-key technique: unlike Velox's existing storage structure (RowContainer), it encodes the sort keys separately, as DuckDB does. At the same time, to avoid degrading the swap method when the sort keys are particularly long, we only encode the first few columns into the prefix.

To facilitate communication, I submitted PR https://github.com/facebookincubator/velox/pull/7230 (it is not 100% ready yet; some code will be adjusted based on my E2E test results), and the design document is also in preparation (it will take a few days). If you have time, could you take a look at this PR first? @mbasmanova @pedroerp

skadilover commented 1 year ago
  1. Have you guys also looked at offset-value coding? From reading the paper it seems like a great idea, but I have never seen a real system implement it - the Google folks on the paper claim they use it internally, though.

@pedroerp Yes, for some specific columns we may adopt the offset-value encoding format, but that work may come later. I hope to optimize the single-column bigint/varchar scenarios first, especially radix sort for the bigint scenario.

  1. "The improvement we made is that we use the min/max of T according to compareFlags when storing." How does this work if you actually have the min/max values as part of the data?

We directly store the min/max instead of T() (which is what Velox does now).

  1. "Prefix sort": this sounds like the same idea as what we have implemented for StringViews. Does that not help in this situation, or do we only need a similar optimization for other variable-length types?

Like question 1, I think this is an encoding issue. I will discuss it case by case in the subsequent PRs.

skadilover commented 1 year ago

This all sounds very interesting. It would be super helpful if, as a first step, we established a microbenchmark so we can tweak and reproduce these optimizations.

@pedroerp
At present, I can only verify the optimization results through E2E testing (our framework is similar to presto-cpp). The sort optimization results are affected by the size of the data set; I am currently using a 1TB TPC-H table. Regarding building a similar micro-benchmark scenario, could you give me some reference examples? That would be very helpful for my current work.

pedroerp commented 1 year ago

Regarding building a similar micro-benchmark scenario, could you give me some reference examples? That would be very helpful for my current work.

This is a good example. Maybe come up with something similar that generates and sorts a dataset; then we could experiment with your optimizations and see how much they improve performance?

https://github.com/facebookincubator/velox/blob/main/velox/exec/benchmarks/MergeBenchmark.cpp

skadilover commented 1 year ago

Regarding building a similar micro-benchmark scenario, could you give me some reference examples? That would be very helpful for my current work.

This is a good example. Maybe come up with something similar that generates and sorts a dataset; then we could experiment with your optimizations and see how much they improve performance?

https://github.com/facebookincubator/velox/blob/main/velox/exec/benchmarks/MergeBenchmark.cpp

Thanks a lot, this example looks like it could work. I will try it soon.

mbasmanova commented 1 year ago

@skadilover Wondering if you had a chance to review Arrow Row format: https://docs.rs/arrow-row/latest/arrow_row/index.html

"Rows are normalized for sorting, and can therefore be very efficiently compared, using memcmp under the hood, or used in non-comparison sorts such as radix sort. This makes the row format ideal for implementing efficient multi-column sorting, grouping, aggregation, windowing and more, as described in more detail in this blog post."

mbasmanova commented 1 year ago

Some additional references: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1080&rep=rep1&type=pdf

skadilover commented 1 year ago

Some additional references: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1080&rep=rep1&type=pdf

@mbasmanova Thanks for the references above.

The encoding techniques (including "normalized key", "memory compare", etc.) can be very efficient, and we implemented them in our internal Presto before. For the PrefixSort optimization, I hope to implement them step by step. This week I plan to prepare a design document (including some research, E2E tests, and benchmarks) to describe my plan in detail.

mbasmanova commented 1 year ago

@skadilover Looking forward to hearing more from you on this topic. Thanks.

skadilover commented 1 year ago

Overview

PrefixSort's optimization plan covers two optimization points: normalized-key optimization and CPU-cache optimization.

Normalized Key for Multiple Sort Keys

The core idea of the normalized key is: if the sort keys can be encoded into a binary string with the sorting semantics unchanged, then all sort keys can be compared as a whole (see [1]), and the memcmp can also be accelerated with SIMD. With the implementation in the submitted PR [https://github.com/facebookincubator/velox/pull/7230], in the two-column BIGINT scenario we see a 200%-300% performance improvement. (Benchmark results were attached as images in the original post.)

PrefixSort currently supports BIGINT, and our research so far suggests the other fixed-length types can be implemented as well. The VARCHAR scenario will be a little more complicated. We also studied DuckDB's implementation, which takes a somewhat special approach: range sort is used to process varchar. Further research and testing of the encoding format will be done to improve this part of the design.

Memory Data Structure

To implement the normalized-key technique described above in Velox, we are considering a new data structure to store the normalized key, for the following two reasons:

  1. The sort algorithm requires an iterator with random-access semantics (see [3][4]), while RowContainer currently stores row data with non-contiguous allocation. A random-access iterator implemented over non-contiguous memory has performance problems and requires complex pointer mapping.
  2. The existing normalized-key field in RowContainer does not serve sorting and does not carry compareFlags semantics. In addition, the handling of null values also affects the implementation of the compare method.

First, we review Velox's current OrderBy operator implementation (diagram attached in the original post).

The general process of implementation is:

  1. Get the row-pointer vector std::vector<char*> returnRows via RowContainer's listRows() method. This step exists because the sort algorithm needs a contiguous memory block on which to implement its iterator.
  2. Pass the begin/end iterators of returnRows to std::sort. The custom compare provided to std::sort first fetches the row pointer from returnRows, then uses that pointer to access the data in the RowContainer. From the CPU-cache perspective, the problem with this implementation is that sort-algorithm research generally assumes memory access is constant-time and cheap. That assumption is wrong in engineering practice, where memory-access efficiency must be considered. A classic example is given in [2].

Returning to the OrderBy operator implementation, the compare method in the sort loop must access two memory blocks at once (the address vector and the row container). As memory grows, memory-access costs significantly affect sort efficiency. The solution is to store the sort keys and row numbers separately; DuckDB is an example (see the diagram in the original post). In this design the compare method gets faster, while the swap method gets slower (swapping all the sort keys is slower than swapping only row addresses). Once the number of sort columns grows past a certain point, the swap overhead exceeds the compare benefit, so a fallback mechanism is needed. The design of PrefixSort uses the row ptr in place of the row num. The purpose of this is:

  1. There can be a fallback mechanism when the size of the sort keys exceeds a certain value.
  2. Using the row ptr also fits the current Velox operator implementation better and simplifies subsequent output operations.
  3. Data in the RowContainer can be reused without re-storing the payload.

In an E2E test on a single BIGINT column we see a 30%-40% gain. It would actually be faster to read the value with reinterpret_cast, but for a unified normalized-key implementation we use memcmp for comparison; for long keys, memcmp is slower than reinterpret_cast. The implementation uses DuckDB's templated memcmp in place of std::memcmp, which helps somewhat but is still slower than the reinterpret_cast approach. In addition, directly using a std::vector to store the keys in the prototype yields an even larger improvement; however, considering the maintenance cost of the overall code and a possible later switch of the sort algorithm to radix sort, we are not pursuing that optimization for now.
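A fixed-size compare in the spirit of the templated memcmp mentioned above might look like this (a sketch; `__builtin_bswap64` assumes GCC/Clang on a little-endian machine):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Compare kWidth bytes with the same semantics as std::memcmp, but with
// the width known at compile time so the compiler emits wide loads
// instead of a generic call. Illustrative only.
template <size_t kWidth>
int fixedMemcmp(const uint8_t* a, const uint8_t* b) {
  if constexpr (kWidth == 0) {
    return 0;
  } else if constexpr (kWidth >= 8) {
    uint64_t x, y;
    std::memcpy(&x, a, 8);
    std::memcpy(&y, b, 8);
    if (x != y) {
      // memcmp semantics are big-endian byte order, so byte-swap the
      // little-endian loads before the integer compare.
      x = __builtin_bswap64(x);
      y = __builtin_bswap64(y);
      return x < y ? -1 : 1;
    }
    return fixedMemcmp<kWidth - 8>(a + 8, b + 8);
  } else {
    return std::memcmp(a, b, kWidth);
  }
}
```

The compile-time width lets the whole comparison for a short prefix collapse into a few 8-byte loads and one branch.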


Conclusion and plan

By optimizing the in-memory data structure and applying the normalized-key technique, PrefixSort improves on the original implementation in both the single-column and multi-column cases. The work is planned in 4 stages to reach full functionality.

How To Reproduce Optimization

Run PrefixSortBenchmark.cpp and OrderByTest.cpp.

References:

[1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1080&rep=rep1&type=pdf
[2] https://courses.cs.washington.edu/courses/cse373/21sp/lectures/lecture24.pdf
[3] https://en.cppreference.com/w/cpp/iterator/random_access_iterator
[4] https://en.cppreference.com/w/cpp/algorithm/sort

skadilover commented 1 year ago

@mbasmanova updates on the topic; also see the PR: https://github.com/facebookincubator/velox/pull/7230

mbasmanova commented 1 year ago

@skadilover Thank you for the detailed description of the proposed design. I have some questions to make sure my understanding is correct. Does this design involve making a copy of the sorting keys and row pointers for all rows using a contiguous allocation? I'm thinking of the table with 2 columns (normalized key and row ptr) in the diagram.

skadilover commented 1 year ago

@skadilover Thank you for the detailed description of the proposed design. I have some questions to make sure my understanding is correct. Does this design involve making a copy of the sorting keys and row pointers for all rows using a contiguous allocation? I'm thinking of the table with 2 columns (normalized key and row ptr) in the diagram.

Yes, there is a copy.

mbasmanova commented 1 year ago

@skadilover Got it. What are your thoughts on the limit of the normalized key size? There can be lots of columns and some columns can be variable-width, hence, I assume we need to put a hard limit on the "prefix" of the normalized key. Also, can you extend the design to support complex types as well?

skadilover commented 1 year ago

@skadilover Got it. What are your thoughts on the limit of the normalized key size? There can be lots of columns and some columns can be variable-width, hence, I assume we need to put a hard limit on the "prefix" of the normalized key.

  1. If the column type to be sorted can be normalized and is a small fixed size, such as BIGINT, I think there is no need to limit the size of the prefix. The reason is that, as the benchmark results show, execution time in this scenario does not increase significantly as the number of columns increases.

  2. If only part of a key can be normalized, such as a StringView, the remaining parts must be compared through pointers. From a design perspective, if the remaining columns can be normalized, we could continue with memcmp after the pointer comparison, but this goes against our original intention of treating the prefix as a whole and comparing it with a single memcmp. In this scenario I think it is better to store only the leading normalizable columns in the prefix. When VARCHAR scenarios are supported in the future, the choice will be made based on benchmark results. In the current code design, we can flexibly define the behavior through PrefixSortLayout (which holds the relevant information). For variable-width types there are the two choices shown in the diagram in the original post; I prefer plan B right now, because in real scenarios there may not be that many columns to sort.

  3. In some special scenarios, our customers know the data distribution. For example, if the first sort column has high selectivity, there is no need to store the remaining sort columns in the prefix (especially when many columns remain). We hope customers can control the prefix size through a SQL hint, leaving room for scenario-specific optimization; this could also feed into adaptive techniques.

  4. swap (unlike with a row ptr, we swap the whole prefix), copy, and contiguous allocation may involve overhead; this also argues for a hard limit on the prefix size of the normalized key.

Therefore we'd better provide a config option for users to set the prefix-size limit; in our practice there is no limit by default.

Also, can you extend the design to support complex types as well?

If necessary, we can store some potential sorting prefixes in the prefix and use the row ptr for comparison (simply calling RowContainer's compare method). As with Varchar, if a complex type cannot benefit from memcmp, I feel it may be better to compare through pointers only.

@mbasmanova

skadilover commented 1 year ago

@mbasmanova the PR closed above is not used; we will still use https://github.com/facebookincubator/velox/pull/7230 to merge the code.

skadilover commented 1 year ago

@mbasmanova Are there any more questions on this topic? Could you help me review the code?

mbasmanova commented 1 year ago

@skadilover Thank you for working on this. Took a quick look and left some initial comments on the PR.

mbasmanova commented 1 year ago

@skadilover

Therefore we'd better provide a config option for users to set the prefix-size limit; in our practice there is no limit by default.

I think we need to have a sensible limit by default. I'm also interested in figuring out support for varchar and row() keys as these are common.

skadilover commented 1 year ago

I think we need to have a sensible limit by default.

I will check it later.

I'm also interested in figuring out support for varchar and row() keys as these are common.

Our previous way of dealing with varchar was to 1. analyze the maximum data length before sorting and then process the values at that fixed size, and 2. keep only the prefix for values exceeding the length limit. I hope to find a better way; I am still doing the relevant research.
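The max-length-then-pad approach for varchar can be sketched as follows (hypothetical helper, not the actual implementation): scan for the maximum length, cap it at a limit, then zero-pad every value to that fixed width so the keys become memcmp-comparable.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Encode variable-length strings as fixed-width, byte-comparable keys:
// width = min(max length in batch, lengthLimit); shorter values are
// zero-padded, longer values keep only the prefix. Illustrative only.
std::vector<std::string> padToFixedWidth(
    const std::vector<std::string>& values, size_t lengthLimit) {
  size_t maxLen = 0;
  for (const auto& v : values) {
    maxLen = std::max(maxLen, v.size());
  }
  size_t width = std::min(maxLen, lengthLimit);
  std::vector<std::string> out;
  out.reserve(values.size());
  for (const auto& v : values) {
    std::string key = v.substr(0, width); // keep only the prefix if too long
    key.resize(width, '\0');              // zero-pad shorter values
    out.push_back(std::move(key));
  }
  return out;
}
```

Values that collide after truncation would still need a fallback compare on the full strings, which is the open question being researched.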

@mbasmanova

skadilover commented 1 year ago

@mbasmanova As we discussed before, I have started splitting the original PR into 3 sub-PRs to be merged according to their dependencies. Could you help me review the first PR?

yingsu00 commented 7 months ago

@skadilover How does prefix sort compare with radix sort and Google's SIMD quicksort? On fixed-length integer types, a combined MSB/LSB radix sort can easily be 3x faster than std::sort at the scale of millions of values and up. Do you have benchmarks comparing prefix sort with these algorithms? My current impression is that prefix sort may be good for sorting strings, but in general comparison-based sorting algorithms can't beat radix-style sorting on fixed-width integers at large scale.

skadilover commented 7 months ago

@skadilover How does prefix sort compare with radix sort and Google's SIMD quicksort? On fixed-length integer types, a combined MSB/LSB radix sort can easily be 3x faster than std::sort at the scale of millions of values and up. Do you have benchmarks comparing prefix sort with these algorithms? My current impression is that prefix sort may be good for sorting strings, but in general comparison-based sorting algorithms can't beat radix-style sorting on fixed-width integers at large scale.

The core idea of this optimization is to use a binary string to optimize multi-column sort; quicksort is used just to speed things up.

jinchengchenghh commented 5 months ago

We will also try to optimize sort. The initial proposal: currently, the common usage requests space for each row individually. First, we can request memory for all the rows at once. newRow() applies only to fixed-size columns; maybe we can optimize string columns the same way.

// Current pattern: one allocation (newRow) per input row.
for (auto row = 0; row < input->size(); ++row) {
  char* newRow = inputData_->newRow();
  for (auto i = 0; i < inputs_.size(); ++i) {
    inputData_->store(decodedInputs_[i], row, newRow, i);
  }
  addNewRow(groups[row], newRow);
}

Secondly, we currently spill input under memory pressure. We could instead request all the memory in the addInput function and test whether the memory is available before requesting it; if not, sort with prefix sort and spill the output as Spark does, so that the sort itself requests no memory. What Spark does is:

  1. Insert record and check if the memory is enough, if not, sort and spill https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L491
  2. Spill and sort: https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L234
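The reserve-up-front-or-spill control flow described above might look like the following sketch; `MemoryPool` and `addInputOrSpill` are hypothetical stand-ins, not Velox's actual interfaces:

```cpp
#include <cassert>
#include <cstddef>

// Toy memory pool with a hard capacity. Illustrative only.
struct MemoryPool {
  size_t capacity;
  size_t used = 0;
  bool maybeReserve(size_t bytes) {
    if (used + bytes > capacity) {
      return false;
    }
    used += bytes;
    return true;
  }
};

// Try to reserve memory for the whole batch before buffering it.
// On failure, sort what is already in memory (e.g. with prefix sort,
// which needs no further allocation here) and spill the sorted output.
// Returns true if the batch triggered a spill.
bool addInputOrSpill(MemoryPool& pool, size_t batchBytes, bool& spilled) {
  if (pool.maybeReserve(batchBytes)) {
    // ... buffer the rows in the RowContainer (elided) ...
    return false;
  }
  // ... sort in-memory rows and spill sorted runs (elided) ...
  spilled = true;
  return true;
}
```

The point of testing the reservation first is that the sort path after a failed reservation must not allocate, which is why the prefix-sort-then-spill ordering matters.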

I will add a sort benchmark and a sort-spill benchmark to track the performance. The first PR is https://github.com/facebookincubator/velox/pull/10041 Please let me know if it is acceptable. Thanks!

skadilover commented 5 months ago

@jinchengchenghh
I think the feature you proposed does not conflict with the prefix-sort optimization. You could file a separate issue to let the code maintainers know; that may make things progress faster.

jinchengchenghh commented 5 months ago

Yes - since this issue's title is "Optimize sort", the same topic as mine, I added it here. I would like to apply the second optimization after prefix sort is merged; otherwise it will be hard to enable prefix sort, because it requires that sort not request memory.