Open skadilover opened 1 year ago
CC: @oerling @pedroerp
Hi @skadilover, welcome to the community. Would you introduce yourself? Where are you from?
The optimizations you described above sound interesting. Are you interested in trying them out, iterating and eventually incorporating them into Velox?
@mbasmanova I am an engineer from Alibaba; we are trying to use Velox to speed up Presto jobs. Very glad that the community is interested in these ideas. Let me add something that was not mentioned in the description above:
Finally, I currently need some help and guidance on how to benchmark the sort operator. I have not found any examples of benchmarking an operator. Can anyone give me some advice?
I am an engineer from Alibaba, we are trying to use Velox to speed up Presto jobs.
Welcome. What's your name? I only see the GitHub handle skadilover.
I'm curious whether the optimizations to Presto you mentioned earlier are available in the prestodb repo or if these are internal enhancements.
Based on "remove null check" and "prefix sort", I have completed a prototype. In a single-column BIGINT scenario, an E2E test of the sort operator showed a 40% performance improvement. I am organizing the code in the hope of integrating it into the community.
Fantastic. Looking forward to the PR.
Finally, I currently need some help and guidance on how to benchmark the sort operator. I have not found any examples of benchmarking an operator. Can anyone give me some advice?
Check out velox/functions/prestosql/aggregates/benchmarks/ReduceAgg.cpp
This is a benchmark for the reduce_agg aggregate function which benchmarks a mini query (table scan + aggregation). I suggest you write a similar one to benchmark table scan + order-by. One may comment that this doesn't benchmark order-by in isolation, but that's okay. After all, there is no point speeding up a portion of the query that doesn't use a significant amount of CPU time to begin with. E2E query speedup is what we are after.
Hope this helps.
The core idea of 'prefix sort' is similar to the 'Binary String Comparison' described in the DuckDB article (https://duckdb.org/2021/08/27/external-sorting.html#binary-string-comparison): 'It encodes all columns in the ORDER BY clause into a single binary sequence.' In our case this sequence is called the 'prefix'.
This makes a lot of sense. Excited to see this optimization coming to Velox.
CC: @amitkdutta @aditi-pandit
I am an engineer from Alibaba, we are trying to use Velox to speed up Presto jobs.
Welcome. What's your name? I only see the GitHub handle skadilover.
My name is Heng Jiang.
I'm curious whether the optimizations to Presto you mentioned earlier are available in the prestodb repo or if these are internal enhancements.
Not available in the prestodb repo, these are internal enhancements.
Fantastic. Looking forward to the PR.
Next week is our National Day holiday (10 days). After the holiday, I will continue to organize the code and submit the PR as soon as possible.
@skadilover Nice to meet you, Heng. Enjoy the holiday. Looking forward to hearing again from you in a couple of weeks.
@mbasmanova @skadilover Here's one related patch on s/std::sort/timsort: https://github.com/facebookincubator/velox/pull/6745. TimSort behaves very well on real-world data; here's a good intro: https://hackernoon.com/timsort-the-fastest-sorting-algorithm-youve-never-heard-of-36b28417f399. We got up to ~2x performance on customer workloads.
thanks, -yuan
Hi guys, sorry about the delay to follow up on this thread. I've recently helped review this great paper about sorting optimization (which eventually got published at TODS); linking it here for folks who are interested in the subject:
https://arxiv.org/pdf/2209.08420.pdf
This sounds all very interesting. It would be super helpful if as a first step we established a microbenchmark to be able to tweak and reproduce these optimizations.
A few specific questions:
Have you guys also looked at offset-value coding? From reading the paper it seems like a great idea, but I have never seen a real system implementing it; the Google folks on the paper claim they use it internally though.
"The improvement we made is that we use the min/max of T according to compareFlags when storing, " How does this work if you actually have the min/max values as part of the data?
"Prefix sort" sounds like the same idea as what we have implemented for StringViews. Does that not help in this situation, or do we only need a similar optimization for other variable-length types?
Hi guys, sorry about the delay to follow up on this thread. I've recently helped review this great paper about sorting optimization (which eventually got published at TODS); linking it here for folks who are interested in the subject:
https://arxiv.org/pdf/2209.08420.pdf
@pedroerp Glad to hear your response on this issue! The paper mentioned above is very useful.
PrefixSort is not just a normalized-key technique: it differs from Velox's existing storage structure (RowContainer) in that it encodes the sort keys separately, like DuckDB does. At the same time, to avoid performance degradation of the swap method in scenarios where the sort keys are particularly long, we only encode the first few columns in the form of a prefix.
To facilitate communication, I submitted a PR https://github.com/facebookincubator/velox/pull/7230 (it is not 100% ready yet; some code will be adjusted based on my E2E test results), and the design document is also being prepared (it will take a few days). If you have time, could you help me take a look at this PR first? @mbasmanova @pedroerp
- Have you guys also looked at offset-value coding? From reading the paper it seems like a great idea, but I have never seen a real system implementing it; the Google folks on the paper claim they use it internally though.
@pedroerp Yes, for some specific columns we may refer to the offset-value encoding format, but that work may come later. I hope to optimize the single-column bigint/varchar scenario first, especially radix sort in the bigint scenario.
- "The improvement we made is that we use the min/max of T according to compareFlags when storing, " How does this work if you actually have the min/max values as part of the data?
Directly store the min/max instead of T() (which is what Velox does now).
- "Prefix sort" sounds like the same idea as what we have implemented for StringViews. Does that not help in this situation, or do we only need a similar optimization for other variable-length types?
Like question 1, I think this is an encoding issue. I will discuss it case by case in the subsequent PRs.
This sounds all very interesting. It would be super helpful if as a first step we established a microbenchmark to be able to tweak and reproduce these optimizations.
@pedroerp
At present, I can only verify the optimization results through E2E testing (our framework is similar to presto-cpp). The sort optimization results are affected by the size of the dataset; I am currently using a 1T TPC-H table. Regarding building a microbenchmark for similar scenarios, could you give me some reference examples? That would be very helpful in my current work.
Regarding building a microbenchmark for similar scenarios, could you give me some reference examples? That would be very helpful in my current work.
This is a good example. Maybe come up with something similar that generates and sorts a dataset; then we could experiment with your optimizations and see how much they improve performance?
https://github.com/facebookincubator/velox/blob/main/velox/exec/benchmarks/MergeBenchmark.cpp
Thanks a lot. This example looks like it could work; I will try it soon.
@skadilover Wondering if you had a chance to review Arrow Row format: https://docs.rs/arrow-row/latest/arrow_row/index.html
"Rows are normalized for sorting, and can therefore be very efficiently compared, using memcmp under the hood, or used in non-comparison sorts such as radix sort. This makes the row format ideal for implementing efficient multi-column sorting, grouping, aggregation, windowing and more, as described in more detail in this blog post."
Some additional references: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1080&rep=rep1&type=pdf
@mbasmanova Thanks for the references above.
The encoding techniques (including "normalized key", "memory compare", etc.) can be very efficient, and we implemented them in our internal Presto before. In the PrefixSort optimization, I hope to implement them step by step. This week I plan to prepare a design document (including some research E2E tests and benchmarks) to describe my plan in detail.
@skadilover Looking forward to hearing more from you on this topic. Thanks.
PrefixSort's optimization plan includes two optimization points: normalized-key optimization and CPU-cache optimization.
The core idea of the normalized key is: assuming the sort keys can be encoded into a binary string while the sorting semantics remain unchanged, all sort keys can then be treated as a whole for sorting (see [1]), and memcmp can also use SIMD for speedup. In the implementation of the submitted PR (https://github.com/facebookincubator/velox/pull/7230), in the scenario of two BIGINT columns, we get a performance improvement of 200%-300%. The benchmark test results are as follows:
PrefixSort currently supports BIGINT, and our current research suggests the other fixed-length types can be implemented the same way. The VARCHAR scenario will be a little more complicated. We also studied DuckDB's implementation, which uses a somewhat special method: range sort is used to process varchar. We will do further research and testing of the encoding format to improve this part of the design:
To implement the above normalized-key technique in Velox, we consider redesigning a data structure to store the normalized key, for the following two reasons:
First, we review the current OrderBy operator implementation of velox, as shown in the figure below:
The general process of implementation is:
Going back to the implementation of the OrderBy operator, we can see that the compare method in the sort loop needs to access two memory blocks (the address vector and the row container) at the same time. As the memory grows, memory-access costs significantly affect the efficiency of sort execution. The solution to this problem is to store the sort key and the row number separately; take DuckDB as an example. In this scheme, the performance of the compare method improves, while the performance of the swap method decreases (swapping all sort keys is slower than swapping only row addresses). When the number of sort columns grows beyond a certain point, the swap overhead exceeds the compare benefit, so a fallback mechanism is needed. The design of PrefixSort uses a row pointer to replace the row number. The purpose of this is:
In the E2E test of single-column BIGINT, we get a 30%-40% gain. In fact, it would be faster if we used reinterpret_cast
By optimizing the memory data structure and using the normalized-key technique, PrefixSort shows performance improvements over the original implementation in both the single-column and multi-column cases. The work is planned in 4 stages to reach complete functionality.
Just run PrefixSortBenchmark.cpp and OrderByTest.cpp.
[1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1080&rep=rep1&type=pdf [2] https://courses.cs.washington.edu/courses/cse373/21sp/lectures/lecture24.pdf [3] https://en.cppreference.com/w/cpp/iterator/random_access_iterator [4] https://en.cppreference.com/w/cpp/algorithm/sort
@mbasmanova Updates on the topic; see also the PR: https://github.com/facebookincubator/velox/pull/7230
@skadilover Thank you for detailed description of the proposed design. I have some questions to make sure my understanding is correct. Does this design involve making a copy of sorting keys and row pointers for all rows using continuous allocation? I'm thinking about the table with 2 columns (normalized key and row ptr) in the diagram.
Yes, there is a copy.
@skadilover Got it. What are your thoughts on the limit of the normalized key size? There can be lots of columns and some columns can be variable-width, hence, I assume we need to put a hard limit on the "prefix" of the normalized key. Also, can you extend the design to support complex types as well?
@skadilover Got it. What are your thoughts on the limit of the normalized key size? There can be lots of columns and some columns can be variable-width, hence, I assume we need to put a hard limit on the "prefix" of the normalized key.
If the columns to be sorted can be normalized and are of a small fixed size, such as BIGINT, I think there is no need to limit the size of the prefix. The reason is that, as seen in the benchmark results, the execution time in this scenario does not increase significantly as the number of columns increases.
If only one part can be normalized, such as a StringView, the remaining parts need to be compared through pointers. From a design perspective, if the remaining columns can be normalized, we could continue with memcmp after the pointer comparison, but this goes against our original intention of treating the prefix as a whole and using a single memcmp. In this scenario I think it would be better to store only the first normalized column in the prefix. When VARCHAR scenarios are supported in the future, the choice will be made based on benchmark results. In the current code design, we can flexibly define the behavior through PrefixSortLayout (which holds the relevant information). For variable-width types, we have the two choices below; I prefer plan B right now, because in real scenarios there may not be that many columns that need to be sorted
In some special scenarios, our customers know the data distribution. For example, if the first sort column has high selectivity, there is no need to store the remaining sort columns in the prefix (especially if many columns remain). We hope customers can control the size of the prefix through a SQL hint, leaving room for optimization in specific scenarios; this can also feed into some adaptive techniques.
Swap (unlike with row pointers, we swap the whole prefix), copy, and contiguous allocation may involve overhead; this also argues for a hard limit on the "prefix" of the normalized key.
Therefore we'd better provide a config option for users to set the limit of the prefix size; in our practice there is no limit by default.
Also, can you extend the design to support complex types as well?
If necessary, we can store some potential sorting prefixes in the prefix and use the row pointer for comparison (simply calling RowContainer's compare method). As with Varchar, if a complex type cannot benefit from memcmp, I feel it would be better to use only pointer comparison.
@mbasmanova
@mbasmanova The PR closed above is not used; we still use https://github.com/facebookincubator/velox/pull/7230 to merge the code.
@mbasmanova Are there more questions about this topic? Could you help me review the code?
@skadilover Thank you for working on this. Took a quick look and left some initial comments on the PR.
@skadilover
Therefore we'd better provide a config option for users to set the limit of the prefix size; in our practice there is no limit by default.
I think we need to have a sensible limit by default. I'm also interested in figuring out support for varchar and row() keys as these are common.
I think we need to have a sensible limit by default.
I will check it later.
I'm also interested in figuring out support for varchar and row() keys as these are common.
Our previous method of dealing with varchar was to: 1. analyze the maximum length of the data before sorting and then process it with a fixed size; 2. only retain the prefix for values that exceed the length limit. I hope to find a better way; currently I am still doing the relevant research.
@mbasmanova
@mbasmanova As we discussed before, I have started to split the original PR into 3 sub-PRs and will merge them according to their dependencies. Could you help me review this PR first?
@skadilover How does prefix sort compare with radix sort and Google's SIMD quicksort? On integer types of fixed length, a combined MSB/LSB radix sort can easily be 3x faster than std::sort at the scale of millions of numbers and up. Do you have benchmarks comparing prefix sort with these algorithms? My current impression is that prefix sort may be good for sorting strings, while in general comparison-based sorting algorithms can't beat radix-style sorting on fixed-width integers at large scale.
The core idea of this optimization is to use a binary string to optimize multi-column sort; quicksort is just there to speed things up.
We will also try to optimize sort, initial proposal is:
Currently, the common usage is to request space for each row. Firstly, we can instead request the memory for all rows at once; newRow currently handles only fixed-size columns, and maybe we can optimize string columns the same way:
```cpp
// Current pattern: one newRow() allocation per input row.
for (auto row = 0; row < input->size(); ++row) {
  char* newRow = inputData_->newRow();
  for (auto i = 0; i < inputs_.size(); ++i) {
    inputData_->store(decodedInputs_[i], row, newRow, i);
  }
  addNewRow(groups[row], newRow);
}
```
Secondly, we currently spill input under memory pressure. We could instead request all the memory in the addInput function and test whether the memory is available before requesting it; if not, sort with PrefixSort and spill the output, as Spark does. In that case sort will not request any memory.
What Spark does is:
I will add a sort benchmark and a sort-spill benchmark to track the performance. The first PR is https://github.com/facebookincubator/velox/pull/10041. Please let me know if it is acceptable. Thanks!
@jinchengchenghh
I think the feature you proposed has no conflict with the prefix-sort optimization.
I think you can file a separate issue and let the code maintainers know, which may make things progress faster.
Yes. Because this issue's title is "Optimize sort", the same topic as mine, I added it here. I would like to apply the second optimization after prefix sort is merged; otherwise it will be hard to enable PrefixSort, because it requires that sort not request memory.
Description
We are replacing Presto in our system with Velox, but we have done a lot of in-house performance optimization on top of Presto. We found a performance gap between directly replacing Presto with Velox and using our optimized Presto. In the specific sort scenario, we concluded after analysis that the gap may be caused by misalignment of the processing methods in the following places:
Null value check: when a null value in Velox is stored in the RowContainer, it is stored as a null flag plus T's default value. When comparing, the null flag is read first, and then the data. The improvement we made is to store the min/max of T according to the compareFlags, and to read the data directly when comparing, eliminating the overhead of reading the null flag.
Prefix sort: in compare(), the data is read directly from the underlying data structure. Whether in Presto or Velox, there is some additional memory-access overhead. In Presto, a compare requires reading data from different columns; in Velox, the data is stored in a row-based manner, and besides the sort key the RowContainer also contains null flags, normalized data, and the payload. Either way, the data required for sorting (the sort key) is stored in a larger memory block than actually required, causing additional cache misses. The optimization we did was to introduce a prefix column in PagesIndex/RowContainer, which is a vector<long> or vector<byte[]>: 1) Byte/Integer/Bigint values are stored directly in the prefix; 2) for Varchar, the first 8 bytes of the string are stored in the prefix; 3) for multiple columns, a fixed-length byte[] stores the prefix: Byte/Integer/Bigint values are written into it, and when the first Varchar is encountered, its bytes fill the remaining length of the prefix and the remaining sort keys are ignored. In compare(), the prefix is compared first; only if the prefixes are equal is the data read from the underlying data structure. If the prefix has high selectivity, sort performance improves.
Range sort: in our system, we previously implemented "range sort" based on the column-based data structure in PagesIndex (Presto). The general logic is: suppose there is a sort statement ORDER BY a, b, c. First sort column a; if there is repeated data in column a, use the repeated values to construct several ranges, then sort column b only within each range, and process column c the same way. However, this algorithm is based on a column-based data structure, while the RowContainer currently stores data in a row-based manner. Can the RowContainer support column-based storage? To summarize the issues:
Design doc: https://docs.google.com/document/d/1wa1lbbR-bhf0eg1mSaH7JUzeG7vhwz94a6ElUTK0J8k/edit?usp=sharing