apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0

[Proposal] Vectorization query optimization for Doris #3438

Open HappenLee opened 4 years ago

HappenLee commented 4 years ago

Motivation

At present, the underlying storage in Doris is column storage, but query execution first performs a column-to-row conversion before handing data to the query layer. Such an implementation may cause performance problems.

So we want to transform the query layer of Doris to vectorized execution, so that data is not only stored by columns but also processed by vectors (parts of columns), which allows achieving high CPU efficiency. This can benefit query performance.
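To illustrate why processing by vectors helps, here is a minimal, hypothetical sketch (not Doris code; all names are stand-ins) contrasting a row-at-a-time max aggregation with a columnar one. The columnar loop walks one contiguous array, which the compiler can auto-vectorize and the CPU can prefetch, while the row loop strides over unrelated fields:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical row layout: the engine materializes whole rows, so the
// aggregation loads every field of every row even though only `value`
// is needed.
struct Row {
    int64_t key;
    int64_t value;
};

// Row-at-a-time: the stride over whole Row objects hurts cache locality
// and defeats SIMD.
inline int64_t max_row_at_a_time(const std::vector<Row>& rows) {
    int64_t m = INT64_MIN;
    for (const Row& r : rows) {
        if (r.value > m) m = r.value;
    }
    return m;
}

// Column-at-a-time: a tight loop over one contiguous column that the
// compiler can unroll and auto-vectorize.
inline int64_t max_columnar(const std::vector<int64_t>& col) {
    int64_t m = INT64_MIN;
    for (int64_t v : col) {
        if (v > m) m = v;
    }
    return m;
}
```

Both functions compute the same result; the difference is purely in memory layout and per-row overhead, which is where the CPU-efficiency gain comes from.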

I implemented a simple POC to verify whether there is a performance improvement.

Test environment:

Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
2 physical CPU package(s), 24 physical CPU core(s), 48 logical CPU(s)
Identifier: Intel64 Family 6 Model 79 Stepping 1
ProcessorID: F1 06 04 00 FF FB EB BF
Context Switches/Interrupts: 12174692729137 / 297015608902

Memory: 119.5 GiB/125.9 GiB



Single FE and Single BE in the same server.

#### Test

  * Modify the logic of Doris' query layer to support vectorized aggregation over column storage during aggregation calculations, and record the time spent on row-to-column conversion:

![Add a counter in NewPartitionedAggregationNode and print it in the destructor](https://upload-images.jianshu.io/upload_images/8552201-87de31d5ffa0a4d7.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Calculate the time lost in row-to-column conversion:
![](https://upload-images.jianshu.io/upload_images/8552201-0853796248b9ada4.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
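The measurement described above can be sketched roughly as follows (all names here are hypothetical stand-ins, not Doris's actual classes): the exec node owns a counter that accumulates the time spent in row-to-column conversion and prints the total in its destructor, mirroring the "add a counter and print it in the destructor" approach shown in the screenshot:

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical timer owned by an exec node. Each conversion call site
// wraps its work in timed(); the accumulated total is printed once,
// when the node is destroyed.
class ConversionTimer {
public:
    ~ConversionTimer() {
        std::printf("row-to-column conversion: %lld ns\n",
                    static_cast<long long>(total_ns_));
    }

    // Call around each conversion to accumulate its elapsed time.
    template <typename Fn>
    void timed(Fn&& fn) {
        auto start = std::chrono::steady_clock::now();
        fn();
        auto end = std::chrono::steady_clock::now();
        total_ns_ += std::chrono::duration_cast<std::chrono::nanoseconds>(
                             end - start)
                             .count();
    }

    long long total_ns() const { return total_ns_; }

private:
    long long total_ns_ = 0;
};
```

Subtracting this accumulated conversion time from the total query time is what yields the "Convert to column" numbers in the results table below.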

* **Results**
 ```select max(C_PHONE) from customer group by C_MKTSEGMENT;```

  | Statistic | Origin | Convert to column | Origin (multi-thread) | Convert to column (multi-thread) |
  |:-:|:-:|:-:|:-:|:-:|
  | Time | 4.19 Sec | 4.57 Sec - 2.17 Sec (convert time) | 0.67 Sec | 0.69 Sec |
  | Context-Switches | 31,737 | 32,468 | 40,463 | 30,699 |
  | Migrations | 506 | 662 | 4,920 | 3,265 |
  | Instructions | 48,890,013,173 | 47,963,367,976 | 49,111,783,565 | 48,113,904,685 |
  | IPC | 1.57 | 1.42 | 1.40 | 1.37 |
  | Branches | 9,201,175,036 | 9,124,545,231 | 9,248,803,634 | 9,154,186,301 |
  | Branches-Miss % | 0.90% | 1.02% | 0.91% | 1.02% |

#### Implementation
Doris currently has a corresponding ```VectorizedRowBatch``` implementation, so we can gradually complete the optimization for each exec node.

1. Starting from ```olap_scan_node```, run vectorized query tests and observe whether the expected performance improvement appears.
2. Each ```exec_node``` needs to implement a method that converts a ```VectorizedRowBatch``` to a ```RowBatch```, retaining compatibility with the original execution logic.
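The compatibility method in step 2 might look roughly like this sketch. The types here are simplified, hypothetical stand-ins (the real `VectorizedRowBatch`/`RowBatch` carry tuple descriptors and memory pools); the point is the gather step that materializes ordinary rows from column arrays so downstream nodes keep working unchanged:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for Doris's batch types.
struct VecBatch {
    std::vector<int64_t> col_a;       // one contiguous array per column
    std::vector<int64_t> col_b;
    std::vector<uint32_t> selected;   // row ids that survived predicates
};

struct Row {
    int64_t a;
    int64_t b;
};
using RowBatch = std::vector<Row>;

// Gather the selected rows out of the column arrays and materialize
// ordinary rows, preserving compatibility with row-based exec nodes.
inline RowBatch convert_vec_batch_to_row_batch(const VecBatch& vb) {
    RowBatch out;
    out.reserve(vb.selected.size());
    for (uint32_t row_id : vb.selected) {
        out.push_back(Row{vb.col_a[row_id], vb.col_b[row_id]});
    }
    return out;
}
```

Because this conversion only runs at the boundary between vectorized and row-based nodes, each exec node can be migrated independently.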
kangkaisen commented 4 years ago

@HappenLee I don't see the performance improvement in your POC?

What's your detailed design?

Do you plan to still use RowBlockV2 and ColumnBlock?

Do you plan to still convert RowBlockV2 to RowBlock?

Why do we still use VectorizedRowBatch?

imay commented 4 years ago

@HappenLee Vectorization is a good way to accelerate execution. However, in my view, this will change the whole execution framework. So before the real code work, we should discuss the work in more detail. Could you please explain your implementation more clearly?

HappenLee commented 4 years ago

About the POC

At present, only the aggregate node has been verified.

Counting the time of aggregation, we find that after deducting the time of row-to-column conversion, the time is cut in half:

| Statistic | Origin | Row convert to column |
|:-:|:-:|:-:|
| Time | 4.19 Sec | 2.47 Sec |

Implementation and Design Detail

There are three important parts: rewriting of the scan node, rewriting of expression calculation, and rewriting of the aggregate node.

Rewrite the scanner section first:

We can make a judgment when planning the query: if the query is on dup_key and only has simple expressions, the vectorization interface will be called. The pseudo code is as follows:

```cpp
// Read a batch in columnar form, evaluate predicates over whole
// columns to get the surviving row ids, then convert back to a
// RowBatch for the row-based nodes downstream.
VectorizedBatch vbatch = _reader->next_batch_with_aggregation();
std::vector<RowId> selected_row_ids = Expr::eval(vbatch);
convertVecBatchToRowBatch(vbatch, selected_row_ids, output_rowbatch);
```

If there are other expressions that are not yet supported, continue to use the old interface. The inputs of the new and old interfaces differ (one takes rows, the other takes a vbatch), but the output is the same (both produce the query layer's original rowbatch).
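The `Expr::eval(vbatch)` step in the pseudo code above could be sketched as follows for a single comparison predicate (a hypothetical illustration, not Doris's expression framework): instead of evaluating the expression row by row, scan one column and emit a selection vector of surviving row ids, which the conversion step then uses to gather rows.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical columnar predicate: `col > threshold`. Produces a
// selection vector of row ids rather than filtering rows in place,
// so later columns and operators can reuse the same ids.
inline std::vector<uint32_t> eval_greater_than(
        const std::vector<int64_t>& col, int64_t threshold) {
    std::vector<uint32_t> selected;
    selected.reserve(col.size());
    for (size_t i = 0; i < col.size(); ++i) {
        if (col[i] > threshold) {
            selected.push_back(static_cast<uint32_t>(i));
        }
    }
    return selected;
}
```

A conjunction of simple predicates can then be evaluated by running each predicate only over the rows already in the selection vector, shrinking the work at every step.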

Rewrite the scannode section second:

So far, the changes to the scan part and the expr part are basically complete, and we can now rewrite the aggregate node. After rewriting ```olap_scan_node```:

Rewrite the aggregate node section finally:

After rewriting the ```Aggregation_node```:

Later, we can gradually extend the vectorization calculation to other execution nodes. It's going to be a long-term plan.

Relationship between ```RowBlock```, ```RowBlockV2```, and ```ColumnBlock```:

@kangkaisen