StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0
9.01k stars 1.81k forks source link

Array column should support hash function #298

Closed kangkaisen closed 3 years ago

kangkaisen commented 3 years ago

Steps to reproduce the behavior (Required)

create table t0(c0 INT, c1 array<varchar(65533)>) duplicate key(c0) distributed by hash(c0) buckets 10 properties('replication_num'='1')
insert into t0 values(0, null),(1, ['x']),(1,['x']),(2, ['y']),(2, ['y'])
select count(*), count(c1), count(distinct c1) from t0
+-----------------------------------------------------------------------------------------+
| Explain String                                                                          |
+-----------------------------------------------------------------------------------------+
| WORK ON CBO OPTIMIZER                                                                   |
| PLAN FRAGMENT 0                                                                         |
|  OUTPUT EXPRS:3: count() | 4: count(2: c1) | 5: count(distinct 2: c1)                   |
|   PARTITION: UNPARTITIONED                                                              |
|                                                                                         |
|   RESULT SINK                                                                           |
|                                                                                         |
|   6:AGGREGATE (merge finalize)                                                          |
|   |  output: count(3: count()), count(4: count(2: c1)), count(5: count(distinct 2: c1)) |
|   |  group by:                                                                          |
|   |  use vectorized: true                                                               |
|   |                                                                                     |
|   5:EXCHANGE                                                                            |
|      use vectorized: true                                                               |
|                                                                                         |
| PLAN FRAGMENT 1                                                                         |
|  OUTPUT EXPRS:                                                                          |
|   PARTITION: HASH_PARTITIONED: 2: c1                                                    |
|                                                                                         |
|   STREAM DATA SINK                                                                      |
|     EXCHANGE ID: 05                                                                     |
|     UNPARTITIONED                                                                       |
|                                                                                         |
|   4:AGGREGATE (update serialize)                                                        |
|   |  output: count(3: count()), count(4: count(2: c1)), count(2: c1)                    |
|   |  group by:                                                                          |
|   |  use vectorized: true                                                               |
|   |                                                                                     |
|   3:AGGREGATE (merge serialize)                                                         |
|   |  output: count(3: count()), count(4: count(2: c1))                                  |
|   |  group by: 2: c1                                                                    |
|   |  use vectorized: true                                                               |
|   |                                                                                     |
|   2:EXCHANGE                                                                            |
|      use vectorized: true                                                               |
|                                                                                         |
| PLAN FRAGMENT 2                                                                         |
|  OUTPUT EXPRS:                                                                          |
|   PARTITION: RANDOM                                                                     |
|                                                                                         |
|   STREAM DATA SINK                                                                      |
|     EXCHANGE ID: 02                                                                     |
|     HASH_PARTITIONED: 2: c1                                                             |
|                                                                                         |
|   1:AGGREGATE (update serialize)                                                        |
|   |  STREAMING                                                                          |
|   |  output: count(*), count(2: c1)                                                     |
|   |  group by: 2: c1                                                                    |
|   |  use vectorized: true                                                               |
|   |                                                                                     |
|   0:OlapScanNode                                                                        |
|      TABLE: t0                                                                          |
|      PREAGGREGATION: ON                                                                 |
|      partitions=1/1                                                                     |
|      rollup: t0                                                                         |
|      tabletRatio=1/1                                                                    |
|      tabletList=169413                                                                  |
|      cardinality=5                                                                      |
|      avgRowSize=1.0                                                                     |
|      numNodes=0                                                                         |
|      use vectorized: true                                                               |
+-----------------------------------------------------------------------------------------+
61 rows in set (0.02 sec)
stdpain commented 3 years ago

For the hash functions, we need to pay attention to the following points

the hash function of the data distribution and the hash function of the aggregation should not be the same, we need to reduce the conflict rate as much as possible we need to consider compatibility issues and cross-platform issues. For example, whether the hash function is generic enough to maintain compatibility with other languages, such as java for the data distribution of the hash function, crc is sufficient, in addition to the need for shuffle and aggregation of the hash function

stdpain commented 3 years ago

Here was the performace between fnv32 and murmurhash32

but fnv32 quality may not good at murmurhash32

https://quick-bench.com/q/McXn9e7Eg9S7o7cnj8SeGTKfiJ4