Array column should support hash function

kangkaisen commented 3 years ago

Steps to reproduce the behavior (Required)

create table t0(c0 INT, c1 array<varchar(65533)>) duplicate key(c0) distributed by hash(c0) buckets 10 properties('replication_num'='1')
insert into t0 values(0, null),(1, ['x']),(1,['x']),(2, ['y']),(2, ['y'])
select count(*), count(c1), count(distinct c1) from t0

+-----------------------------------------------------------------------------------------+
| Explain String                                                                          |
+-----------------------------------------------------------------------------------------+
| WORK ON CBO OPTIMIZER                                                                   |
| PLAN FRAGMENT 0                                                                         |
|  OUTPUT EXPRS:3: count() | 4: count(2: c1) | 5: count(distinct 2: c1)                   |
|   PARTITION: UNPARTITIONED                                                              |
|                                                                                         |
|   RESULT SINK                                                                           |
|                                                                                         |
|   6:AGGREGATE (merge finalize)                                                          |
|   |  output: count(3: count()), count(4: count(2: c1)), count(5: count(distinct 2: c1)) |
|   |  group by:                                                                          |
|   |  use vectorized: true                                                               |
|   |                                                                                     |
|   5:EXCHANGE                                                                            |
|      use vectorized: true                                                               |
|                                                                                         |
| PLAN FRAGMENT 1                                                                         |
|  OUTPUT EXPRS:                                                                          |
|   PARTITION: HASH_PARTITIONED: 2: c1                                                    |
|                                                                                         |
|   STREAM DATA SINK                                                                      |
|     EXCHANGE ID: 05                                                                     |
|     UNPARTITIONED                                                                       |
|                                                                                         |
|   4:AGGREGATE (update serialize)                                                        |
|   |  output: count(3: count()), count(4: count(2: c1)), count(2: c1)                    |
|   |  group by:                                                                          |
|   |  use vectorized: true                                                               |
|   |                                                                                     |
|   3:AGGREGATE (merge serialize)                                                         |
|   |  output: count(3: count()), count(4: count(2: c1))                                  |
|   |  group by: 2: c1                                                                    |
|   |  use vectorized: true                                                               |
|   |                                                                                     |
|   2:EXCHANGE                                                                            |
|      use vectorized: true                                                               |
|                                                                                         |
| PLAN FRAGMENT 2                                                                         |
|  OUTPUT EXPRS:                                                                          |
|   PARTITION: RANDOM                                                                     |
|                                                                                         |
|   STREAM DATA SINK                                                                      |
|     EXCHANGE ID: 02                                                                     |
|     HASH_PARTITIONED: 2: c1                                                             |
|                                                                                         |
|   1:AGGREGATE (update serialize)                                                        |
|   |  STREAMING                                                                          |
|   |  output: count(*), count(2: c1)                                                     |
|   |  group by: 2: c1                                                                    |
|   |  use vectorized: true                                                               |
|   |                                                                                     |
|   0:OlapScanNode                                                                        |
|      TABLE: t0                                                                          |
|      PREAGGREGATION: ON                                                                 |
|      partitions=1/1                                                                     |
|      rollup: t0                                                                         |
|      tabletRatio=1/1                                                                    |
|      tabletList=169413                                                                  |
|      cardinality=5                                                                      |
|      avgRowSize=1.0                                                                     |
|      numNodes=0                                                                         |
|      use vectorized: true                                                               |
+-----------------------------------------------------------------------------------------+
61 rows in set (0.02 sec)

stdpain commented 3 years ago

For the hash functions, we need to pay attention to the following points

the hash function of the data distribution and the hash function of the aggregation should not be the same, we need to reduce the conflict rate as much as possible we need to consider compatibility issues and cross-platform issues. For example, whether the hash function is generic enough to maintain compatibility with other languages, such as java for the data distribution of the hash function, crc is sufficient, in addition to the need for shuffle and aggregation of the hash function

stdpain commented 3 years ago

Here was the performace between fnv32 and murmurhash32

but fnv32 quality may not good at murmurhash32

https://quick-bench.com/q/McXn9e7Eg9S7o7cnj8SeGTKfiJ4

StarRocks / starrocks

Array column should support hash function #298

Steps to reproduce the behavior (Required)