StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0
9.04k stars 1.82k forks source link

[BugFix] Do not copy null_column data in fnv_hash() and crc32_hash method of NullableColumn (backport #52885) #53050

Closed mergify[bot] closed 2 days ago

mergify[bot] commented 2 days ago

Why I'm doing:

The fnv_hash() method of ArrayColumn is slow when there are NULLs in the array.

What I'm doing:

Change fnv_hash() method of NullableColumn to not copy null_column data.

Fixes #issue

fnv_hash() method of ArrayColumn is very slow when there are NULLs in the array.

SQLs to reproduce the issue

Query 1 is very fast (the input arrays do not contain NULLs)

with input as (select array_generate(1000000) as arr union all select ARRAY_MAP(x -> CASE WHEN x % 2 = 1 THEN 2 ELSE x END, array_generate(1000000)) as arr) select count(distinct arr) from input;

Query 2 is very slow (one of the input arrays contains NULLs)

with input as (select array_generate(1000000) as arr union all select ARRAY_MAP(x -> CASE WHEN x % 2 = 1 THEN NULL ELSE x END, array_generate(1000000)) as arr) select count(distinct arr) from input;

The original implementation copies the null_column data in fnv_hash() of NullableColumn and fnv_hash() is called per element in the array.

Benchmark

  1. Original Implementation: Query 1 (0.63 s), Query 2 (32.93 s).
  2. New Implementation: Query 1 (0.67 s), Query 2 (0.60 s).

What type of PR is this:

Does this PR entail a change in behavior?

If yes, please specify the type of change:

Checklist:

Bugfix cherry-pick branch check:

What I'm doing:

Change fnv_hash() method of NullableColumn to not copy null_column data.

Fixes #issue

fnv_hash() method of ArrayColumn is very slow when there are NULLs in the array.

SQLs to reproduce the issue

Query 1 is very fast (the input arrays do not contain NULLs)

with input as (select array_generate(1000000) as arr union all select ARRAY_MAP(x -> CASE WHEN x % 2 = 1 THEN 2 ELSE x END, array_generate(1000000)) as arr) select count(distinct arr) from input;

Query 2 is very slow (one of the input arrays contains NULLs)

with input as (select array_generate(1000000) as arr union all select ARRAY_MAP(x -> CASE WHEN x % 2 = 1 THEN NULL ELSE x END, array_generate(1000000)) as arr) select count(distinct arr) from input;

The original implementation copies the null_column data in fnv_hash() of NullableColumn and fnv_hash() is called per element in the array.

Benchmark

  1. Original Implementation: Query 1 (0.63 s), Query 2 (32.93 s).
  2. New Implementation: Query 1 (0.67 s), Query 2 (0.60 s).

What type of PR is this:

Does this PR entail a change in behavior?

If yes, please specify the type of change:

Checklist: