The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
The fnv_hash() method of ArrayColumn is slow when there are NULLs in the array.
What I'm doing:
Change fnv_hash() method of NullableColumn to not copy null_column data.
Fixes #issue
fnv_hash() method of ArrayColumn is very slow when there are NULLs in the array.
SQLs to reproduce the issue
Query 1 is very fast (the input arrays do not contain NULLs)
with input as (select array_generate(1000000) as arr union all select ARRAY_MAP(x -> CASE WHEN x % 2 = 1 THEN 2 ELSE x END, array_generate(1000000)) as arr) select count(distinct arr) from input;
Query 2 is very slow (one of the input arrays contains NULLs)
with input as (select array_generate(1000000) as arr union all select ARRAY_MAP(x -> CASE WHEN x % 2 = 1 THEN NULL ELSE x END, array_generate(1000000)) as arr) select count(distinct arr) from input;
The original implementation copies the null_column data in fnv_hash() of NullableColumn and fnv_hash() is called per element in the array.
Benchmark
Original Implementation: Query 1 (0.63 s), Query 2 (32.93 s).
New Implementation: Query 1 (0.67 s), Query 2 (0.60 s).
What type of PR is this:
[x] BugFix
[ ] Feature
[ ] Enhancement
[ ] Refactor
[ ] UT
[ ] Doc
[ ] Tool
Does this PR entail a change in behavior?
[ ] Yes, this PR will result in a change in behavior.
[x] No, this PR will not result in a change in behavior.
If yes, please specify the type of change:
[ ] Interface/UI changes: syntax, type conversion, expression evaluation, display information
[ ] Parameter changes: default values, similar parameters but with different default values
[ ] Policy changes: use new policy to replace old one, functionality automatically enabled
[ ] Feature removed
[ ] Miscellaneous: upgrade & downgrade compatibility, etc.
Checklist:
[ ] I have added test cases for my bug fix or my new feature
[ ] This pr needs user documentation (for new or modified features or behaviors)
[ ] I have added documentation for my new feature or new function
[x] This is a backport pr
Bugfix cherry-pick branch check:
[x] I have checked the version labels which the pr will be auto-backported to the target branch
[x] 3.3
[x] 3.2
[x] 3.1
[x] 3.0
[x] 2.5
This is an automatic backport of pull request #52885 done by Mergify.
Why I'm doing:
The fnv_hash() method of ArrayColumn is slow when there are NULLs in the array.
What I'm doing:
Change fnv_hash() method of NullableColumn to not copy null_column data.
Fixes #issue
fnv_hash() method of ArrayColumn is very slow when there are NULLs in the array.
SQLs to reproduce the issue
Query 1 is very fast (the input arrays do not contain NULLs)
with input as (select array_generate(1000000) as arr union all select ARRAY_MAP(x -> CASE WHEN x % 2 = 1 THEN 2 ELSE x END, array_generate(1000000)) as arr) select count(distinct arr) from input;
Query 2 is very slow (one of the input arrays contains NULLs)
with input as (select array_generate(1000000) as arr union all select ARRAY_MAP(x -> CASE WHEN x % 2 = 1 THEN NULL ELSE x END, array_generate(1000000)) as arr) select count(distinct arr) from input;
The original implementation copies the null_column data in fnv_hash() of NullableColumn and fnv_hash() is called per element in the array.
Benchmark
Original Implementation: Query 1 (0.63 s), Query 2 (32.93 s).
New Implementation: Query 1 (0.67 s), Query 2 (0.60 s).
What type of PR is this:
[x] BugFix
[ ] Feature
[ ] Enhancement
[ ] Refactor
[ ] UT
[ ] Doc
[ ] Tool
Does this PR entail a change in behavior?
[ ] Yes, this PR will result in a change in behavior.
[x] No, this PR will not result in a change in behavior.
If yes, please specify the type of change:
[ ] Interface/UI changes: syntax, type conversion, expression evaluation, display information
[ ] Parameter changes: default values, similar parameters but with different default values
[ ] Policy changes: use new policy to replace old one, functionality automatically enabled
[ ] Feature removed
[ ] Miscellaneous: upgrade & downgrade compatibility, etc.
Checklist:
[ ] I have added test cases for my bug fix or my new feature
[ ] This pr needs user documentation (for new or modified features or behaviors)
[ ] I have added documentation for my new feature or new function
Why I'm doing:
The fnv_hash() method of ArrayColumn is slow when there are NULLs in the array.
What I'm doing:
Change fnv_hash() method of NullableColumn to not copy null_column data.
Fixes #issue
fnv_hash() method of ArrayColumn is very slow when there are NULLs in the array.
SQLs to reproduce the issue
Query 1 is very fast (the input arrays do not contain NULLs)
Query 2 is very slow (one of the input arrays contains NULLs)
The original implementation copies the null_column data in fnv_hash() of NullableColumn and fnv_hash() is called per element in the array.
Benchmark
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check:
This is an automatic backport of pull request #52885 done by Mergify.
Why I'm doing:
The fnv_hash() method of ArrayColumn is slow when there are NULLs in the array.
What I'm doing:
Change fnv_hash() method of NullableColumn to not copy null_column data.
Fixes #issue
fnv_hash() method of ArrayColumn is very slow when there are NULLs in the array.
SQLs to reproduce the issue
Query 1 is very fast (the input arrays do not contain NULLs)
Query 2 is very slow (one of the input arrays contains NULLs)
The original implementation copies the null_column data in fnv_hash() of NullableColumn and fnv_hash() is called per element in the array.
Benchmark
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist: