facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.52k stars 1.15k forks source link

min and max aggregate functions should reject complex type inputs with nulls #6314

Open mbasmanova opened 1 year ago

mbasmanova commented 1 year ago

Bug description

In Presto, min and max check for null array elements, map values and struct fields and throw, but in Velox they don't.

Found after extending AggregationFuzzer to test with complex type inputs: #6312

CC: @laithsakka @bikramSingh91

System information

n/a

Relevant logs

No response

duanmeng commented 1 year ago

@mbasmanova Can I take this issue?

mbasmanova commented 1 year ago

@duanmeng Thank you for offering to help with this. Please, feel free to work on it. Here is a draft of a fix I have: https://github.com/facebookincubator/velox/compare/main...mbasmanova:fix-min-max?expand=1

I'm thinking that we can change SingleValueAccumulator::compare to fail if there is a null inside a row, array or map. To do that I introduced ContainerRowSerde::compareWithNulls which returns unset std::optional if there a null. This is similar to BaseVector::compare.

Would appreciate if you could take over. Thanks.

duanmeng commented 1 year ago

@duanmeng Thank you for offering to help with this. Please, feel free to work on it. Here is a draft of a fix I have: https://github.com/facebookincubator/velox/compare/main...mbasmanova:fix-min-max?expand=1

I'm thinking that we can change SingleValueAccumulator::compare to fail if there is a null inside a row, array or map. To do that I introduced ContainerRowSerde::compareWithNulls which returns unset std::optional if there a null. This is similar to BaseVector::compare.

Would appreciate if you could take over. Thanks.

@mbasmanova Got it, thanks for your guidance, I will start to investigate and work on it ASAP.

mbasmanova commented 1 year ago

@duanmeng Thanks. Very much appreciate the help.

mbasmanova commented 1 year ago

This issue affects min_by and max_by functions with complex types for the "compare" parameter. These functions currently do not allow complex types for "compare", but they should. Once we enable complex types for "compare" we'll need to make sure these functions fail if complex values contain nulls inside. Having ContainerRowSerde::compareWithNulls which returns unset std::optional if there a null will be helpful.

mbasmanova commented 1 year ago

Same applies to multimap_agg function.

mbasmanova commented 1 year ago

Same applies to set_agg function.

duanmeng commented 1 year ago

Split this fix into 3 tasks to make the PRs simple, cohesive and comfortable to review, cc @mbasmanova

duanmeng commented 1 year ago

Same applies to set_agg function.

@mbasmanova I reproduced the issue you mentioned,

presto> SELECT set_agg(col1) FROM (VALUES (1, array [1, 2, 3]), (2, array [null, 3, 1]), (2, array [null, 3, 1])) AS tbl(col0, col1);
Query 20230918_072245_00001_fggjk failed: ARRAY comparison not supported for arrays with null elements

These should be related to AddressableNonNullValueList::equalTo, which uses ContainerRowSerde::compare(ByteStream& left, ByteStream& right, ...). Maybe we need to make this API support StopAtNull in case of nest nulls too.

Same applies to multimap_agg function.

This seems to only happen when the key type is a complex type, but can multimap_agg support complex type key?

presto> SELECT multimap_agg(col0, col1) FROM (VALUES (1, array [1, 2, 3]), (2, array [null, 3, 1]), (2, array [null, 3, 1])) AS tbl(col0, col1);
                      _col0
-------------------------------------------------
 {1=[[1, 2, 3]], 2=[[null, 3, 1], [null, 3, 1]]}
(1 row)

presto> SELECT multimap_agg(col0, col1) FROM (VALUES (array [1,2], array [1, 2, 3]), (array [2, 3], array [null, 3, 1]), (array[2, null], array [null, 3, 1])) AS tbl(col0, col1);
Query 20230918_075359_00022_fggjk failed: ARRAY comparison not supported for arrays with null elements

presto> SELECT multimap_agg(col0, col1) FROM (VALUES (array [1,2], array [1, 2, 3]), (array [2, 3], array [null, 3, 1]), (array[2, 1], array [null, 3, 1])) AS tbl(col0, col1);
Query is gone (server restarted?)
mbasmanova commented 1 year ago

@duanmeng

These should be related to AddressableNonNullValueList::equalTo, which uses ContainerRowSerde::compare(ByteStream& left, ByteStream& right, ...). Maybe we need to make this API support StopAtNull in case of nest nulls too.

Makes sense.

can multimap_agg support complex type key?

Yes.

mbasmanova commented 1 year ago

@duanmeng I'm seeing fuzzer failures on PRs:

https://app.circleci.com/pipelines/github/facebookincubator/velox/33757/workflows/40bbeaa0-704f-422b-b8e0-2510435412c3/jobs/219292

Reason: MAP comparison not supported for values that contain nulls

Would you help take a look?

duanmeng commented 1 year ago

https://app.circleci.com/pipelines/github/facebookincubator/velox/33757/workflows/40bbeaa0-704f-422b-b8e0-2510435412c3/jobs/219292

Reason: MAP comparison not supported for values that contain nulls

Would you help take a look?

@mbasmanova I've reproduced it locally. This should be caused by the input complex type vector generated by fuzzer that contains nest nulls. Can we fix this by adding some kind of constraint to the fuzzer?

image
mbasmanova commented 1 year ago

Let's understand this a bit more first. Why is the fuzzer failing? It should not fail as long as all equivalent query plans succeed or all fail. Are we seeing some plans succeed, but some fail? Is this expected?

mbasmanova commented 1 year ago

See AggregationFuzzer::testPlan

duanmeng commented 1 year ago

See AggregationFuzzer::testPlan

@mbasmanova Ok got it, let me investigate it more thoroughly.

duanmeng commented 1 year ago

@mbasmanova Hi Masha, I may find the root cause of the fuzzer failures according to your suggestion. In short the plan with the shape of Partial -> local exchange -> final aggregation may trigger the nested nulls compare and throws.

The fuzzer failed in iteration 27 on testing plan #3, but testing #[0, 1, 2] all succeeded in this iteration. In theory, they should all be successful since their input splits were the same and applied min to the same column a0. But the strange thing is that only testing plan #3 failed.

This is because testing plan #3 has a key difference from others, which is partitioning the input values vectors and applying min on them. It may compare with nested nulls inside a partition vector and return std::nullopt triggering the exception.

For example, the following SQL would succeed by applying directly to all inputs.

presto> select min(col0) from (values (array [1, 2]), (array [2, 3]), (array [3, null]), (array [3, 4])) as tbl(col0);
 _col0
--------
 [1, 2]
(1 row)

But it will fail if I partition them into two aggregates and merge them like testing plan #3

presto> select min(col0) from (values (array [1, 2]), (array [2, 3])) as tbl(col0);
 _col0
--------
 [1, 2]
(1 row)

presto> select min(col0) from (values (array [3, null]), (array [3, 4])) as tbl(col0);
Query 20230919_044628_00012_6yp5g failed: ARRAY comparison not supported for arrays with null elements

Details

testing plan in iteration 27

I20230918 18:58:23.271420 206026 AggregationFuzzer.cpp:697] ==============================> Started iteration 27 (seed: 253123116)

I20230918 18:58:23.565044 206026 AggregationFuzzer.cpp:246] Testing plan #0
I20230918 18:58:23.565090 206026 AggregationFuzzer.cpp:802] Executing query plan: 
-- Aggregation[SINGLE a0 := min(ROW["c0"])] -> a0:MAP<VARCHAR,INTEGER>
  -- Values[1000 rows in 10 vectors] -> c0:MAP<VARCHAR,INTEGER>

I20230918 18:58:23.577270 206026 AggregationFuzzer.cpp:246] Testing plan #1
I20230918 18:58:23.577284 206026 AggregationFuzzer.cpp:802] Executing query plan: 
-- Aggregation[FINAL a0 := min("a0")] -> a0:MAP<VARCHAR,INTEGER>
  -- Aggregation[PARTIAL a0 := min(ROW["c0"])] -> a0:MAP<VARCHAR,INTEGER>
    -- Values[1000 rows in 10 vectors] -> c0:MAP<VARCHAR,INTEGER>

I20230918 18:58:23.595597 206026 AggregationFuzzer.cpp:254] Testing plan #2 with spilling
I20230918 18:58:23.595623 206026 AggregationFuzzer.cpp:802] Executing query plan: 
-- Aggregation[FINAL a0 := min("a0")] -> a0:MAP<VARCHAR,INTEGER>
  -- Aggregation[INTERMEDIATE a0 := min("a0")] -> a0:MAP<VARCHAR,INTEGER>
    -- Aggregation[PARTIAL a0 := min(ROW["c0"])] -> a0:MAP<VARCHAR,INTEGER>
      -- Values[1000 rows in 10 vectors] -> c0:MAP<VARCHAR,INTEGER>

I20230918 18:58:23.602247 206026 AggregationFuzzer.cpp:246] Testing plan #3
I20230918 18:58:23.602262 206026 AggregationFuzzer.cpp:802] Executing query plan: 
-- Aggregation[FINAL a0 := min("a0")] -> a0:MAP<VARCHAR,INTEGER>
  -- LocalPartition[GATHER] -> a0:MAP<VARCHAR,INTEGER>
    -- Aggregation[PARTIAL a0 := min(ROW["c0"])] -> a0:MAP<VARCHAR,INTEGER>
      -- Values[300 rows in 3 vectors] -> c0:MAP<VARCHAR,INTEGER>
    -- Aggregation[PARTIAL a0 := min(ROW["c0"])] -> a0:MAP<VARCHAR,INTEGER>
      -- Values[300 rows in 3 vectors] -> c0:MAP<VARCHAR,INTEGER>
    -- Aggregation[PARTIAL a0 := min(ROW["c0"])] -> a0:MAP<VARCHAR,INTEGER>
      -- Values[200 rows in 2 vectors] -> c0:MAP<VARCHAR,INTEGER>
    -- Aggregation[PARTIAL a0 := min(ROW["c0"])] -> a0:MAP<VARCHAR,INTEGER>
      -- Values[200 rows in 2 vectors] -> c0:MAP<VARCHAR,INTEGER>
mbasmanova commented 1 year ago

@duanmeng Thank you for investigating and explaining your findings. I believe I understand what's going on.

Processing a single stream of values [1, 2], [2, 3], [3, null], [3, 4] goes like this:

But processing 2 streams: ([1, 2], [2, 3]) and ([3, null], [3, 4]) goes differently. When processing 2nd stream:

This behavior is problematic because when a query runs on a cluster it can go either way and therefore same query may either succeed or fail. Failures are intermittent and non-deterministic.

It seems to be this is an existing problem in Presto as well.

I think we need to make sure that query either always fails or always succeeds.

To make the query always fail, we would add a check that all inputs to min/max (and other similar functions) do not contain nested nulls.

I don't see a way to make the query always succeed as this would require to delay processing of inputs with nested nulls until FINAL aggregation stage and increase shuffle size. E.g. we could modify the intermediate results to contain min of values without nested nulls + a list of values with nested nulls. Then final aggregation would first process values without nested nulls, then process values with nested nulls to see if they can be ruled out without needing to compare the nulls.

Let's file an issue in prestodb repo and fix Velox to always fail if input contains values with nested nulls.

duanmeng commented 1 year ago

Let's file an issue in prestodb repo and fix Velox to always fail if input contains values with nested nulls.

@mbasmanova Yes, always fails if the input contains values with nested nulls in StopAtNull mode is more semantically clear. I will draft a PR to remove the early return logic in ContainerRowSerder's complex type compare methods, and check netsted nulls on both sides.

mbasmanova commented 1 year ago

@duanmeng I'm thinking that it will be faster to check input vectors for nested nulls before serializing them to the accumulator. There is BaseVector::containsNullAt(index) method. One can loop over input vector and call this method. This is slow though. Using DecodedVector will be faster. There are examples of creating hierarchy of DecodedVectors to match the hierarchy of complex type vector in velox/functions/prestosql/aggregates/PrestoHasher.h and velox/row/CompactRow.h.

CC: @laithsakka

duanmeng commented 1 year ago

@duanmeng I'm thinking that it will be faster to check input vectors for nested nulls before serializing them to the accumulator. There is BaseVector::containsNullAt(index) method. One can loop over input vector and call this method. This is slow though. Using DecodedVector will be faster. There are examples of creating hierarchy of DecodedVectors to match the hierarchy of complex type vector in velox/functions/prestosql/aggregates/PrestoHasher.h and velox/row/CompactRow.h.

@mbasmanova OK got it, thanks.

mbasmanova commented 1 year ago

@duanmeng Thank you for helping fix this issue for min/max/set_agg. I believe there are a few functions that still need fixing:

Essentially, every scalar or aggregate function that compares complex types needs to check if there are nested nulls and fail.

CC: @aditi-pandit

duanmeng commented 1 year ago

@duanmeng Thank you for helping fix this issue for min/max/set_agg. I believe there are a few functions that still need fixing:

@mbasmanova I see, let me update the task list in https://github.com/facebookincubator/velox/issues/6314#issuecomment-1713553081

duanmeng commented 1 year ago

@mbasmanova I make a new checklist here to be convenient to track,

PS: I've checked set_union when I fix set_agg, the Presto seems to support set_union with nested nulls.

mbasmanova commented 1 year ago

@duanmeng Thanks.

PS: I've checked set_union when I fix set_agg, the Presto seems to support set_union with nested nulls.

Hmm... here is what I see:

presto:di> select set_union(array[array[x]]) from unnest(array[null, 1, null]) as t(x);

Query 20231005_122121_10284_ve9i5, FAILED, 1 node
Splits: 6 total, 0 done (0.00%)
112ms [0 rows, 0B] [0 rows/s, 0B/s]

Query 20231005_122121_10284_ve9i5 failed: ARRAY comparison not supported for arrays with null elements
duanmeng commented 1 year ago

@mbasmanova I reproduced it too, with an array of array and null compares.

presto> SELECT set_union(col0) FROM (VALUES (array[array [1]]), (array[array[null]]), (array[array[1]])) AS tbl(col0);
     _col0
---------------
 [[1], [null]]
(1 row)
presto> SELECT set_union(col0) FROM (VALUES (array[array [1]]), (array[array[null]]), (array[array[null]])) AS tbl(col0);
Query 20231005_125107_00053_b2wrq failed: ARRAY comparison not supported for arrays with null elements
presto> SELECT set_union(col0) FROM (VALUES (array[array [1]]), (array[array[null]]), (array[array[1, null]])) AS tbl(col0);
          _col0
--------------------------
 [[1], [null], [1, null]]
(1 row)
presto> SELECT set_union(col0) FROM (VALUES (array[array [1]]), (array[array[1, null]]), (array[array[1, null]])) AS tbl(col0);
Query 20231005_125140_00055_b2wrq failed: ARRAY comparison not supported for arrays with null elements
ashokku2022 commented 1 year ago

@mbasmanova thank you for providing us the opportunity to work on some of these issues, highly appreciate it. @duanmeng After talking with Aditi, we are planning to pick the following 4 to assist and if needed we can take more. Please let me know. @Krishna-Prasad-P-V as discussed, please start looking into these and communicate with @duanmeng to avoid overlap on the functions being worked between you two.

 array_duplicates, array_has_duplicates
 array_intersect, arrays_overlap
 array_remove, contains, array_position
 array_max, array_min
Krishna-Prasad-P-V commented 1 year ago

@duanmeng I started working on array_min. As I understand the below given is the current working with Velox.

presto:tpch> select array_min(col0) from (values (array [1, 2]), (array [2, 3]), (array [3, null]), (array [3, 4])) as tbl(col0);
 _col0 
-------
     1 
     2 
 NULL  
     3 
(4 rows)

We should avoid NULLs within complex type (ROW, ARRAY and MAP) and throw error for these functions. This will be the same case with nested-NULLs as well. Could you please confirm if my understanding is right?

duanmeng commented 1 year ago

We should avoid NULLs within complex type (ROW, ARRAY and MAP) and throw error for these functions. This will be the same case with nested-NULLs as well. Could you please confirm if my understanding is right?

@Krishna-Prasad-P-V Yes, we should check the nested nulls in the ComplexType. For array_min, I believe we should do the checking when the array elements type is a ComplexType. The following example maybe more appropriate,

presto> select array_min(col0) from (values (array [array[1, 2], array[1, null]])) as tbl(col0);
Query 20231009_073123_00002_xv6np failed: ARRAY comparison not supported for arrays with null elements
Krishna-Prasad-P-V commented 1 year ago
 select array_min(col0) from (values (array [array[1, 2], array[1, null]])) as tbl(col0);

@duanmeng Thanks for your reply. Currently, we don't support array[array[]] signature in Velox. Should this be enabled as well?

presto:tpch> select array_min(col0) from (values (array [array[1, 2], array[1, null]])) as tbl(col0);
Query 20231009_105518_00042_n39ws failed:  Scalar function presto.default.array_min not registered with arguments: (ARRAY<ARRAY<INTEGER>>). Found function registered with the following signatures:
((array(date)) -> date)
((array(varchar)) -> varchar)
((array(double)) -> double)
((array(real)) -> real)
((array(smallint)) -> smallint)
((array(integer)) -> integer)
((array(bigint)) -> bigint)
((array(boolean)) -> boolean)
((array(hugeint)) -> hugeint)
((array(timestamp)) -> timestamp)
((array(tinyint)) -> tinyint)
duanmeng commented 1 year ago

Currently, we don't support array[array[]] signature in Velox. Should this be enabled as well?

@Krishna-Prasad-P-V It appears that Velox does not implement the complex type version of the ArrayFunctions. cc @mbasmanova

Krishna-Prasad-P-V commented 1 year ago

@duanmeng - Can I look into multimap_agg function as we have blockers in ArrayFunctions and waiting for a reply from Masha ? Let me know. Thanks.

duanmeng commented 1 year ago

@duanmeng - Can I look into multimap_agg function as we have blockers in ArrayFunctions and waiting for a reply from Masha ? Let me know. Thanks.

@Krishna-Prasad-P-V Sure, thanks.

pramodsatya commented 1 year ago

Hi @duanmeng, can I work on the min_by, max_by functions?

duanmeng commented 1 year ago

@pramodsatya I am working on it (set_agg, min/max_by, map_agg, array_sort), how about map_union and map_union_sum?

pramodsatya commented 1 year ago

Got it, sure I will work on map_union and map_union_sum. Thanks!

pramodsatya commented 1 year ago

For map_union and map_union_sum, keys of complex type in the map cannot contain NULL and Presto throws the following error:

select map_union(x) from (values (map(array[array[1,null]], array[array[null]])), (map(array[array[2]], array[array[2]]))) as t(x);
Query 20231010_230244_00034_dk9j2 failed: map key cannot be indeterminate: [1, null]

select map_union(map(array[array[array[null, 1]]], array[2]));
Query 20231010_231320_00035_dk9j2 failed: map key cannot be indeterminate: [[null, 1]]

select map_union(map(array[row(null, 1)], array[2]));
Query 20231010_215836_00018_dk9j2 failed: map key cannot be indeterminate: [null, 1]

Presto supports maps having complex types (with NULLs) in the values field for map_union aggregate and no error is seen:

select map_union(x) from (select map(array[1], array[row(null,2)]) x);
            _col0            
-----------------------------
 {1={field0=null, field1=2}} 
(1 row)

select map_union(map(array[array[1],array[2]],array[array[array[1,2]],array[array[null]]]));
            _col0             
------------------------------
 {[1]=[[1, 2]], [2]=[[null]]} 
(1 row)

For map_union_sum, complex types are not supported in the values field:

select map_union_sum(map(array[1],array[2]),map(array[2],array[null]));
Query 20231010_220027_00019_dk9j2 failed: line 1:8: Unexpected parameters (map(integer,integer), map(integer,unknown)) for function map_union_sum. Expected: map_union_sum(map(K,V)) K:comparable, V:nonDecimalNumeric

Presto native has the same behavior as Presto for all these queries. Could you please share a query that reproduces this issue for these two map aggregates, @duanmeng ?

duanmeng commented 1 year ago

Presto native has the same behavior as Presto for all these queries.

@pramodsatya Hi, I haven't started to investigate these functions yet. According to your test, map_union/map_union_sum may not have this issue.

Krishna-Prasad-P-V commented 1 year ago

@duanmeng - I tried to work on multimap_agg and was able to fix it like you did for map_agg. Could you please confirm if this change looks fine? I can raise a PR.

presto:tpch> select multimap_agg(x,y) from (values (2, array[3,1,null]), (1, array[9,2,7,null])) AS tbl(x,y);
                  _col0                  
-----------------------------------------
 {1=[[9, 2, 7, null]], 2=[[3, 1, null]]} 
(1 row)

presto:tpch> select multimap_agg(x,y) from (values (2, array[3,1,null]), (1, array[9,2,7,null])) AS tbl(x,y);
Query 20231011_102917_00012_b7zgv failed: !decoded.base()->containsNullAt(indices[index]) ARRAY comparison not supported for values that contain nulls

presto:tpch> select multimap_agg(x,y) from (values (2, array[array[3,1,null
]]), (1, array[array[9,2,7,null]])) AS tbl(x,y);
Query 20231011_103432_00015_b7zgv failed: !decoded.base()->containsNullAt(indices[index]) ARRAY comparison not supported for values that contain nulls
mbasmanova commented 1 year ago

It appears that Velox does not implement the complex type version of the ArrayFunctions.

@Krishna-Prasad-P-V Can you check if Presto supports these types?

aditi-pandit commented 1 year ago

For map_union and map_union_sum, keys of complex type in the map cannot contain NULL and Presto throws the following error:

select map_union(x) from (values (map(array[array[1,null]], array[array[null]])), (map(array[array[2]], array[array[2]]))) as t(x);
Query 20231010_230244_00034_dk9j2 failed: map key cannot be indeterminate: [1, null]

select map_union(map(array[array[array[null, 1]]], array[2]));
Query 20231010_231320_00035_dk9j2 failed: map key cannot be indeterminate: [[null, 1]]

select map_union(map(array[row(null, 1)], array[2]));
Query 20231010_215836_00018_dk9j2 failed: map key cannot be indeterminate: [null, 1]

Presto supports maps having complex types (with NULLs) in the values field for map_union aggregate and no error is seen:

select map_union(x) from (select map(array[1], array[row(null,2)]) x);
            _col0            
-----------------------------
 {1={field0=null, field1=2}} 
(1 row)

select map_union(map(array[array[1],array[2]],array[array[array[1,2]],array[array[null]]]));
            _col0             
------------------------------
 {[1]=[[1, 2]], [2]=[[null]]} 
(1 row)

For map_union_sum, complex types are not supported in the values field:

select map_union_sum(map(array[1],array[2]),map(array[2],array[null]));
Query 20231010_220027_00019_dk9j2 failed: line 1:8: Unexpected parameters (map(integer,integer), map(integer,unknown)) for function map_union_sum. Expected: map_union_sum(map(K,V)) K:comparable, V:nonDecimalNumeric

Presto native has the same behavior as Presto for all these queries. Could you please share a query that reproduces this issue for these two map aggregates, @duanmeng ?

@pramodsatya : Please can you confirm from the query plans that these queries are executed on the Prestissimo worker and not at the co-ordinator. Since the queries are using all constants, I'm wondering if they are constant folded.

Krishna-Prasad-P-V commented 1 year ago

It appears that Velox does not implement the complex type version of the ArrayFunctions.

@Krishna-Prasad-P-V Can you check if Presto supports these types?

@mbasmanova - Hi Masha. It looks like they are supported in Presto, but only some of them support nested complex-type support. I tried these queries to confirm the same in Presto.

presto:tiny> select array_duplicates((array [array[1, 2], array[1, null], array[1, 2]]));
  _col0
----------
 [[1, 2]]
(1 row)

Query 20231012_100435_00010_kjj6a, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
[Latency: client-side: 199ms, server-side: 177ms] [0 rows, 0B] [0 rows/s, 0B/s]

presto:tiny> select array_has_duplicates((array [array[1, 2], array[1, null], array[1, 2]]));
 _col0
-------
 true
(1 row)

Query 20231012_100549_00011_kjj6a, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
[Latency: client-side: 169ms, server-side: 153ms] [0 rows, 0B] [0 rows/s, 0B/s]

presto:tiny> select array_intersect(array[array[1,null], array[1,null]]);
   _col0
-----------
 [1, null]
(1 row)

Query 20231012_112646_00028_kjj6a, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
[Latency: client-side: 99ms, server-side: 88ms] [0 rows, 0B] [0 rows/s, 0B/s]

But some of the queries are now failing as expected in both Presto and Velox with the changes from @duanmeng .

presto:tiny> select arrays_overlap(array[array[1,null]], array[array[1,null]]);
Query 20231012_113022_00029_kjj6a failed: ARRAY comparison not supported for arrays with null elements

presto:tiny> select array_min(array [array[1, 2], array[1, null]]);
Query 20231012_113941_00037_kjj6a failed: ARRAY comparison not supported for arrays with null elements

presto:tiny> select array_max(array [array[1, 2], array[1, null]]);
Query 20231012_114003_00038_kjj6a failed: ARRAY comparison not supported for arrays with null elements

There are Presto functions which don't use nested complex-types as well

presto:tiny> select array_remove(array[array[1,null]],1);
Query 20231012_113343_00030_kjj6a failed: line 1:8: Unexpected parameters (array(array(integer)), integer) for function array_remove. Expected: array_remove(array(E), E) E:comparable
select array_remove(array[array[1,null]],1)

presto:tiny> select contains(array[array[1,null]],1);
Query 20231012_113629_00032_kjj6a failed: line 1:8: Unexpected parameters (array(array(integer)), integer) for function contains. Expected: contains(array(T), T) T:comparable
select contains(array[array[1,null]],1)

presto:tiny> select array_position(array[array[1,null]],1);
Query 20231012_113800_00035_kjj6a failed: line 1:8: Unexpected parameters (array(array(integer)), integer) for function array_position. Expected: array_position(array(T), T) T:comparable, array_position(array(T), T, bigint) T:comparable
select array_position(array[array[1,null]],1)
mbasmanova commented 1 year ago

@Krishna-Prasad-P-V

I expect all functions that need to compare complex type values would fail if inputs contain nested nulls.

That's because all these functions use ArrayType.equalTo/compareTo, MapType.equalTo and RowType.compareTo methods. A slight gotcha is that many of these functions do not check for nested nulls up front, but only as needed. Hence, they sometimes throw and sometime they don't. See example in https://github.com/prestodb/presto/issues/20965

Notice, that a different query using array_duplicates throws.

presto:di> select array_duplicates((array [array[1, 2], array[1, null], array[1, null]]));

Query 20231012_131819_50523_divi2 failed: map key cannot be null or contain nulls

It would be good to implement consistent semantics in Velox, i.e. functions that need to compare inputs always throw if inputs contain nested nulls.

Hope this helps.

Krishna-Prasad-P-V commented 1 year ago

@duanmeng - Could you please review this changes - https://github.com/facebookincubator/velox/pull/7067

duanmeng commented 1 year ago

@duanmeng - I tried to work on multimap_agg and was able to fix it like you did for map_agg. Could you please confirm if this change looks fine? I can raise a PR.

presto:tpch> select multimap_agg(x,y) from (values (2, array[3,1,null]), (1, array[9,2,7,null])) AS tbl(x,y);
                  _col0                  
-----------------------------------------
 {1=[[9, 2, 7, null]], 2=[[3, 1, null]]} 
(1 row)

presto:tpch> select multimap_agg(x,y) from (values (2, array[3,1,null]), (1, array[9,2,7,null])) AS tbl(x,y);
Query 20231011_102917_00012_b7zgv failed: !decoded.base()->containsNullAt(indices[index]) ARRAY comparison not supported for values that contain nulls

presto:tpch> select multimap_agg(x,y) from (values (2, array[array[3,1,null
]]), (1, array[array[9,2,7,null]])) AS tbl(x,y);
Query 20231011_103432_00015_b7zgv failed: !decoded.base()->containsNullAt(indices[index]) ARRAY comparison not supported for values that contain nulls

@Krishna-Prasad-P-V Hi Krishna, could you check this comment from @mbasmanova https://github.com/facebookincubator/velox/issues/6314#issuecomment-1757735513

Looks like Presto supports this type of query in your PR #7067 description,

presto> select multimap_agg(x,y) from (values (2, array[3,1,null]), (1, array[9,2,7,null])) AS tbl(x,y);
                  _col0
-----------------------------------------
 {1=[[9, 2, 7, null]], 2=[[3, 1, null]]}
(1 row)

Perhaps the following one is a case of the issue, for multimap_mapp we should check the nested nulls of the kyes.

presto> select multimap_agg(x,y) from (values (array[1,null], 1), (array[2,null], 2)) AS tbl(x,y);
Query 20231015_083424_00006_cjx58 failed: ARRAY comparison not supported for arrays with null elements
Krishna-Prasad-P-V commented 1 year ago
select multimap_agg(x,y) from (values (array[1,null], 1), (array[2,null], 2)) AS tbl(x,y);

@duanmeng - We don't support key as a complex type in Velox. So may be add that first and then check for nested NULLs like Presto, right?

presto:tpch> select multimap_agg(x,y) from (values (array[1,null], 1), (array[2,null], 2)) AS tbl(x,y);
Query 20231016_065749_00031_ns577 failed:  Aggregate function signature is not supported: presto.default.multimap_agg(ARRAY<INTEGER>, INTEGER). Supported signatures: (boolean,V) -> map(boolean,array(V)) -> map(boolean,array(V)), (tinyint,V) -> map(tinyint,array(V)) -> map(tinyint,array(V)), (smallint,V) -> map(smallint,array(V)) -> map(smallint,array(V)), (integer,V) -> map(integer,array(V)) -> map(integer,array(V)), (bigint,V) -> map(bigint,array(V)) -> map(bigint,array(V)), (real,V) -> map(real,array(V)) -> map(real,array(V)), (double,V) -> map(double,array(V)) -> map(double,array(V)), (timestamp,V) -> map(timestamp,array(V)) -> map(timestamp,array(V)), (date,V) -> map(date,array(V)) -> map(date,array(V)), (varbinary,V) -> map(varbinary,array(V)) -> map(varbinary,array(V)), (varchar,V) -> map(varchar,array(V)) -> map(varchar,array(V)), (unknown,V) -> map(unknown,array(V)) -> map(unknown,array(V)).
duanmeng commented 1 year ago

We don't support key as a complex type in Velox. So may be add that first and then check for nested NULLs like Presto, right?

@Krishna-Prasad-P-V Likely so. cc @mbasmanova @aditi-pandit

Update: Presto supports a complex type key for multimap_agg. I tried select orderkey, map_agg(array[orderkey], 1) from lineitem group by orderkey limit 10, and it succeeded in the Presto exec engine. The client failed to parse the result as JSON (see https://github.com/prestodb/presto/commit/16ff15f7584bfe980a8c673b37f3e9c3d89bf248), but this is not the blocker.