facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.52k stars 1.15k forks source link

Add split_to_map Presto lambda function #10330

Closed mbasmanova closed 4 months ago

mbasmanova commented 4 months ago

Description

split_to_map Presto function allows user to pass a lambda to decide which value to keep in case where there are duplicate keys.

https://prestodb.io/docs/current/functions/string.html#id2

For example, one can specify a lambda to concatenate all values for the same key:

presto> select split_to_map('1:a,2:b,1:c,1:d', ',', ':');
Query 20240627_001911_00862_bvpmi failed: Duplicate keys (1) are not allowed. Specifying a lambda to resolve conflicts can avoid this error

presto> select split_to_map('1:a,2:b,1:c,1:d', ',', ':', (k, v1, v2) -> v1 || v2);
    _col0
--------------
 {1=acd, 2=b}

This function is challenging to implement in a vectorized engine. If a key repeats N times, lambda function needs to be evaluated N - 1 times. This challenge is similar to 'reduce' function.

Let's say key 'k' repeats N times with values v1, v2,...vN. To find out the value to store in the result map, we need to evaluate user-specified lambda 'f' N-1 times:

v := f(v1, v2)
v := f(v, v3)
v := f(v, v4)
...
v := f(v, vN)

Looking at the production use cases, we notice that there are only 2 lambdas: (1) pick first value; (2) pick last value.

20240626_194901_19541_ggbp2 - (k, v1, v2) -> v2 - pick last
20240624_161158_52122_gmrcw - (k, v1, v2) -> v1 - pick first

Hence, we propose to implement partial support for split_to_map lambda function, i.e. implement this function pick-first and pick-last lambdas. This would be somewhat similar to partial support implemented for array_sort: https://velox-lib.io/blog/array-sort

CC: @amitkdutta @Yuhta @rschlussel @bikramSingh91 @pedroerp

mbasmanova commented 4 months ago

Note: it would be easier to support arbitrary lambdas if lambda signature was changed from taking a key and 2 values to taking a key and an array of at least 2 values.

Current: (k, v1, v2) -> v Preferred: (k, array(v)) -> v, where input array is guaranteed to have at least 2 entries and the order of entries in the array matches the order in which they appear in the input string.

CC: @kaikalur @tdcmeehan

kaikalur commented 4 months ago

This could be the presto->velox translation. we don't need to change presto syntax for that