Closed mbasmanova closed 4 months ago
Note: it would be easier to support arbitrary lambdas if the lambda signature were changed from taking a key and 2 values to taking a key and an array of at least 2 values.
Current: (k, v1, v2) -> v
Preferred: (k, array(v)) -> v, where the input array is guaranteed to have at least 2 entries and the order of entries in the array matches the order in which they appear in the input string.
CC: @kaikalur @tdcmeehan
This could be done in the Presto-to-Velox translation; we don't need to change Presto syntax for that.
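To make the two signatures concrete, here is a minimal Python sketch of how a translation layer might wrap a user-supplied (k, v1, v2) -> v lambda into the (k, array(v)) -> v form. The helper name `to_array_lambda` and the three array-style lambdas are hypothetical and exist only for illustration; this is not Presto or Velox code.

```python
def to_array_lambda(f):
    """Wrap a (k, v1, v2) -> v lambda into a (k, [v, ...]) -> v lambda.

    Hypothetical adapter: values are combined in input order, so the
    result matches repeated application of the original lambda.
    """
    def g(key, values):
        if len(values) < 2:
            raise ValueError("expected at least 2 values for a duplicate key")
        acc = values[0]
        for v in values[1:]:
            acc = f(key, acc, v)
        return acc
    return g


# With the array signature, common lambdas need one evaluation per key.
pick_first = lambda k, values: values[0]
pick_last = lambda k, values: values[-1]
concat_all = lambda k, values: "".join(values)

wrapped = to_array_lambda(lambda k, v1, v2: v1 + v2)
print(wrapped("k", ["1", "2", "3"]))     # prints 123
print(concat_all("k", ["1", "2", "3"]))  # prints 123
```

The wrapped lambda still calls the original one once per extra value; the array-style lambdas show why a native array signature could be cheaper to evaluate.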
Description
The Presto function split_to_map allows the user to pass a lambda that decides which value to keep when there are duplicate keys.
https://prestodb.io/docs/current/functions/string.html#id2
For example, one can specify a lambda to concatenate all values for the same key:
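A minimal Python model of that behavior, offered as an illustration rather than the example from the Presto documentation: the function below mimics the split_to_map semantics described here, assuming the lambda receives the key, the value stored so far, and the newly seen value. The input string and delimiters are made up for the example.

```python
def split_to_map(text, entry_delim, kv_delim, f):
    """Illustrative model of split_to_map with a duplicate-key lambda.

    f(key, v1, v2) is called whenever `key` has already been seen;
    v1 is the value stored so far and v2 is the new value.
    """
    result = {}
    if not text:
        return result
    for entry in text.split(entry_delim):
        key, _, value = entry.partition(kv_delim)
        if key in result:
            result[key] = f(key, result[key], value)
        else:
            result[key] = value
    return result


# Concatenate all values that share a key.
print(split_to_map("a:1;b:2;a:3", ";", ":", lambda k, v1, v2: v1 + v2))
# prints {'a': '13', 'b': '2'}
```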
This function is challenging to implement in a vectorized engine. If a key repeats N times, the lambda function needs to be evaluated N - 1 times, and each evaluation depends on the result of the previous one. This challenge is similar to the one posed by the 'reduce' function.
Let's say key 'k' repeats N times with values v1, v2, ..., vN. To find out the value to store in the result map, we need to evaluate the user-specified lambda 'f' N - 1 times: first f(k, v1, v2), then f applied to k, that result, and v3, and so on through vN.
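The chain of evaluations can be sketched as a left fold. The Python snippet below is only a model of the semantics, assuming values are combined in input order; `resolve_duplicates` is a hypothetical name, and the call counter is there to show the N - 1 evaluations.

```python
from functools import reduce


def resolve_duplicates(key, values, f):
    """Fold values v1..vN of one key with f(key, acc, v).

    Returns (result, number_of_lambda_calls). Assumes `values` is non-empty.
    """
    calls = 0

    def counted(acc, v):
        nonlocal calls
        calls += 1
        return f(key, acc, v)

    # Without an initial value, reduce starts from v1 and calls
    # `counted` once for each remaining value: N - 1 calls in total.
    result = reduce(counted, values)
    return result, calls


print(resolve_duplicates("k", ["1", "2", "3", "4"], lambda k, a, b: a + b))
# prints ('1234', 3)
```

Each call consumes the previous call's output, which is why the evaluations for a single key cannot simply be run as one batch.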
Looking at the production use cases, we notice that only 2 lambdas are used: (1) pick the first value; (2) pick the last value.
Hence, we propose to implement partial support for the split_to_map lambda function, i.e. implement this function only for the pick-first and pick-last lambdas. This would be somewhat similar to the partial support implemented for array_sort: https://velox-lib.io/blog/array-sort
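A rough Python sketch of what such partial support could look like, assuming the engine can recognize the lambda bodies that return v1 or v2 and map them to a fixed policy. The `DuplicatePolicy` enum and the function name are hypothetical; the actual Velox implementation may be structured differently.

```python
from enum import Enum


class DuplicatePolicy(Enum):
    FIRST = "first"  # corresponds to (k, v1, v2) -> v1
    LAST = "last"    # corresponds to (k, v1, v2) -> v2


def split_to_map_with_policy(text, entry_delim, kv_delim, policy):
    """Build the map in one pass without evaluating any lambda."""
    result = {}
    if not text:
        return result
    for entry in text.split(entry_delim):
        key, _, value = entry.partition(kv_delim)
        # LAST always overwrites; FIRST writes only for unseen keys.
        if policy is DuplicatePolicy.LAST or key not in result:
            result[key] = value
    return result


print(split_to_map_with_policy("a:1;b:2;a:3", ";", ":", DuplicatePolicy.FIRST))
# prints {'a': '1', 'b': '2'}
print(split_to_map_with_policy("a:1;b:2;a:3", ";", ":", DuplicatePolicy.LAST))
# prints {'a': '3', 'b': '2'}
```

Because neither policy depends on a previous lambda result, the per-key work reduces to a keep-or-overwrite decision.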
CC: @amitkdutta @Yuhta @rschlussel @bikramSingh91 @pedroerp