facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0

str_to_map Spark function doesn't match Spark #10502

Open · nullptroot opened this issue 1 month ago

nullptroot commented 1 month ago

Bug description

The behavior of the Spark function str_to_map differs from the current Velox implementation. In Spark, the last two of the three parameters (the entry delimiter and the key/value delimiter) are optional, but the current implementation requires all three.

System information

Velox System Info v0.0.2
Commit: 09e1b0d0e77738bd6974922c60e921906c41942d
CMake Version: 3.26.5
System: Linux-5.4.241-1-tlinux4-0017.10
Arch: x86_64
C++ Compiler: /usr/lib64/ccache/c++
C++ Compiler Version: 8.5.0
C Compiler: /usr/lib64/ccache/cc
C Compiler Version: 8.5.0
CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt


Relevant logs

No response

mbasmanova commented 1 month ago

CC: @rui-mo

rui-mo commented 1 month ago

The last two of the three parameters in Spark are optional, but the current implementation is fixed.

Spark uses the default values "," and ":" for the last two parameters when the user does not supply them, so we can fill in those defaults when calling the Velox str_to_map function. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L567-L573
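If we go that route, the one- and two-argument forms could be exposed by registering extra signatures for the same function. The following is only a sketch under that assumption; the struct name StrToMapFunction and the registration site are illustrative, not the actual Velox code.

// Sketch only, not the actual registration code: register str_to_map for
// one, two, and three arguments so Spark's defaults ("," and ":") can be
// applied inside the function when the trailing arguments are omitted.
registerFunction<StrToMapFunction, Map<Varchar, Varchar>, Varchar>(
    {"str_to_map"});
registerFunction<StrToMapFunction, Map<Varchar, Varchar>, Varchar, Varchar>(
    {"str_to_map"});
registerFunction<
    StrToMapFunction,
    Map<Varchar, Varchar>,
    Varchar,
    Varchar,
    Varchar>({"str_to_map"});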

But I also noticed that Spark differs in that it does not require the delimiters to be a single character.

spark.sql("select str_to_map('a:1,b:2,c:3', ',,', '::')").show(false)
+-------------------------------+
|str_to_map(a:1,b:2,c:3, ,,, ::)|
+-------------------------------+
|{a:1,b:2,c:3 -> NULL}          |
+-------------------------------+

nullptroot commented 1 month ago

Yes. Spark allows the last two parameters to be omitted, while Velox's str_to_map accepts only the three-parameter form. Spark can also handle delimiters whose size is not 1.

nullptroot commented 1 month ago

The code I wrote registers two additional function signatures and uses StringView directly as the function parameter type, which reduces stack usage.
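For context, a rough sketch of what that could look like with the simple-function API; the names, the overload structure, and the empty parsing helper are illustrative assumptions, not the actual patch.

// Illustrative sketch only. Extra 'call' overloads cover the one- and
// two-argument forms; arg_type<Varchar> is a StringView, so the raw string
// data is referenced rather than copied.
template <typename T>
struct StrToMapFunction {
  VELOX_DEFINE_FUNCTION_TYPES(T);

  // str_to_map(text): use Spark defaults "," and ":".
  void call(
      out_type<Map<Varchar, Varchar>>& out,
      const arg_type<Varchar>& input) {
    doCall(out, input, StringView(","), StringView(":"));
  }

  // str_to_map(text, entryDelim): default key/value delimiter ":".
  void call(
      out_type<Map<Varchar, Varchar>>& out,
      const arg_type<Varchar>& input,
      const arg_type<Varchar>& entryDelim) {
    doCall(out, input, entryDelim, StringView(":"));
  }

  // str_to_map(text, entryDelim, keyValueDelim): existing three-argument form.
  void call(
      out_type<Map<Varchar, Varchar>>& out,
      const arg_type<Varchar>& input,
      const arg_type<Varchar>& entryDelim,
      const arg_type<Varchar>& keyValueDelim) {
    doCall(out, input, entryDelim, keyValueDelim);
  }

 private:
  void doCall(
      out_type<Map<Varchar, Varchar>>& out,
      StringView input,
      StringView entryDelim,
      StringView keyValueDelim) {
    // Split 'input' on 'entryDelim', then split each entry on
    // 'keyValueDelim', writing the key/value pairs into 'out'.
    // Omitted here; multi-character delimiter handling is sketched below.
  }
};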

rui-mo commented 1 month ago

@nullptroot Would you like to work on the fix? If you don't have enough bandwidth, I'm glad to work on it as well.

nullptroot commented 1 month ago

I can do it. So far I have only adapted the function to accept different numbers of parameters; next I plan to handle separators longer than one character.
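A minimal standalone sketch (plain C++, not Velox code) of how the splitting could handle delimiters longer than one character; the function name strToMap and the use of std::optional to stand in for a NULL value are illustrative. With the inputs from the example above it produces "a:1,b:2,c:3 -> NULL", matching the Spark output shown earlier.

#include <iostream>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Split 'input' on 'entryDelim'; within each entry, everything before the
// first 'keyValueDelim' is the key and the rest is the value. If
// 'keyValueDelim' does not occur in an entry, the value is NULL
// (std::nullopt here). Delimiters may be longer than one character.
std::vector<std::pair<std::string, std::optional<std::string>>> strToMap(
    std::string_view input,
    std::string_view entryDelim,
    std::string_view keyValueDelim) {
  std::vector<std::pair<std::string, std::optional<std::string>>> result;
  size_t pos = 0;
  while (true) {
    // Treat an empty entry delimiter as "not found" to avoid looping forever.
    size_t next = entryDelim.empty() ? std::string_view::npos
                                     : input.find(entryDelim, pos);
    std::string_view entry = (next == std::string_view::npos)
        ? input.substr(pos)
        : input.substr(pos, next - pos);

    size_t kv = entry.find(keyValueDelim);
    if (kv == std::string_view::npos) {
      result.emplace_back(std::string(entry), std::nullopt);
    } else {
      result.emplace_back(
          std::string(entry.substr(0, kv)),
          std::string(entry.substr(kv + keyValueDelim.size())));
    }

    if (next == std::string_view::npos) {
      break;
    }
    pos = next + entryDelim.size();
  }
  return result;
}

int main() {
  // Reproduces the discussion example:
  // str_to_map('a:1,b:2,c:3', ',,', '::') -> {a:1,b:2,c:3 -> NULL}
  for (const auto& [key, value] : strToMap("a:1,b:2,c:3", ",,", "::")) {
    std::cout << key << " -> " << (value ? *value : "NULL") << "\n";
  }
  return 0;
}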