[VL] Unsupported spark function list [please leave a comment if you plan to pick some]

PHILO-HE commented 8 months ago

Description

Listed here are the Spark functions not yet supported by the Gluten Velox backend. Please leave a comment if you'd like to pick some. In the list below, [x] means someone is already working on the corresponding function. You can find the support status of all functions in this gluten doc.

To avoid duplicate work, before starting, please check whether a PR has already been submitted to the Velox community, or whether the function is already implemented in Velox, which holds most SQL functions in its sparksql and prestosql folders.

Reference:

spark sql expression
spark built-in functions
[x] percentile_approx/approx_percentile (WIP, guangxin)
[x] concat_ws (PR ready, https://github.com/facebookincubator/velox/pull/8854)
[x] unix_timestamp: "Only supports string type, with session timezone considered, todo: support date type"
[x] locate
[x] parse_url (PR drafted, not merged)
[x] urldecoder: "UDF, supported by spark as a built-in function since 3.4.0."
[ ] normalizenanandzero
[x] arrayintersects
[ ] default.json_split (udf, no need to impl.): "external UDF"
[ ] parsejsonarray: "external UDF"
[x] struct
[x] percentile (@Yohahaha)
[x] first/first_value (@JkSelf)
[x] last/last_value (@JkSelf)
[x] posexplode (WIP, @marin-ma)
[x] trunc (WIP, HannanKan)
[x] months_between (PR ready)
[x] date_trunc (WIP, HannanKan)
[ ] stack
[ ] grouping_id
[x] printf (@Surbhi-Vijay)
[x] space (WIP, rhh777)
[x] inline (WIP, @marin-ma)
[x] to_unix_timestamp: "Only supports string type, with session timezone considered. todo: support date type"
[ ] from_csv
[ ] from_json
[ ] json_object_keys
[ ] json_tuple
[ ] schema_of_csv
[ ] schema_of_json
[ ] to_csv
[x] to_json (presumably workable using a folly function)
[x] make_ym_interval (WIP, @marin-ma)
[x] make_timestamp (WIP, @marin-ma)
[ ] make_interval
[ ] make_dt_interval
[x] from_utc_timestamp (@acvictor)
[ ] extract
[ ] exists (@lyy-pineapple)
[ ] date_part
[ ] zip_with
[x] transform (@Yohahaha)
[ ] transform_keys
[ ] transform_values
[x] map_from_entries (WIP, MaYan)
[x] map_filter (WIP, MaYan)
[x] map_entries (Done, by MaYan)
[ ] map_concat
[x] forall (@lyy-pineapple)
[x] flatten (@ivoson)
[ ] filter
[x] filter (array) (@ivoson)
[ ] width_bucket
[x] array_sort (@boneanxs)
[ ] xpath
[ ] xpath_boolean
[ ] xpath_double
[ ] xpath_float
[ ] xpath_int
[ ] xpath_long
[ ] xpath_number
[ ] xpath_short
[ ] xpath_string
[ ] unbase64 (WIP, @fyp711)
[ ] decode (partially supported if translated to caseWhen. WIP Cody)
[ ] initcap (WIP, velox PR: 8676)
[x] unix_date (velox PR 8725, completed)
[ ] count_min_sketch
[x] bool_and/every (@mskapilks)
[x] bool_or/any/some (@mskapilks)
[x] shuffle (completed)
[x] bround (@xumingming)
[x] format_string (@gaoyangxiaozhu)
[x] format_number (@gaoyangxiaozhu)
[x] soundex (@zhli1142015)
[x] levenshtein (@zhli1142015)
[x] cot (@honeyhexin)
[x] expm1 (@Donvi)
[x] stack (generator function, @xumingming)
[x] randn (@Donvi)
[x] empty2null (internal function, @jinchengchenghh)
[x] toprettystring (internal function, @jinchengchenghh)
[x] AtLeastNNonNulls (internal function, @zhli1142015)
Since Spark-3.3 (related to ML, low priority)
[ ] regr_count
[ ] regr_avgx
[ ] regr_avgy
[x] regr_r2
[ ] regr_sxx
[x] regr_sxy
[ ] regr_syy
[ ] regr_slope
[ ] regr_intercept
Since Spark-3.3
Since Spark-3.4
[ ] mode
[x] get (@Yohahaha)
[x] array_append (@ivoson)
[x] array_insert (@ivoson)
[x] mode (@zhli1142015)
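
For the unix_timestamp and to_unix_timestamp entries above, "with session timezone considered" means that a timestamp string without an explicit zone is interpreted in the session time zone before it is turned into epoch seconds. Below is a minimal Python sketch of that idea only; it is not Gluten or Velox code. The function name is hypothetical, the fixed-offset zones stand in for the session time zone setting, and the format string uses Python strptime directives rather than Spark datetime patterns.

```python
from datetime import datetime, timedelta, timezone

def unix_timestamp_sketch(text: str, session_tz: timezone,
                          fmt: str = "%Y-%m-%d %H:%M:%S") -> int:
    """Interpret a zone-less timestamp string in the session time zone
    and return seconds since the Unix epoch (hypothetical helper)."""
    naive = datetime.strptime(text, fmt)          # no zone information yet
    localized = naive.replace(tzinfo=session_tz)  # attach the session zone
    return int(localized.timestamp())

utc = timezone.utc
utc_plus_8 = timezone(timedelta(hours=8))  # assumed stand-in for a UTC+8 session zone

# The same string maps to different epoch seconds under different session zones.
print(unix_timestamp_sketch("1970-01-02 00:00:00", utc))         # 86400
print(unix_timestamp_sketch("1970-01-02 00:00:00", utc_plus_8))  # 57600
```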

NEUpanning commented 3 months ago

@PHILO-HE Thanks for your feedback. I'd like to take date_part. Is to_date supported in Gluten now? It doesn't show in the list. I would also like to pick it.

PHILO-HE commented 3 months ago

@NEUpanning, this list only tracks functions that are still in progress. I think to_date is already supported. See https://github.com/apache/incubator-gluten/blob/main/docs/velox-backend-support-progress.md.

date_part may also be supported. I noticed the test below in Gluten. You can confirm whether all date patterns are supported. https://github.com/apache/incubator-gluten/blob/d74fc97cf941759c79f440b0df5c5071655b984e/backends-velox/src/test/scala/org/apache/gluten/execution/ScalarFunctionsValidateSuite.scala#L808

NEUpanning commented 3 months ago

I can't find any implementation of the date_part and to_date functions in Velox. Could you help me find them? Thanks.

xumingming commented 3 months ago

shuffle, array_sort are already supported, can be marked as complete.

xumingming commented 3 months ago

I will take a look at bround.

PHILO-HE commented 3 months ago

@NEUpanning, not a direct replacement. date_part is covered here. to_date is converted to Cast + GetTimestamp by Spark.
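
The rewrite mentioned above can be pictured as two steps: parse the string into a timestamp with the given pattern, then cast that timestamp to a date. The Python sketch below mimics only the shape of that composition and is not Spark, Gluten, or Velox code. All function names are hypothetical, the pattern uses Python strptime directives rather than Spark datetime patterns, and returning None on a parse failure is an assumption modeled on non-ANSI behavior.

```python
from datetime import date, datetime
from typing import Optional

def get_timestamp_sketch(text: str, fmt: str) -> Optional[datetime]:
    """Step 1 (hypothetical): parse a string into a timestamp, or None on failure."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None  # assumption: a failed parse yields null instead of an error

def cast_to_date_sketch(ts: Optional[datetime]) -> Optional[date]:
    """Step 2 (hypothetical): cast a timestamp to a date by dropping the time part."""
    return None if ts is None else ts.date()

def to_date_sketch(text: str, fmt: str) -> Optional[date]:
    """to_date expressed as the composition Cast(GetTimestamp(text, fmt))."""
    return cast_to_date_sketch(get_timestamp_sketch(text, fmt))

print(to_date_sketch("2024-03-15 10:30:00", "%Y-%m-%d %H:%M:%S"))  # 2024-03-15
print(to_date_sketch("not a date", "%Y-%m-%d"))                    # None
```

Seen this way, a backend does not need a dedicated to_date function as long as it supports the two underlying expressions.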

PHILO-HE commented 3 months ago

@xumingming, seems sort_array is supported, but array_sort is not. Please spare some time to confirm. Thanks!
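
The two names are easy to confuse, but the Spark functions differ in more than spelling. One documented difference is null placement: sort_array puts nulls first in ascending order and last in descending order, while array_sort with its default comparator puts nulls last. A minimal Python sketch of just that difference follows; the function names are hypothetical and array_sort's optional comparator lambda is not modeled.

```python
from typing import List, Optional

def sort_array_sketch(values: List[Optional[int]],
                      ascending: bool = True) -> List[Optional[int]]:
    """Model of sort_array null placement: nulls first when ascending,
    nulls last when descending."""
    nulls = [v for v in values if v is None]
    non_nulls = sorted((v for v in values if v is not None), reverse=not ascending)
    return nulls + non_nulls if ascending else non_nulls + nulls

def array_sort_sketch(values: List[Optional[int]]) -> List[Optional[int]]:
    """Model of array_sort with the default comparator: ascending, nulls last."""
    nulls = [v for v in values if v is None]
    non_nulls = sorted(v for v in values if v is not None)
    return non_nulls + nulls

data = [3, None, 1, 2]
print(sort_array_sketch(data))                   # [None, 1, 2, 3]
print(sort_array_sketch(data, ascending=False))  # [3, 2, 1, None]
print(array_sort_sketch(data))                   # [1, 2, 3, None]
```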

Donvi commented 3 months ago

As I see only rand exists and no randn, I'm taking randn

xumingming commented 3 months ago

@PHILO-HE array_sort is marked as supported in the doc: https://github.com/apache/incubator-gluten/blob/e5dcbe3884d5215cc652246476b1ec980c859d4c/docs/velox-backend-support-progress.md?plain=1#L273

And there is a test for collect_set which used array_sort https://github.com/apache/incubator-gluten/blob/d35d1dc5e4450fdf58b8092ea26a0c928de29a48/backends-velox/src/test/scala/org/apache/gluten/execution/VeloxAggregateFunctionsSuite.scala#L846

PHILO-HE commented 3 months ago

@xumingming, this test only confirms aggregate is offloaded. In my local test, array_sort is not offloaded actually.

boneanxs commented 3 months ago

@PHILO-HE I can try to support array_sort if no one has picked it. We need this function internally :)

Donvi commented 3 months ago

unbase64: #4482

I see you've mapped from_base64 to unbase64. Correspondingly, base64 looks almost the same as to_base64, so is that mapping simply missing, or is there some other consideration?

PHILO-HE commented 3 months ago

@Donvi, seems there are a few semantic differences between Spark's unbase64 & Velox's from_base64. So the simple mapping has not been accepted by the community. See discussion: https://github.com/apache/incubator-gluten/pull/5242#discussion_r1548887962. I guess similarly to_base64 cannot be mapped due to some unknown differences.

gaoyangxiaozhu commented 2 months ago

FYI, I am working on mask function support. @PHILO-HE

zhli1142015 commented 2 months ago

I'd like to pick up mode, thanks

jinchengchenghh commented 2 months ago

Can you add empty2null to the list? @PHILO-HE

PHILO-HE commented 2 months ago

Just added.

jinchengchenghh commented 2 months ago

Thanks!

jinchengchenghh commented 2 months ago

Can you add the function toprettystring to the list? Thanks! @PHILO-HE The query below uses it. I will take it.

select sum(hash(floor(l_extendedprice)) * l_discount + hash(l_orderkey) + hash(l_partkey) + hash(l_suppkey) + hash(l_linenumber) + hash(l_comment) + hash(l_shipinstruct)) as revenue from lineitem;

zhli1142015 commented 1 month ago

I would like to take AtLeastNNonNulls, thanks.

jinchengchenghh commented 1 month ago

Some other unsupported functions are listed here: https://github.com/apache/incubator-gluten/blob/main/cpp/velox/substrait/SubstraitToVeloxPlanValidator.cc#L62 Functions for which some data types or behaviors do not align with Spark are listed here: https://github.com/apache/incubator-gluten/blob/main/cpp/velox/substrait/SubstraitToVeloxPlanValidator.cc#L188

apache / incubator-gluten

[VL] Unsupported spark function list [please leave a comment if you plan to pick some] #4039
