apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL] Unsupported spark function list [please leave a comment if you plan to pick some] #4039

Open PHILO-HE opened 8 months ago

PHILO-HE commented 8 months ago

Description

Listed here are the Spark functions not yet supported by the Gluten Velox backend. Please leave a comment if you'd like to pick some. In the list below, [√] means someone is already working on the corresponding function. You can find the support status of all functions in this Gluten doc.

To avoid duplicate work, before starting, please check whether a PR has already been submitted to the Velox community, or whether the function is already implemented in Velox, which keeps most SQL functions in its sparksql and prestosql folders.

Reference:

NEUpanning commented 3 months ago

@PHILO-HE Thanks for your feedback. I'd like to take date_part. Is to_date supported in Gluten now? It doesn't show up in the list. I would also like to pick it.

PHILO-HE commented 3 months ago

@NEUpanning, this list only tracks functions that are still in progress. I think to_date is already supported. See https://github.com/apache/incubator-gluten/blob/main/docs/velox-backend-support-progress.md.

date_part may also be supported. I noticed the test below in Gluten. You can confirm whether all date patterns are supported. https://github.com/apache/incubator-gluten/blob/d74fc97cf941759c79f440b0df5c5071655b984e/backends-velox/src/test/scala/org/apache/gluten/execution/ScalarFunctionsValidateSuite.scala#L808

NEUpanning commented 3 months ago

I can't find any implementation of the date_part and to_date functions in Velox. Could you help me find them? Thanks.

xumingming commented 3 months ago

shuffle and array_sort are already supported and can be marked as complete.

xumingming commented 3 months ago

I will take a look at bround.
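
For context, Spark's bround rounds to a given number of decimal places using HALF_EVEN (banker's) rounding, unlike round, which uses HALF_UP. A minimal Python sketch of that rounding rule follows; it is only a model built on Python's decimal module (the helper name and the string-based conversion of the input are assumptions), not Spark or Velox code.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def bround(value, scale=0):
    """Model of HALF_EVEN rounding to `scale` decimal places.

    Assumption: the input is converted through its decimal string form,
    so 2.5 is treated as exactly 2.5 rather than its binary expansion.
    """
    d = Decimal(str(value))
    quantum = Decimal(1).scaleb(-scale)  # 10 ** -scale, e.g. 0.01 for scale=2
    return d.quantize(quantum, rounding=ROUND_HALF_EVEN)


if __name__ == "__main__":
    for args in [(2.5,), (3.5,), (2.345, 2), (25, -1)]:
        print(args, "->", bround(*args))
```

Ties go to the nearest even digit, so 2.5 and 3.5 round to different neighbours, which is the behaviour a native implementation has to match.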

PHILO-HE commented 3 months ago

@NEUpanning, not a direct replacement. date_part is covered here. to_date is converted to Cast + GetTimestamp by Spark.

PHILO-HE commented 3 months ago

@xumingming, it seems sort_array is supported, but array_sort is not. Please take some time to confirm. Thanks!
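
The two names are easy to confuse. Per the Spark SQL function docs, sort_array puts null elements first in ascending order (and last in descending order), while array_sort without a comparator sorts ascending and puts nulls last. A minimal Python sketch of that difference follows; it models the documented semantics only (None stands in for SQL NULL, and the helper names simply mirror the SQL functions), and is not Spark or Velox code.

```python
def sort_array(values, ascending=True):
    """Model of sort_array: nulls first when ascending, last when descending."""
    nulls = [v for v in values if v is None]
    rest = sorted((v for v in values if v is not None), reverse=not ascending)
    return nulls + rest if ascending else rest + nulls


def array_sort(values):
    """Model of array_sort with no comparator: ascending, nulls last."""
    nulls = [v for v in values if v is None]
    rest = sorted(v for v in values if v is not None)
    return rest + nulls


if __name__ == "__main__":
    data = [3, None, 1]
    print(sort_array(data))         # nulls lead in ascending order
    print(sort_array(data, False))  # nulls trail in descending order
    print(array_sort(data))         # nulls always trail
```

Because the null placement differs, support for one function does not imply support for the other.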

Donvi commented 3 months ago

Since I see that only rand exists and not randn, I'm taking randn.

xumingming commented 3 months ago

@PHILO-HE array_sort is marked as supported in the doc: https://github.com/apache/incubator-gluten/blob/e5dcbe3884d5215cc652246476b1ec980c859d4c/docs/velox-backend-support-progress.md?plain=1#L273

There is also a test for collect_set that uses array_sort: https://github.com/apache/incubator-gluten/blob/d35d1dc5e4450fdf58b8092ea26a0c928de29a48/backends-velox/src/test/scala/org/apache/gluten/execution/VeloxAggregateFunctionsSuite.scala#L846

PHILO-HE commented 3 months ago

@xumingming, this test only confirms that the aggregate is offloaded. In my local test, array_sort is actually not offloaded.

boneanxs commented 3 months ago

@PHILO-HE I can try to support array_sort if no one has picked it. We need this function internally :)

Donvi commented 3 months ago

ubase64: #4482

I see you've mapped from_base64 to unbase64. Likewise, base64 looks almost the same as to_base64, so is that mapping simply missing, or is there some other consideration?

PHILO-HE commented 3 months ago

@Donvi, it seems there are a few semantic differences between Spark's unbase64 and Velox's from_base64, so the simple mapping has not been accepted by the community. See the discussion: https://github.com/apache/incubator-gluten/pull/5242#discussion_r1548887962. I guess to_base64 similarly cannot be mapped due to some as yet unidentified differences.

gaoyangxiaozhu commented 2 months ago

FYI, I am working on mask function support. @PHILO-HE

zhli1142015 commented 2 months ago

I'd like to pick up mode, thanks

jinchengchenghh commented 2 months ago

Can you add empty2null to the list? @PHILO-HE

PHILO-HE commented 2 months ago

Just added.

jinchengchenghh commented 2 months ago

Thanks!

jinchengchenghh commented 2 months ago

Can you add the function toprettystring to the list? Thanks! @PHILO-HE The query below uses it. I will take it.

select sum(hash(floor(l_extendedprice)) * l_discount + hash(l_orderkey) + hash(l_partkey) + hash(l_suppkey) + hash(l_linenumber) + hash(l_comment) + hash(l_shipinstruct)) as revenue from lineitem;

zhli1142015 commented 1 month ago

I would like to take AtLeastNNonNulls, thanks.
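
For context, Spark's AtLeastNNonNulls is the predicate that evaluates to true when at least n of its inputs are non-null and, for float and double inputs, also not NaN. A minimal Python sketch of that check follows; it is a model of the semantics only (None stands in for SQL NULL, and the function name is hypothetical), not Spark or Velox code.

```python
import math


def at_least_n_non_nulls(n, values):
    """Return True if at least `n` values are neither None nor a float NaN."""
    count = 0
    for v in values:
        if v is None:
            continue  # SQL NULL never counts
        if isinstance(v, float) and math.isnan(v):
            continue  # NaN is treated like a missing value
        count += 1
        if count >= n:
            return True  # stop early once the threshold is reached
    return count >= n


if __name__ == "__main__":
    row = [1, None, float("nan"), "a"]
    print(at_least_n_non_nulls(2, row))
    print(at_least_n_non_nulls(3, row))
```

The NaN handling is the part most likely to be missed, since a plain null count would treat NaN as a valid value.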

jinchengchenghh commented 1 month ago

Some other unsupported functions are listed here: https://github.com/apache/incubator-gluten/blob/main/cpp/velox/substrait/SubstraitToVeloxPlanValidator.cc#L62 And here are some functions for which certain data types or behaviors do not align with Spark: https://github.com/apache/incubator-gluten/blob/main/cpp/velox/substrait/SubstraitToVeloxPlanValidator.cc#L188