[VL] Unsupported spark function list [please leave a comment if you plan to pick some]

PHILO-HE commented 11 months ago

Description

Listed here are the Spark functions not yet supported by the Gluten Velox backend. Please leave a comment if you'd like to pick some. In the list below, [√] means someone is already working on the corresponding function. You can find the support status of all functions in this gluten doc.

To avoid duplicate work, before starting, please check whether a PR has already been submitted in the Velox community, or whether the function is already implemented in Velox, which holds most SQL functions in its sparksql and prestosql folders.

Reference:

spark sql expression
spark built-in functions
[x] percentile_approx/approx_percentile (WIP, guangxin)
[x] concat_ws (PR ready, https://github.com/facebookincubator/velox/pull/8854)
[x] unix_timestamp: "Only supports string type, with session timezone considered, todo: support date type"
[x] locate
[x] parse_url (PR drafted, not merged)
[x] urldecoder: "UDF, supported by spark as a built-in function since 3.4.0."
[ ] normalizenanandzero
[x] arrayintersects
[ ] default.json_split (udf, no need to impl.): "external UDF"
[ ] parsejsonarray: "external UDF"
[x] struct
[x] percentile (@Yohahaha)
[x] first/first_value (@JkSelf)
[x] last/last_value (@JkSelf)
[x] posexplode (WIP, @marin-ma)
[x] trunc (WIP, HannanKan)
[x] months_between (PR ready)
[x] date_trunc (WIP, HannanKan)
[ ] stack
[ ] grouping_id
[x] printf (@Surbhi-Vijay)
[x] space (WIP, rhh777)
[x] inline (WIP, @marin-ma)
[x] to_unix_timestamp: "Only supports string type, with session timezone considered. todo: support date type"
[ ] from_csv
[ ] from_json
[ ] json_object_keys
[ ] json_tuple
[ ] schema_of_csv
[ ] schema_of_json
[ ] to_csv
[x] to_json (should be workable using a folly function)
[x] make_ym_interval (WIP, @marin-ma)
[x] make_timestamp (WIP, @marin-ma)
[ ] make_interval
[ ] make_dt_interval
[ ] monotonically_increasing_id
[x] from_utc_timestamp (@acvictor)
[ ] extract
[ ] exists (@lyy-pineapple)
[ ] date_part
[ ] zip_with
[x] transform (@Yohahaha)
[ ] transform_keys
[ ] transform_values
[x] map_from_entries (WIP, MaYan)
[x] map_filter (WIP, MaYan)
[x] map_entries (Done, by MaYan)
[ ] map_concat
[x] forall (@lyy-pineapple)
[x] flatten (@ivoson)
[ ] filter
[x] filter (array) (@ivoson)
[ ] width_bucket
[x] array_sort (@boneanxs)
[ ] xpath
[ ] xpath_boolean
[ ] xpath_double
[ ] xpath_float
[ ] xpath_int
[ ] xpath_long
[ ] xpath_number
[ ] xpath_short
[ ] xpath_string
[ ] unbase64 (WIP, @fyp711)
[ ] decode (partially supported if translated to caseWhen. WIP Cody)
[ ] initcap (WIP, velox PR: 8676)
[x] unix_date (velox PR 8725, completed)
[ ] count_min_sketch
[x] bool_and/every (@mskapilks)
[x] bool_or/any/some (@mskapilks)
[x] shuffle (completed)
[x] bround (@xumingming)
[x] format_string (@gaoyangxiaozhu)
[x] format_number (@gaoyangxiaozhu)
[x] soundex (@zhli1142015)
[x] levenshtein (@zhli1142015)
[x] cot (@honeyhexin)
[x] expm1 (@Donvi)
[x] stack (generator function, @xumingming)
[x] randn (@Donvi)
[x] empty2null (internal function, @jinchengchenghh)
[x] toprettystring (internal function, @jinchengchenghh)
[x] AtLeastNNonNulls (internal function, @zhli1142015)
Since Spark-3.3 (related to ML, low priority)
[ ] regr_count
[ ] regr_avgx
[ ] regr_avgy
[x] regr_r2
[ ] regr_sxx
[x] regr_sxy
[ ] regr_syy
[ ] regr_slope
[ ] regr_intercept
Since Spark-3.3
Since Spark-3.4
[ ] mode
[x] get (@Yohahaha)
[x] array_append (@ivoson)
[x] array_insert (@ivoson)
[x] mode (@zhli1142015)

Yohahaha commented 10 months ago

I'd like to support hex and unhex.

update: hex and unhex are already supported in Gluten.

zwangsheng commented 10 months ago

Hi, I'd like to give the hour function a try.

konjac commented 10 months ago

Hi, I'd like to have a look into map_keys

fyp711 commented 10 months ago

Hi I'd like to support find_in_set in velox

HannanKan commented 10 months ago

Hi, I'd like to support date_trunc/trunc.

JkSelf commented 10 months ago

Hi, I'd like to support dense_rank.

JkSelf commented 10 months ago

dense_rank already supported in velox https://github.com/facebookincubator/velox/pull/6289.

zhztheplayer commented 10 months ago

[ ] percentile_approx

[ ] approx_percentile: "Third argument accuracy is different with velox, velox is double but spark is long"

The two stand for the same function I assume? I'll take these two if nobody is working on it.
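To make the type mismatch concrete, here is a minimal Python sketch of one possible bridge between the two accuracy arguments. It assumes Spark's documented reading of the integral accuracy, where 1.0/accuracy is the relative error, and assumes the Velox side wants a fractional error bound as a double. The helper name is hypothetical and is not part of Gluten or Velox.

```python
def spark_accuracy_to_relative_error(accuracy: int) -> float:
    """Convert a Spark-style integral accuracy into a fractional error bound.

    Assumption: Spark treats 1.0 / accuracy as the relative error, and the
    Velox-side function expects that fraction as a double.
    """
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    return 1.0 / accuracy


# Spark's default accuracy of 10000 corresponds to a relative error of 0.0001.
default_error = spark_accuracy_to_relative_error(10000)
```

Whether this conversion is semantically exact for the Velox function would still need checking against its implementation.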

PHILO-HE commented 10 months ago

> [ ] percentile_approx
>
> [ ] approx_percentile: "Third argument accuracy is different with velox, velox is double but spark is long"
>
> The two stand for the same function I assume? I'll take these two if nobody is working on it.

Yes, they are one thing. Just unify them into one checkbox. Thanks!

JkSelf commented 10 months ago

I will take a look at the ntile window function.

zhouyuan commented 10 months ago

unbase64: https://github.com/oap-project/gluten/pull/4482

zjuwangg commented 10 months ago

Is there any plan to support the from_json function?

yma11 commented 9 months ago

I'd like to take map_entries and map_from_entries. There are already Presto implementations in Velox, but I will need to check their consistency with Spark.

acvictor commented 9 months ago

I'd like to give date_from_unix_date a shot

PHILO-HE commented 9 months ago

Just removed the below functions from the list, since they have been supported. Thanks! @acvictor, @Yohahaha, @fyp711, @zwangsheng, @JkSelf, etc.

to_date hour mod pow ifnull add_months next_day dense_rank find_in_set hex ntile
date_from_unix_date array_repeat array_position array_except array_distinct weekday
year month day

acvictor commented 9 months ago

@PHILO-HE I see support for year, month, day, last_day in Velox too. I can also give from_utc_timestamp a go.

Surbhi-Vijay commented 9 months ago

nullif is supported out of the box. Spark sends the converted expression as an If expression, which is supported in Gluten.
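As an illustration of that rewrite, here is a minimal Python sketch that models nullif(a, b) as an If over an equality test. The function names are hypothetical stand-ins for the expressions, not Spark or Gluten APIs, and None plays the role of SQL NULL.

```python
def if_expr(condition, then_value, else_value):
    """Model of an If expression: a non-true condition picks the else branch."""
    return then_value if condition else else_value


def nullif(left, right):
    """Model of nullif(left, right) as If(left = right, NULL, left)."""
    # In SQL, comparing with NULL is never true, so the else branch is taken.
    is_equal = left is not None and right is not None and left == right
    return if_expr(is_equal, None, left)
```

For example, nullif(1, 1) yields None, while nullif(1, 2) yields 1.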

PHILO-HE commented 9 months ago

> nullif is supported out of the box. Spark sends the converted expression as an If expression, which is supported in Gluten.

Thanks so much for your feedback! Just removed it from the list.

acvictor commented 9 months ago

> @PHILO-HE I see support for year, month, day, last_day in Velox too. I can also give from_utc_timestamp a go.

Will do minute as well.

rui-mo commented 9 months ago

I'd like to work on locate and arrayintersect.

mskapilks commented 8 months ago

I would like to work on bool_and, bool_or

zhztheplayer commented 8 months ago

[x] collect_list (velox supported, needs Gluten to enable array for project plan node)

[x] collect_set

@PHILO-HE Should we uncheck these two? I ran a test and both functions fell back (in 3.3).

Surbhi-Vijay commented 8 months ago

I would like to give printf a try.

mskapilks commented 8 months ago

> I would like to work on bool_and, bool_or

It seems these are already supported. bool_and, bool_or, every, and some all get converted to min or max of a boolean column.
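A small Python sketch of why that conversion works: with False ordered before True, the minimum of a boolean column is true only if every value is true, and the maximum is true if any value is true. The function names below are illustrative only, and None stands in for SQL NULL, which aggregates skip.

```python
def bool_and(values):
    """bool_and / every modeled as min over the non-null booleans."""
    non_null = [v for v in values if v is not None]
    return min(non_null) if non_null else None


def bool_or(values):
    """bool_or / any / some modeled as max over the non-null booleans."""
    non_null = [v for v in values if v is not None]
    return max(non_null) if non_null else None
```

An empty or all-null input yields None here, mirroring how min and max behave on such input.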

yma11 commented 8 months ago

@PHILO-HE I would like to take map_filter. BTW, map_entries is completed by PR.

supermem613 commented 8 months ago

@PHILO-HE , I'd like to pick up base64 and unbase64, please.

(FYI, looks like there was a PR above for unbase64, but it seems to have been closed without committing ~45-55 days ago, so hopefully I am not conflicting with any work).

PHILO-HE commented 7 months ago

> @PHILO-HE , I'd like to pick up base64 and unbase64, please.
>
> (FYI, looks like there was a PR above for unbase64, but it seems to have been closed without committing ~45-55 days ago, so hopefully I am not conflicting with any work).

Hi @supermem613, sorry for the late reply. I note Gluten PR https://github.com/apache/incubator-gluten/pull/5242 is trying to re-use Velox's existing from_base64 function (proposed for prestosql) for unbase64. Not sure whether we can map base64 to some other function. If there is no semantic difference, we can just re-use the existing Velox functions.

PHILO-HE commented 7 months ago

Just removed the below supported functions from the above list. Thanks for the contribution! last_day, unhex, lead, lag, minute, second, map_keys

ivoson commented 7 months ago

Hi @PHILO-HE I'd like to take filter (array filter), thanks.

Yohahaha commented 7 months ago

I'd like to take the percentile aggregate function.

supermem613 commented 7 months ago

Hey, @PHILO-HE , what's the plan with concat_ws ? It says "PR Ready" and I see that you have committed an implementation to this branch: https://github.com/oap-project/velox/commit/c5eec030464970b83389c598354c0da4c8fb25ef. Is that the PR being referenced? Is it planned to be merged into velox main?

PHILO-HE commented 7 months ago

> Hey, @PHILO-HE , what's the plan with concat_ws ? It says "PR Ready" and I see that you have committed an implementation to this branch: oap-project/velox@c5eec03. Is that the PR being referenced? Is it planned to be merged into velox main?

Hi @supermem613, that commit is not used by Gluten main branch. We have another implementation for upstream velox: https://github.com/facebookincubator/velox/pull/8854. It is still under review. I will try to push the progress.

Yohahaha commented 7 months ago

I'd like to take the get function, also known as GetArrayItem.
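For reference, a minimal Python sketch of the semantics I understand Spark's get to have: a 0-based index, with NULL returned for any index outside the array bounds instead of an error. This is only a model with None standing in for NULL, not the Spark or Velox implementation.

```python
def get(array, index):
    """Model of get(array, index): 0-based, None when out of bounds."""
    if array is None or index is None:
        return None
    # Unlike Python indexing, a negative index does not wrap around here.
    if index < 0 or index >= len(array):
        return None
    return array[index]
```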

Yohahaha commented 7 months ago

I'd like to take the transform function.

lyy-pineapple commented 7 months ago

@PHILO-HE hello, I'd like to take the forall function.

ivoson commented 7 months ago

I'd like to take flatten function.

acvictor commented 7 months ago

I'd like to try array_size.

lyy-pineapple commented 7 months ago

> @PHILO-HE hello, I'd like to take the forall function.

I'd also like to support exists (array).

gaoyangxiaozhu commented 7 months ago

Hey @zhouyuan, could you help add format_string and format_number to the list? I would take them later.

PHILO-HE commented 7 months ago

> Hey @zhouyuan, could you help add format_string and format_number to the list? I would take them later.

@gaoyangxiaozhu, just added them into the list. Thanks!

zhli1142015 commented 7 months ago

I'd like to take soundex and levenshtein, thanks.

honeyhexin commented 6 months ago

I'd like to take cot, thanks.

Donvi commented 6 months ago

I'd like to take the math function expm1 and am already working on it.

gaoyangxiaozhu commented 6 months ago

PR for width_bucket support: https://github.com/apache/incubator-gluten/pull/5634. It looks like a Velox-side change is still needed to support the case bucket_number <= 0; I will send a PR to the Velox repository to fix it.
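To show the edge case in question, here is a hedged Python sketch of width_bucket as I understand Spark's behavior. The assumption, not confirmed in this thread, is that Spark returns NULL when the bucket count is not positive, when min equals max, or when an input is NaN or an infinite bound. None stands in for NULL.

```python
import math


def width_bucket(value, min_value, max_value, num_buckets):
    """Model of width_bucket; None for the inputs assumed to yield NULL."""
    if num_buckets <= 0:
        return None  # the bucket_number <= 0 case discussed above
    if math.isnan(value) or min_value == max_value:
        return None
    if any(math.isnan(b) or math.isinf(b) for b in (min_value, max_value)):
        return None
    lower, upper = min(min_value, max_value), max(min_value, max_value)
    if min_value < max_value:
        if value < lower:
            return 0
        if value >= upper:
            return num_buckets + 1
        return int(num_buckets * (value - lower) / (upper - lower)) + 1
    # Reversed bounds: buckets are numbered from the upper bound downwards.
    if value > upper:
        return 0
    if value <= lower:
        return num_buckets + 1
    return int(num_buckets * (upper - value) / (upper - lower)) + 1
```

Values below the range land in bucket 0 and values at or above it land in bucket num_buckets + 1.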

ivoson commented 6 months ago

I'd like to implement array_append and array_insert for spark 3.4+

xumingming commented 6 months ago

I'd like to take a look at the stack function. It seems to be a Generator, meaning one row of input might return multiple rows of output. Does Velox have this generator ability?
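To illustrate the one-row-in, many-rows-out behavior, here is a minimal Python sketch of stack(n, expr1, ..., exprk) as I understand Spark's semantics: the k values are split into n rows, with missing trailing cells filled with NULL. This is a model only, with None standing in for NULL.

```python
def stack(n, *values):
    """Model of stack: spread the values over n rows, padding with None."""
    if n <= 0:
        raise ValueError("n must be positive")
    width = -(-len(values) // n)  # ceiling division: columns per row
    padded = list(values) + [None] * (n * width - len(values))
    return [tuple(padded[i * width:(i + 1) * width]) for i in range(n)]
```

For example, stack(2, 1, 2, 3) produces the two rows (1, 2) and (3, None).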

marin-ma commented 6 months ago

> I'd like to take a look at the stack function. It seems to be a Generator, meaning one row of input might return multiple rows of output. Does Velox have this generator ability?

@xumingming Currently, 4 generator functions are supported: explode, pos_explode, inline and json_tuple. The approach is to create a ProjectNode + UnnestNode + ProjectNode pattern in the Velox pipeline. But it seems the stack function cannot use this pattern. Perhaps we can build another pipeline by leveraging the ExpandNode in Velox (not sure if this approach really works).
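As a rough mental model of the project + unnest + project pattern, here is a toy Python sketch for a posexplode-style generator over rows represented as dicts. It is not how Gluten builds the Velox plan; the helper names, the "id" column, and the row representation are all assumptions made for illustration.

```python
def unnest(rows, column, with_position=False):
    """Toy unnest: emit one output row per element of the array column."""
    out = []
    for row in rows:
        for pos, element in enumerate(row[column] or []):
            expanded = {k: v for k, v in row.items() if k != column}
            expanded["element"] = element
            if with_position:
                expanded["pos"] = pos
            out.append(expanded)
    return out


def posexplode_pipeline(rows, array_column):
    """Toy project -> unnest -> project pipeline."""
    # First project: evaluate the generator's argument into an array column.
    projected = [{"id": r["id"], "arr": r[array_column]} for r in rows]
    # Unnest: expand each array into one row per element, with its position.
    unnested = unnest(projected, "arr", with_position=True)
    # Second project: select and order the output columns.
    return [(r["id"], r["pos"], r["element"]) for r in unnested]
```

A row whose array is empty produces no output rows in this model.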

xumingming commented 6 months ago

@marin-ma Thanks for the advice, I will take a look.

NEUpanning commented 6 months ago

I'd like to take unix_date, thanks.

PHILO-HE commented 6 months ago

> I'd like to take unix_date, thanks.

@NEUpanning, we have supported it in both Gluten & Velox. Just changed its state in the list. Thanks! https://github.com/apache/incubator-gluten/pull/5287 https://github.com/facebookincubator/velox/pull/8725
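For readers unfamiliar with the function, a minimal Python sketch of the semantics: unix_date returns the number of days since 1970-01-01, and date_from_unix_date (mentioned earlier in this thread) is its inverse. This models the behavior only and is not the Gluten or Velox code.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def unix_date(d: date) -> int:
    """Number of days from 1970-01-01 to d (negative before the epoch)."""
    return (d - EPOCH).days


def date_from_unix_date(days: int) -> date:
    """Inverse of unix_date: the date that many days after 1970-01-01."""
    return EPOCH + timedelta(days=days)
```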

apache / incubator-gluten

[VL] Unsupported spark function list [please leave a comment if you plan to pick some] #4039
