NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0
797 stars 232 forks source link

[FEA] Support array_sort #5227

Open viadea opened 2 years ago

viadea commented 2 years ago

I wish we can support array_sort

eg:

from pyspark.sql.functions import *
df = spark.createDataFrame([(["a", "b", "a"], ["b", "c"]), (["a","a"], ["b", "c"]), (["aa"], ["b", "c"])    ], ['x', 'y'])
df.write.format("parquet").mode("overwrite").save("/tmp/testparquet")
df = spark.read.parquet("/tmp/testparquet")
df.select(array_sort(df.x).alias("sort")).collect()
    ! <ArraySort> array_sort(x#72, lambdafunction(if ((isnull(lambda left#87) AND isnull(lambda right#88))) 0 else if (isnull(lambda left#87)) 1 else if (isnull(lambda right#88)) -1 else if ((lambda left#87 < lambda right#88)) -1 else if ((lambda left#87 > lambda right#88)) 1 else 0, lambda left#87, lambda right#88, false)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.ArraySort
revans2 commented 2 years ago

This is going to be difficult because of the lambda. We would really need to have CUDF support this probably using AST, or we are going to have to special case the default ordering lambda expression and use the built in sort that exists.

revans2 commented 2 years ago

I filed https://github.com/rapidsai/cudf/issues/11162 as the majority of the CUDF request.

revans2 commented 2 years ago

I filed https://github.com/rapidsai/cudf/issues/11163 to help us implement/run a number of these commands as written.

We really should be looking at doing some pattern matching as well. Especially for the default case, which I think is the most common.

ttnghia commented 1 year ago

Can we just partially support array_sort without lambda (i.e., only use the default comparison behavior)? By doing so we can just call libcudf sort_lists.

wjxiz1992 commented 11 months ago

+1, this expression is also used in Scale Test Query40.

update: I'm going to do what ttnghia said, partially support it without lambda first.