NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0
805 stars 234 forks source link

[BUG] Support nested arrays in array_intersect(...) #6258

Open NVnavkumar opened 2 years ago

NVnavkumar commented 2 years ago

Is your feature request related to a problem? Please describe. The Spark RAPIDS accelerator should support array_intersect on nested array inputs.

Python example:

from pyspark.sql.types import *
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .master("local")\
    .appName("array_intersect")\
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")\
    .getOrCreate()

schema = StructType([
    StructField("a", ArrayType(ArrayType(IntegerType()))),
    StructField("b", ArrayType(ArrayType(IntegerType())))
])

data = [
    ([[1,2],[3,4],[5,6]], [[1,2],[3,4,5], [10,11]]),
    ([[10,12],[13,14],[5,6]], [[1,2],[3,4], [5,6]]),
    ([[20,25],[23,24],[25,26]], [[21,22],[23,24,25], [20,21]]),
]

df = spark.sparkContext.parallelize(data).toDF(schema=schema)
df.createOrReplaceTempView("array_intersect_table")

df = spark.sql("SELECT array_intersect(a,b) FROM array_intersect_table")
df.show()
df.collect()

The not-supported error message:

    !Expression <ArrayIntersect> array_intersect(a#0, b#1) cannot run on GPU because array1 expression AttributeReference a#0 (child ArrayType(IntegerType,true) is not supported); array2 expression AttributeReference b#1 (child ArrayType(IntegerType,true) is not supported); expression ArrayIntersect array_intersect(a#0, b#1) produces an unsupported type ArrayType(ArrayType(IntegerType,true),true)
ttnghia commented 2 years ago

Is this just a type check issue or more complicated than that?