Open liujiayi771 opened 7 months ago
@liujiayi771 do you know whether collect_set is expected not to work with complex types when a field value is null? For example, this works with vanilla Spark but fails when Gluten is enabled:
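import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._ // needed for .toDS outside spark-shell

// One record whose struct column contains a null field (lastUpdated).
val jsonStr = """{"txn":{"appId":"txnId","version":0,"lastUpdated":null}}"""
val jsonSchema = StructType(Seq(StructField("txn",
  StructType(Seq(
    StructField("appId", StringType, true),
    StructField("lastUpdated", LongType, true),
    StructField("version", LongType, true))), true)))
// collect_set over the struct column triggers the failure when Gluten is enabled.
val df = spark.read.schema(jsonSchema).json(Seq(jsonStr).toDS).select(collect_set(col("txn")))
df.head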
Error:
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (c7f5 executor driver): org.apache.gluten.exception.GlutenException: java.lang.RuntimeException: Exception: VeloxUserError
[info] Error Source: USER
[info] Error Code: INVALID_ARGUMENT
[info] Reason: ROW comparison not supported for values that contain nulls
[info] Retriable: False
[info] Expression: !decoded.base()->containsNullAt(indices[index])
[info] Function: checkNestedNulls
[info] File: /__w/1/s/Velox/velox/functions/lib/CheckNestedNulls.cpp
[info] Line: 34
@felipepessoto This is a known issue; the Velox backend does not yet support it.
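For readers wondering why a null inside a struct matters here: one plausible reading of "ROW comparison not supported for values that contain nulls" is that deduplication needs row equality, and under SQL-style three-valued logic that equality is undefined when a field is NULL. The sketch below is plain Scala and only models that idea. It is not Velox or Gluten code, and the names RowNullComparison and sqlEquals are made up for illustration.

```scala
// Illustrative model only: not Velox or Gluten code.
object RowNullComparison {
  // A row is modeled as a sequence of nullable fields (None stands for SQL NULL).
  type Row = Seq[Option[Any]]

  // SQL-style three-valued equality: Some(true), Some(false), or None (unknown).
  def sqlEquals(a: Row, b: Row): Option[Boolean] = {
    require(a.length == b.length, "rows must have the same arity")
    val fieldResults: Seq[Option[Boolean]] = a.zip(b).map {
      case (Some(x), Some(y)) => Some(x == y)
      case _                  => None // a NULL operand makes the field comparison unknown
    }
    if (fieldResults.contains(Some(false))) Some(false) // a definite mismatch decides it
    else if (fieldResults.contains(None)) None          // otherwise a NULL leaves it unknown
    else Some(true)
  }

  def main(args: Array[String]): Unit = {
    // Mirrors the txn struct from the report: appId, lastUpdated, version.
    val txn: Row = Seq(Some("txnId"), None, Some(0L))
    println(sqlEquals(txn, txn)) // prints None: the row cannot be shown equal to itself
  }
}
```

Under this model a set aggregation cannot decide whether two such rows are duplicates, which would be consistent with the backend rejecting the input instead of guessing.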
Description
collect_list
collect_set
set_agg
Exclude UTs: collect_list in vanilla Spark returns an empty array, but array_agg in Velox returns null.
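To make the mismatch behind the excluded UTs concrete, here is a small plain-Scala model of the two result shapes. It assumes the difference shows up when a group has no values to collect, which the description above does not state explicitly. The names EmptyAggregation, collectListLike, arrayAggLike and alignWithSpark are hypothetical and do not exist in Spark, Gluten or Velox.

```scala
// Illustrative model only: these helpers are hypothetical, not Spark or Velox APIs.
object EmptyAggregation {
  // Models the vanilla Spark result described above: no values gives an empty array.
  def collectListLike[T](values: Seq[T]): Seq[T] = values

  // Models the Velox result described above: no values gives NULL (None here).
  def arrayAggLike[T](values: Seq[T]): Option[Seq[T]] =
    if (values.isEmpty) None else Some(values)

  // One possible way to reconcile the two: treat a NULL result as an empty array.
  def alignWithSpark[T](result: Option[Seq[T]]): Seq[T] =
    result.getOrElse(Seq.empty)

  def main(args: Array[String]): Unit = {
    val noRows = Seq.empty[Long]
    println(collectListLike(noRows))              // empty sequence
    println(arrayAggLike(noRows))                 // None
    println(alignWithSpark(arrayAggLike(noRows))) // empty sequence again
  }
}
```

A unit test that asserts on an empty array would fail against the second shape unless something like the last step is applied, which is presumably why those UTs are excluded for now.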