Open Feng-Jiang28 opened 1 month ago
Spark is producing an incorrect answer. Why is it only returning a middle column? Also, why is it showing a value for all rows and ignoring the p=2 filter? There is an issue with our code that we need to look into, but I want to be sure that there is not something horribly wrong in Spark too. What version of Spark did you use for this test?
Okay, I just ran the query on Spark 3.4.2 and I get the answer that I would expect.
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.2
      /_/
Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 1.8.0_422)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val dataSourceName = "parquet"
dataSourceName: String = parquet
scala> val path = ".../contacts"
path: String = .../contacts
scala> val schema = ("`id` INT,`name` STRUCT<`first`: STRING, `middle`: STRING, `last`: STRING>, " +
| "`address` STRING,`pets` INT,`friends` ARRAY<STRUCT<`first`: STRING, `middle`: STRING, " +
| "`last`: STRING>>,`relatives` MAP<STRING, STRUCT<`first`: STRING, `middle`: STRING, " +
| "`last`: STRING>>,`employer` STRUCT<`id`: INT, `company`: STRUCT<`name`: STRING, " +
| "`address`: STRING>>,`relations` MAP<STRUCT<`first`: STRING, `middle`: STRING, " +
| "`last`: STRING>,STRING>,`p` INT")
schema: String = `id` INT,`name` STRUCT<`first`: STRING, `middle`: STRING, `last`: STRING>, `address` STRING,`pets` INT,`friends` ARRAY<STRUCT<`first`: STRING, `middle`: STRING, `last`: STRING>>,`relatives` MAP<STRING, STRUCT<`first`: STRING, `middle`: STRING, `last`: STRING>>,`employer` STRUCT<`id`: INT, `company`: STRUCT<`name`: STRING, `address`: STRING>>,`relations` MAP<STRUCT<`first`: STRING, `middle`: STRING, `last`: STRING>,STRING>,`p` INT
scala> spark.read.format(dataSourceName).schema(schema).load(path).createOrReplaceTempView("contacts")
scala> val query = spark.sql("select name.middle, address from contacts where p=2")
query: org.apache.spark.sql.DataFrame = [middle: string, address: string]
scala> query.show()
+------+---------------+
|middle|        address|
+------+---------------+
|  null|567 Maple Drive|
|  null|6242 Ash Street|
+------+---------------+
@Feng-Jiang28 was that a copy/paste error? Or were you using a different version of Spark?
@revans2 It was a copy/paste error, thanks for pointing it out.
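To avoid copy/paste mix-ups when comparing the two engines, the CPU and GPU results could be collected in the same spark-shell session and compared directly. The following is only a minimal sketch, assuming the RAPIDS plugin is already loaded in the session and the `contacts` view from the transcript above is registered. The helper names `sameRows`, `collectWith`, and `cpuGpuAgree` are hypothetical and not part of Spark or spark-rapids; `spark.rapids.sql.enabled` is the plugin setting that turns GPU SQL execution on or off.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: order-insensitive comparison of collected rows.
// Spark does not guarantee row order without an ORDER BY, so compare as multisets.
def sameRows(a: Seq[Seq[Any]], b: Seq[Seq[Any]]): Boolean = {
  def counts(rows: Seq[Seq[Any]]): Map[Seq[Any], Int] =
    rows.groupBy(identity).map { case (row, dups) => row -> dups.size }
  counts(a) == counts(b)
}

// Hypothetical helper: run `sql` with GPU SQL execution switched on or off.
// Assumes the RAPIDS plugin is already on the classpath and enabled via spark.plugins.
def collectWith(spark: SparkSession, sql: String, gpu: Boolean): Seq[Seq[Any]] = {
  spark.conf.set("spark.rapids.sql.enabled", gpu.toString)
  spark.sql(sql).collect().toSeq.map(_.toSeq)
}

// Hypothetical helper: true when the CPU and GPU runs return the same rows.
def cpuGpuAgree(spark: SparkSession, sql: String): Boolean =
  sameRows(collectWith(spark, sql, gpu = false), collectWith(spark, sql, gpu = true))
```

In spark-shell this could be called as `cpuGpuAgree(spark, "select name.middle, address from contacts where p=2")`; a `false` result would point at a difference between the two execution paths rather than at a transcription error.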
Description:
This bug is similar to https://github.com/NVIDIA/spark-rapids/issues/11619. The contacts Parquet table is defined as follows and is saved here: contacts.zip
Code to reproduce:
Spark:
Rapids: