NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] non-utc integration tests failing in json_test.py #11481

Closed abellina closed 1 month ago

abellina commented 1 month ago

We are seeing some integration test failures when the timezone isn't UTC:

[2024-09-18T14:17:46.633Z] FAILED ../../src/main/python/json_test.py::test_basic_from_json[yyyy-MM-dd-false-false-false-StructType(List(StructField(number,DateType,true)))-floats_edge_cases.json][DATAGEN_SEED=1726661844, TZ=Canada/Newfoundland, APPROXIMATE_FLOAT, ALLOW_NON_GPU(FileSourceScanExec,BatchScanExec,FileSourceScanExec)] - pyspark.sql.utils.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec
[2024-09-18T14:17:46.633Z] FAILED ../../src/main/python/json_test.py::test_basic_from_json[yyyy-MM-dd-false-false-false-StructType(List(StructField(number,DateType,true)))-decimals.json][DATAGEN_SEED=1726661844, TZ=Canada/Newfoundland, INJECT_OOM, APPROXIMATE_FLOAT, ALLOW_NON_GPU(FileSourceScanExec,BatchScanExec,FileSourceScanExec)] - pyspark.sql.utils.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec
[2024-09-18T14:17:46.633Z] FAILED ../../src/main/python/json_test.py::test_basic_from_json[yyyy-MM-dd-false-false-false-StructType(List(StructField(number,DateType,true)))-dates.json][DATAGEN_SEED=1726661844, TZ=Canada/Newfoundland, APPROXIMATE_FLOAT, ALLOW_NON_GPU(FileSourceScanExec,BatchScanExec,FileSourceScanExec)] - pyspark.sql.utils.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec
[2024-09-18T12:54:33.809Z] !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
[2024-09-18T12:54:33.809Z]   @Expression <Alias> value#105916 AS json#105918 could run on GPU
[2024-09-18T12:54:33.809Z]     @Expression <AttributeReference> value#105916 could run on GPU
[2024-09-18T12:54:33.809Z]   @Expression <Alias> from_json(StructField(number,BooleanType,true), (allowNumericLeadingZeros,true), (allowNonNumericNumbers,true), value#105916, Some(Canada/Newfoundland)) AS from_json(json)#105920 could run on GPU
[2024-09-18T12:54:33.809Z]     !Expression <JsonToStructs> from_json(StructField(number,BooleanType,true), (allowNumericLeadingZeros,true), (allowNonNumericNumbers,true), value#105916, Some(Canada/Newfoundland)) cannot run on GPU because class org.apache.spark.sql.catalyst.expressions.JsonToStructs is not supported with timezone settings: (JVM: Canada/Newfoundland, session: Canada/Newfoundland). Set both of the timezones to UTC to enable class org.apache.spark.sql.catalyst.expressions.JsonToStructs support
[2024-09-18T12:54:33.809Z]       @Expression <AttributeReference> value#105916 could run on GPU
[2024-09-18T12:54:33.809Z]   !Exec <FileSourceScanExec> cannot run on GPU because unsupported file format: org.apache.spark.sql.execution.datasources.text.TextFileFormat
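
For triage, the explain output above can be filtered down to just the nodes marked with `!`, which are the ones the plugin will not place on the GPU. The following is a minimal sketch, not part of the spark-rapids test suite: the helper name `blocked_nodes` is hypothetical, the regular expression assumes the log layout shown in this issue, and the `from_json(...)` arguments in the sample are abbreviated for readability.

```python
import re

# Matches lines such as:
#   [ts]  !Expression <JsonToStructs> from_json(...) cannot run on GPU because <reason>
# Lines starting with "@" ("could run on GPU") are intentionally ignored.
BLOCKED = re.compile(
    r"!(?P<kind>Exec|Expression) <(?P<name>\w+)> .*?"
    r"cannot run on GPU because (?P<reason>.*)$"
)


def blocked_nodes(log_text):
    """Return (kind, name, reason) for every plan node flagged with '!'."""
    found = []
    for line in log_text.splitlines():
        match = BLOCKED.search(line)
        if match:
            found.append((match["kind"], match["name"], match["reason"].strip()))
    return found


# Sample taken from the log in this issue; from_json(...) is abbreviated here.
sample = """\
[2024-09-18T12:54:33.809Z] !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
[2024-09-18T12:54:33.809Z]   @Expression <Alias> value#105916 AS json#105918 could run on GPU
[2024-09-18T12:54:33.809Z]     !Expression <JsonToStructs> from_json(...) cannot run on GPU because class org.apache.spark.sql.catalyst.expressions.JsonToStructs is not supported with timezone settings: (JVM: Canada/Newfoundland, session: Canada/Newfoundland).
[2024-09-18T12:54:33.809Z]   !Exec <FileSourceScanExec> cannot run on GPU because unsupported file format: org.apache.spark.sql.execution.datasources.text.TextFileFormat
"""

for kind, name, reason in blocked_nodes(sample):
    print(f"{kind} {name}: {reason}")
```

Read this way, the `ProjectExec` fallback is a consequence of the `JsonToStructs` timezone restriction, and `ProjectExec` is not in the `ALLOW_NON_GPU` list reported for the failing tests.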

abellina commented 1 month ago

@revans2 FYI

revans2 commented 1 month ago

@abellina I didn't touch anything related to that. I will see if I can repro it and do a git bisect to find out.

revans2 commented 1 month ago

Sorry, I was wrong: https://github.com/NVIDIA/spark-rapids/pull/11464 added that in. I didn't remember adding it, but I guess I did, so I'll fix this.