mythrocks opened 1 year ago
This is interesting. Here is the "failing" read plan:
GpuColumnarToRow false
+- GpuProject [foo#24, gpuconcat(", foo#24, ") AS concat(", foo, ")#21, length(foo#24) AS length(foo)#22]
   +- GpuRowToColumnar targetsize(1073741824)
      +- *(1) Project [staticinvoke(class org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, readSidePadding, foo#20, 3, true, false, true) AS foo#24]
         +- GpuColumnarToRow false
            +- GpuFileGpuScan orc spark_catalog.default.foobar[foo#20] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex[file:/tmp/foobar], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<foo:string>
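For reference, a query shaped roughly like the following (reconstructed from the project expressions in the plan, not taken from the original report) would produce this kind of plan:

```scala
// Hypothetical query, inferred from the GpuProject expressions above
// (foo, concat('"', foo, '"'), length(foo)); not copied from the original report.
spark.sql("""SELECT foo, concat('"', foo, '"') AS quoted, length(foo) AS len FROM foobar""").show()
```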
The read isn't strictly failing on the GPU. The GPU read presents right-trimmed strings (as in Spark 3.3). The CPU then adds the spaces back, padding to the expected width, via CharVarcharCodegenUtils.readSidePadding().
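Conceptually, that read-side padding just right-pads each value to the declared CHAR width; a minimal sketch (not the actual Spark implementation) is:

```scala
// Sketch of read-side CHAR padding; the real logic lives in
// org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.readSidePadding.
def readSidePad(value: String, charWidth: Int): String =
  if (value.length >= charWidth) value
  else value + " " * (charWidth - value.length)

readSidePad("22", 3) // "22 " -- the trimmed CHAR(3) value gets its space back
```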
The worst of this is that it falls off the GPU (column -> row -> column), and then does the string padding on the CPU.
We don't produce bad reads, but we could go much faster if we simply presented what the CUDF reader reads. We would then have to intercept the code-gen, though, and that might be a bit of work.
I've set this to low priority. There is no data corruption, bad read, etc.
This issue is not limited to ORC; it also applies to Parquet and any other supported storage format. The behaviour is controlled by https://github.com/apache/spark/blob/7a1608bbc3f1dfd7ffd1f9dc762cb369f47a8d43/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L4628-L4634
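For reference, the linked SQLConf entry appears to be the read-side CHAR padding flag (spark.sql.readSideCharPadding in Spark 3.4; the exact key is inferred from the link, not from the original report). Disabling it restores the trimmed, pre-3.4 behaviour:

```scala
// Config key inferred from the linked SQLConf lines; it defaults to true in Spark 3.4.
spark.conf.set("spark.sql.readSideCharPadding", "false")
```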
Spark 3.4 changed the semantics of reading CHAR columns from ORC files. Consider the following table:
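The original table definition did not come through here; a minimal illustrative setup, assuming a CHAR(3) column named foo in a table foobar (consistent with the padding width of 3 and the foobar/foo names in the plan above), would be:

```scala
// Hypothetical setup: the CHAR(3) width and the foobar/foo names are inferred
// from the plan above, not copied from the original report.
spark.sql("CREATE TABLE foobar (foo CHAR(3)) USING ORC")
spark.sql("INSERT INTO foobar VALUES ('1'), ('22'), ('333')")
spark.sql("SELECT foo, length(foo) AS len FROM foobar").show()
```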
When this data is read from Spark < 3.4, it returns:
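Using the illustrative data above, that would be roughly:

```
'1'   -> length 1
'22'  -> length 2
'333' -> length 3
```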
With Spark 3.4, this changes to:
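Again using the illustrative data above, roughly this, with values padded back to the declared width of 3:

```
'1  ' -> length 3
'22 ' -> length 3
'333' -> length 3
```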
It would be good to support the new behaviour with the Spark RAPIDS plugin.
(This is incidental fallout from #8321. This behaviour needs to be moved to Shims now.)
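One possible shape for that shim hook, purely as a sketch (the trait and method names here are hypothetical, not the plugin's actual Shim API):

```scala
// Hypothetical shim hook: the Spark 3.4 shim would report that read-side CHAR
// padding is applied on the CPU, so the GPU reader / overrides can match it.
trait CharPaddingShim {
  /** True on Spark versions (3.4+) that pad CHAR columns on read. */
  def readSideCharPaddingApplied: Boolean
}
```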