NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Support spark.sql.readSideCharPadding for CHAR columns with Spark 3.4+ #8324

Open mythrocks opened 1 year ago

mythrocks commented 1 year ago

Spark 3.4 changed the semantics of reading CHAR columns from ORC files.

Consider the following table:

CREATE TABLE foobar ( foo char(3) ) STORED AS ORCFILE LOCATION '/tmp/foobar';

INSERT INTO FOOBAR VALUES (""), ("0"), ("1 "), (" 1"), ("22"), ("4444"), (NULL);

When this data is read with Spark versions before 3.4, it returns:

  SELECT foo, CONCAT ('"', foo, '"'), LENGTH(foo) FROM foobar;
        ""      0
0       "0"     1
1       "1"     1
 1      " 1"    2
22      "22"    2
444     "444"   3
NULL    NULL    NULL

With Spark 3.4, this changes to:

  SELECT foo, CONCAT ('"', foo, '"'), LENGTH(foo) FROM foobar;
        "   "   3
0       "0  "   3
1       "1  "   3
 1      " 1 "   3
22      "22 "   3
444     "444"   3
NULL    NULL    NULL
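The Spark 3.4 output above can be reproduced outside Spark. The sketch below is a plain-Python approximation (not the plugin's or Spark's actual code; the helper name is illustrative) of what `CharVarcharCodegenUtils.readSidePadding` does: strings shorter than the declared CHAR length are right-padded with spaces, longer values pass through unchanged, and NULLs stay NULL.

```python
# Plain-Python sketch of Spark 3.4's read-side CHAR padding semantics.
# Conceptually mirrors CharVarcharCodegenUtils.readSidePadding; the helper
# name and structure here are illustrative, not Spark's implementation.

def read_side_padding(value, char_length):
    """Right-pad with spaces up to char_length; longer values and NULLs pass through."""
    if value is None:
        return None
    if len(value) < char_length:
        return value.ljust(char_length)
    return value

# The column values as a trimmed (Spark 3.3-style) read would present them:
trimmed = ["", "0", "1", " 1", "22", "444", None]
padded = [read_side_padding(v, 3) for v in trimmed]
# padded == ["   ", "0  ", "1  ", " 1 ", "22 ", "444", None]
```

Applying this per-row padding to the trimmed reads yields exactly the Spark 3.4 results shown above.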

It would be good to support the new behaviour with the Spark RAPIDS plugin.

(This is incidental fallout from #8321. This behaviour needs to be moved to Shims now.)

mythrocks commented 1 year ago

This is interesting. Here is the "failing" read plan:

GpuColumnarToRow false
+- GpuProject [foo#24, gpuconcat(", foo#24, ") AS concat(", foo, ")#21, length(foo#24) AS length(foo)#22]
   +- GpuRowToColumnar targetsize(1073741824)
      +- *(1) Project [staticinvoke(class org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, readSidePadding, foo#20, 3, true, false, true) AS foo#24]
         +- GpuColumnarToRow false
            +- GpuFileGpuScan orc spark_catalog.default.foobar[foo#20] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex[file:/tmp/foobar], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<foo:string>

The read isn't strictly failing on the GPU. The GPU read presents right-trimmed strings (as in Spark 3.3), and the CPU then adds the spaces back, padding each value to the expected CHAR width via CharVarcharCodegenUtils.readSidePadding().

The worst of it is that the query falls off the GPU (columnar -> row -> columnar), only to do string padding on the CPU.

We don't produce bad reads. But we could go much faster if we simply presented what the cuDF reader returns. We would then have to intercept the code-gen, though, which might be a bit of work.

mythrocks commented 1 year ago

I've set this to low priority. There is no data corruption, bad read, etc.

gerashegalov commented 4 months ago

This issue is not limited to ORC; it also affects Parquet and any other supported storage format. The behaviour is controlled by https://github.com/apache/spark/blob/7a1608bbc3f1dfd7ffd1f9dc762cb369f47a8d43/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L4628-L4634
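For reference, that configuration is `spark.sql.readSideCharPadding` (introduced in Spark 3.4, default `true`). Disabling it restores the trimmed, pre-3.4 reads, which can serve as a quick workaround check when reproducing this, e.g.:

```sql
-- Spark 3.4+: disabling read-side padding restores the pre-3.4 trimmed reads
SET spark.sql.readSideCharPadding=false;
SELECT foo, CONCAT('"', foo, '"'), LENGTH(foo) FROM foobar;
```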