NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[AUDIT] [SPARK-47911][SQL] Introduces a universal BinaryFormatter to make binary output consistent #10884

Open amahussein opened 5 months ago

amahussein commented 5 months ago

Describe the bug

This PR introduces a universal BinaryFormatter to make binary output consistent across all clients for both primitive and nested binaries.

The RAPIDS plugin may be affected by this change.

gerashegalov commented 4 months ago

The format is controlled by the conf spark.sql.binaryOutputStyle.

Setting it to a non-default value such as BASE64 leads to discrepancies between CPU and GPU: the CPU honors the setting, while the GPU still prints the space-separated hex form.

scala> spark.conf.set("spark.sql.binaryOutputStyle", "BASE64")

scala> spark.read.parquet("/tmp/bf2.pq").printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
 |-- b: array (nullable = true)
 |    |-- element: binary (containsNull = true)

scala> spark.read.parquet("/tmp/bf2.pq").show(truncate=false)
+-----------------------------------------+
|b                                        |
+-----------------------------------------+
|[RWFzb24gWWFvIDIwMTgtMTEtMTc6MTM6MzM6MzM]|
+-----------------------------------------+

scala> spark.conf.set("spark.rapids.sql.enabled", true)

scala> spark.read.parquet("/tmp/bf2.pq").show(truncate=false)
24/07/15 05:46:03 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> toprettystring(b#131, Some(UTC)) AS toprettystring(b)#134 will run on GPU
      *Expression <ToPrettyString> toprettystring(b#131, Some(UTC)) will run on GPU
    *Exec <FileSourceScanExec> will run on GPU

+------------------------------------------------------------------------------------------+
|b                                                                                         |
+------------------------------------------------------------------------------------------+
|[[45 61 73 6F 6E 20 59 61 6F 20 32 30 31 38 2D 31 31 2D 31 37 3A 31 33 3A 33 33 3A 33 33]]|
+------------------------------------------------------------------------------------------+
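
To confirm that the two outputs above are the same bytes rendered in two styles (and not a data difference), here is a minimal sketch in plain Scala with no Spark dependency. The object and helper names (`BinaryStyles`, `toHexDiscrete`, `toBase64`) are made up for illustration and are not part of Spark or the plugin; the unpadded Base64 encoder is an assumption chosen only because the CPU cell above carries no trailing `=`.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

// Hypothetical helpers for illustration only; not Spark or RAPIDS APIs.
object BinaryStyles {
  // Space-separated upper-case hex wrapped in brackets, as in the GPU output.
  def toHexDiscrete(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02X").mkString("[", " ", "]")

  // Base64 without trailing '=' padding (assumption: matches the CPU cell above).
  def toBase64(bytes: Array[Byte]): String =
    Base64.getEncoder().withoutPadding().encodeToString(bytes)

  // The array element printed by the CPU run with binaryOutputStyle=BASE64.
  val cpuCell: String = "RWFzb24gWWFvIDIwMTgtMTEtMTc6MTM6MzM6MzM"

  // java.util.Base64's basic decoder accepts input without padding.
  val bytes: Array[Byte] = Base64.getDecoder().decode(cpuCell)

  // The same bytes read as UTF-8 text.
  val decoded: String = new String(bytes, StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    println(decoded)              // underlying value
    println(toHexDiscrete(bytes)) // compare with the inner brackets of the GPU cell
    println(toBase64(bytes))      // compare with the CPU cell
  }
}
```

If the hex line matches the inner `[...]` of the GPU cell and the Base64 line matches the CPU cell, the discrepancy is purely in how ToPrettyString formats binary on the GPU.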