Open amahussein opened 5 months ago
The binary output format is controlled by the conf spark.sql.binaryOutputStyle.
Setting it to a non-default value such as BASE64 leads to discrepancies between CPU and GPU output:
scala> spark.conf.set("spark.sql.binaryOutputStyle", "BASE64")
scala> spark.read.parquet("/tmp/bf2.pq").printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
|-- b: array (nullable = true)
| |-- element: binary (containsNull = true)
scala> spark.read.parquet("/tmp/bf2.pq").show(truncate=false)
+-----------------------------------------+
|b |
+-----------------------------------------+
|[RWFzb24gWWFvIDIwMTgtMTEtMTc6MTM6MzM6MzM]|
+-----------------------------------------+
scala> spark.conf.set("spark.rapids.sql.enabled", true)
scala> spark.read.parquet("/tmp/bf2.pq").show(truncate=false)
24/07/15 05:46:03 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
@Partitioning <SinglePartition$> could run on GPU
*Exec <ProjectExec> will run on GPU
*Expression <Alias> toprettystring(b#131, Some(UTC)) AS toprettystring(b)#134 will run on GPU
*Expression <ToPrettyString> toprettystring(b#131, Some(UTC)) will run on GPU
*Exec <FileSourceScanExec> will run on GPU
+------------------------------------------------------------------------------------------+
|b |
+------------------------------------------------------------------------------------------+
|[[45 61 73 6F 6E 20 59 61 6F 20 32 30 31 38 2D 31 31 2D 31 37 3A 31 33 3A 33 33 3A 33 33]]|
+------------------------------------------------------------------------------------------+
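The two cells above appear to hold the same bytes rendered in two different styles: unpadded Base64 on the CPU and space-separated hex on the GPU. The following minimal sketch checks that using only java.util.Base64. The object name and both helper methods are hypothetical and written for illustration; they are not Spark or RAPIDS APIs, and the hex layout is simply copied from the GPU output shown above.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Base64

// Hypothetical helper object for illustration only; not part of Spark or the RAPIDS plugin.
object BinaryStyleCheck {
  // Render bytes as upper-case hex pairs separated by spaces, wrapped in brackets,
  // mimicking the inner "[45 61 ...]" layout seen in the GPU output.
  def toSpacedHex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02X").mkString("[", " ", "]")

  // Render bytes as Base64 without trailing '=' padding, as seen in the CPU output.
  def toBase64NoPadding(bytes: Array[Byte]): String =
    Base64.getEncoder().withoutPadding().encodeToString(bytes)

  def main(args: Array[String]): Unit = {
    // Cell contents copied from the two show() results in this issue.
    val cpuCell = "RWFzb24gWWFvIDIwMTgtMTEtMTc6MTM6MzM6MzM"
    val gpuCell =
      "[45 61 73 6F 6E 20 59 61 6F 20 32 30 31 38 2D 31 31 2D 31 37 3A 31 33 3A 33 33 3A 33 33]"

    // java.util.Base64's basic decoder accepts input without padding.
    val bytes = Base64.getDecoder().decode(cpuCell)

    println(new String(bytes, UTF_8))            // Eason Yao 2018-11-17:13:33:33
    println(toSpacedHex(bytes) == gpuCell)       // true
    println(toBase64NoPadding(bytes) == cpuCell) // true
  }
}
```

If both comparisons print true, the underlying data matches and only the presentation differs, which suggests the GPU path is not honoring spark.sql.binaryOutputStyle for this query.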
Describe the bug
This PR introduces a universal BinaryFormatter to make binary output consistent across all clients for both primitive and nested binaries.
The RAPIDS plugin may be affected by this change.
useHexFormatForBinary has been removed from case Cast.
useHexFormatForBinary has been removed from ToPrettyString.scala.
New methods are defined:
ToPrettyString.scala: val binaryFormatter: BinaryFormatter = ToStringBase.getBinaryFormatter
ToStringBase.scala: binaryFormatter: BinaryFormatter = UTF8String.fromBytes