Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

UUID type support in Spark is incomplete? #4038

Closed sauliusvl closed 1 year ago

sauliusvl commented 2 years ago

Consider a table created in Trino (it has a native UUID type that maps to UUID in Iceberg):

CREATE TABLE test (id UUID) WITH (location = 'hdfs://rr-hdpz1/user/iceberg/test', format = 'PARQUET');
INSERT INTO test VALUES (UUID '12151fd2-7586-11e9-8f9e-2a86e4085a59');

Everything looks good: I can query the table, and running parquet meta on the resulting file in HDFS suggests it is written correctly according to the Parquet specification (a fixed-length byte array of 16 bytes with the UUID logical type):

message table {
  optional fixed_len_byte_array(16) id (UUID) = 1;
}

Now I try to read it in Spark:

scala> spark.sql("select * from test").printSchema
root
 |-- id: string (nullable = true)

scala> spark.sql("select * from test").show(false)
22/02/03 12:16:18 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (bn-hdpz1.vinted.net executor 2): java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.unsafe.types.UTF8String ([B is in module java.base of loader 'bootstrap'; org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app')
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:46)

The same thing happens when trying to insert from Spark:

scala> spark.sql("insert into test values ('e9238c3e-3aa6-4668-aceb-f9507a8f8d59')")
22/02/03 12:42:29 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 4) (hx-hdpz2.vinted.net executor 3): java.lang.ClassCastException: class org.apache.spark.unsafe.types.UTF8String cannot be cast to class [B (org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app'; [B is in module java.base of loader 'bootstrap')
    at org.apache.iceberg.spark.data.SparkParquetWriters$ByteArrayWriter.write(SparkParquetWriters.java:291)

The docs seem to suggest that UUID should be converted to a string in Spark, but after reading the source code I don't see how this is supposed to work: the UUID type is simply mapped to String here, which is not enough, since the byte representation of a UUID can't be cast to a String directly.
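For illustration only (this is not Iceberg's actual code, and uuidBytesToString / uuidStringToBytes are hypothetical helper names), here is a minimal Scala sketch of the conversion a reader/writer would need: the 16-byte fixed array stored in Parquet has to be decoded into a java.util.UUID before it can be rendered as a string, and the string has to be re-encoded into 16 bytes on the write path.

import java.nio.ByteBuffer
import java.util.UUID

// Decode the 16-byte big-endian value stored in Parquet into its canonical string form.
def uuidBytesToString(bytes: Array[Byte]): String = {
  require(bytes.length == 16, s"expected 16 bytes, got ${bytes.length}")
  val buf = ByteBuffer.wrap(bytes)
  new UUID(buf.getLong, buf.getLong).toString
}

// Encode a UUID string back into the 16-byte representation the Parquet writer expects.
def uuidStringToBytes(str: String): Array[Byte] = {
  val uuid = UUID.fromString(str)
  val buf = ByteBuffer.allocate(16)
  buf.putLong(uuid.getMostSignificantBits)
  buf.putLong(uuid.getLeastSignificantBits)
  buf.array()
}

A plain cast between byte[] and UTF8String cannot perform this translation, which is exactly the ClassCastException shown above.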

The way I see it, Iceberg should either:

fredkastl commented 2 years ago

See also this related issue: #4581

rdblue commented 1 year ago

I think we just need to add a couple of classes to convert string/bytes for UUIDs and update the builders to use them. @nastra is looking into this.

nastra commented 1 year ago

I opened https://github.com/apache/iceberg/pull/7399 to address this for Spark.

nastra commented 1 year ago

This has been fixed by https://github.com/apache/iceberg/pull/7399 and also backported to Spark 3.3 and 3.2 with https://github.com/apache/iceberg/pull/7496 and https://github.com/apache/iceberg/pull/7497.

igniris87 commented 3 months ago

Hey @nastra, how can I use this fix https://github.com/apache/iceberg/pull/7399 with PySpark and AWS Glue?

nastra commented 3 months ago

> hey nastra how can I use this fix #7399 using pyspark and aws glue?

@igniris87 make sure to use Iceberg 1.3.0 or later, as that is the version the fix shipped in.
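For anyone wiring this up by hand, the following is a minimal sketch (in Scala, since the thread above uses spark-shell; the same settings can be passed as job parameters or --conf options in a Glue/PySpark job) of pinning an Iceberg runtime that includes the fix. The catalog name glue_catalog, the warehouse path, and the Spark 3.3 / Scala 2.12 artifact coordinates are assumptions to adjust for your environment.

import org.apache.spark.sql.SparkSession

// Assumes Spark 3.3 / Scala 2.12; pick the iceberg-spark-runtime artifact matching your Spark version.
val spark = SparkSession.builder()
  .appName("iceberg-uuid-example")
  // Use an Iceberg release that contains the UUID fix (1.3.0 or later).
  .config("spark.jars.packages",
    "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0")
  .config("spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  // Hypothetical catalog name backed by the AWS Glue Data Catalog.
  .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
  .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/iceberg/")
  .getOrCreate()

// Note: GlueCatalog also needs the AWS SDK / Iceberg AWS dependencies on the classpath
// (already available on Glue jobs; add them yourself when running elsewhere).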