Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

UUID type support in Spark is incomplete? #4038

Closed sauliusvl closed 1 year ago

sauliusvl commented 2 years ago

Consider a table created in Trino (it has a native UUID type that maps to UUID in Iceberg):

CREATE TABLE test (id UUID) WITH (location = 'hdfs://rr-hdpz1/user/iceberg/test', format = 'PARQUET');
INSERT INTO test VALUES (UUID '12151fd2-7586-11e9-8f9e-2a86e4085a59');

Everything looks good: I can query the table, and running parquet meta on the resulting file in HDFS suggests it is written correctly according to the Parquet specification (a fixed-length byte array of 16 bytes with the UUID logical type):

message table {
  optional fixed_len_byte_array(16) id (UUID) = 1;
}

Now I try to read it in Spark:

scala> spark.sql("select * from test").printSchema
root
 |-- id: string (nullable = true)

scala> spark.sql("select * from test").show(false)
22/02/03 12:16:18 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (bn-hdpz1.vinted.net executor 2): java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.unsafe.types.UTF8String ([B is in module java.base of loader 'bootstrap'; org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app')
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:46)

The same thing happens when trying to insert from Spark:

scala> spark.sql("insert into test values ('e9238c3e-3aa6-4668-aceb-f9507a8f8d59')")
22/02/03 12:42:29 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 4) (hx-hdpz2.vinted.net executor 3): java.lang.ClassCastException: class org.apache.spark.unsafe.types.UTF8String cannot be cast to class [B (org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app'; [B is in module java.base of loader 'bootstrap')
    at org.apache.iceberg.spark.data.SparkParquetWriters$ByteArrayWriter.write(SparkParquetWriters.java:291)

The docs seem to suggest that UUID should be converted to a string in Spark, but after reading the source code I don't see how this is supposed to work: the UUID type is simply mapped to String here, which is not enough, since the byte representation of a UUID can't be cast to a String directly.
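For illustration only (this is not Iceberg's actual code, and uuidBytesToString / uuidStringToBytes are hypothetical helper names), here is a minimal Scala sketch of the conversion a reader/writer would need: the 16-byte fixed array stored in Parquet has to be decoded into a java.util.UUID before it can be rendered as a string, and the string has to be re-encoded into 16 bytes on the write path.

import java.nio.ByteBuffer
import java.util.UUID

// Decode the 16-byte big-endian value stored in Parquet into its canonical string form.
def uuidBytesToString(bytes: Array[Byte]): String = {
  require(bytes.length == 16, s"expected 16 bytes, got ${bytes.length}")
  val buf = ByteBuffer.wrap(bytes)
  new UUID(buf.getLong, buf.getLong).toString
}

// Encode a UUID string back into the 16-byte representation the Parquet writer expects.
def uuidStringToBytes(str: String): Array[Byte] = {
  val uuid = UUID.fromString(str)
  val buf = ByteBuffer.allocate(16)
  buf.putLong(uuid.getMostSignificantBits)
  buf.putLong(uuid.getLeastSignificantBits)
  buf.array()
}

A plain cast between byte[] and UTF8String cannot perform this translation, which is exactly the ClassCastException shown above.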

The way I see it, Iceberg should either:

fredkastl commented 2 years ago

See also this related issue: #4581

rdblue commented 1 year ago

I think we just need to add a couple of classes to convert string/bytes for UUIDs and update the builders to use them. @nastra is looking into this.

nastra commented 1 year ago

I opened https://github.com/apache/iceberg/pull/7399 to address this for Spark.

nastra commented 1 year ago

This has been fixed by https://github.com/apache/iceberg/pull/7399 and also backported to Spark 3.3 and 3.2 with https://github.com/apache/iceberg/pull/7496 and https://github.com/apache/iceberg/pull/7497.

igniris87 commented 3 months ago

Hey @nastra, how can I use this fix https://github.com/apache/iceberg/pull/7399 with PySpark and AWS Glue?

nastra commented 3 months ago

> hey nastra how can I use this fix #7399 using pyspark and aws glue?

@igniris87 make sure to use Iceberg 1.3.0 or later, as that is the version the fix shipped in.
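For anyone wiring this up by hand, the following is a minimal sketch (in Scala, since the thread above uses spark-shell; the same settings can be passed as job parameters or --conf options in a Glue/PySpark job) of pinning an Iceberg runtime that includes the fix. The catalog name glue_catalog, the warehouse path, and the Spark 3.3 / Scala 2.12 artifact coordinates are assumptions to adjust for your environment.

import org.apache.spark.sql.SparkSession

// Assumes Spark 3.3 / Scala 2.12; pick the iceberg-spark-runtime artifact matching your Spark version.
val spark = SparkSession.builder()
  .appName("iceberg-uuid-example")
  // Use an Iceberg release that contains the UUID fix (1.3.0 or later).
  .config("spark.jars.packages",
    "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0")
  .config("spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  // Hypothetical catalog name backed by the AWS Glue Data Catalog.
  .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
  .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/iceberg/")
  .getOrCreate()

// Note: GlueCatalog also needs the AWS SDK / Iceberg AWS dependencies on the classpath
// (already available on Glue jobs; add them yourself when running elsewhere).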