Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Spark: Add read/write support for UUIDs from bytes #10635

Open raphaelauv opened 2 months ago

raphaelauv commented 2 months ago

Apache Iceberg version

1.5.2 (latest release)

Query engine

Spark

Please describe the bug 🐞

I can insert a string column into an Iceberg UUID column thanks to https://github.com/apache/iceberg/pull/7399

df = df.withColumn("id", lit(str(uuid.uuid4())))

but I can't insert a bytes column into an Iceberg UUID column

df = df.withColumn("id", lit(uuid.uuid4().bytes))
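For reference, the two representations carry the same value and differ only in encoding; in plain Python (outside Spark):

```python
import uuid

u = uuid.uuid4()
s = str(u)    # 36-char canonical string, accepted since PR #7399
b = u.bytes   # raw 16-byte big-endian representation, currently rejected
assert len(s) == 36 and len(b) == 16
assert uuid.UUID(s) == uuid.UUID(bytes=b) == u
```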

thanks all

nastra commented 2 months ago

@raphaelauv would you be interested in contributing a fix for this?

raphaelauv commented 2 months ago

hey @nastra, I do not have the time to contribute this feature right now, thanks for the offer :+1:

Until then, I'm sharing a hacky workaround :sweat_smile: :

from pyspark.sql import functions as F

# hex-encode the 16 bytes, lowercase, then insert dashes
# to form the canonical 8-4-4-4-12 UUID string
df = df.withColumn(
    "id",
    F.regexp_replace(
        F.lower(F.hex("id")),
        "(.{8})(.{4})(.{4})(.{4})(.{12})",
        "$1-$2-$3-$4-$5"
    )
)

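The same transformation can be checked in plain Python: hex-encoding the 16 bytes and regrouping them with the identical regex yields exactly the canonical string form.

```python
import re
import uuid

u = uuid.uuid4()
hex32 = u.bytes.hex()  # what F.lower(F.hex("id")) produces for the 16 bytes
canonical = re.sub(
    r"(.{8})(.{4})(.{4})(.{4})(.{12})",
    r"\1-\2-\3-\4-\5",
    hex32,
)
assert canonical == str(u)
```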
anuragmantri commented 2 months ago

I can give this a shot @nastra. Although I need to read the UUID PR first.

anuragmantri commented 1 month ago

I walked through the code and was able to reproduce this issue for Parquet writes with a test.

java.lang.IllegalArgumentException: Invalid UUID string: d��Iu���>�M�`
    at java.base/java.util.UUID.fromString(Unknown Source)
    at org.apache.iceberg.spark.data.SparkParquetWriters$UUIDWriter.write(SparkParquetWriters.java:426)
    at org.apache.iceberg.spark.data.SparkParquetWriters$UUIDWriter.write(SparkParquetWriters.java:411)
    at org.apache.iceberg.parquet.ParquetValueWriters$StructWriter.write(ParquetValueWriters.java:581)
    at org.apache.iceberg.parquet.ParquetWriter.add(ParquetWriter.java:135)

It looks like the visitor incorrectly casts the byte array to a string because of our conversion to Spark types here. Should we handle this casting at a higher level than SparkParquetWriters?

@RussellSpitzer @nastra

nastra commented 1 month ago

@anuragmantri I believe this is the correct place to do the casting. Spark itself doesn't support UUID as a type, so a UUID can only be represented as a string when it is written.
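Whatever layer ends up owning the cast, a byte-aware writer would need to build the UUID directly from the 16 bytes instead of going through UUID.fromString (which is what throws above). A minimal sketch of that conversion in Python (the real fix would be Java code in SparkParquetWriters; the helper name here is hypothetical), mirroring java.util.UUID's two-long constructor:

```python
import struct
import uuid


def uuid_from_bytes(raw: bytes) -> uuid.UUID:
    """Build a UUID from its 16-byte big-endian form, like new UUID(msb, lsb) in Java."""
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    # big-endian unsigned longs, analogous to two ByteBuffer.getLong() calls
    msb, lsb = struct.unpack(">QQ", raw)
    return uuid.UUID(int=(msb << 64) | lsb)


u = uuid.uuid4()
assert uuid_from_bytes(u.bytes) == u
```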