apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

write UUID fail on _check_schema_compatible #855

Open raphaelauv opened 1 week ago

raphaelauv commented 1 week ago

Apache Iceberg version

main (development)

Please describe the bug 🐞

I can't write a UUID to an Iceberg table:

from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, UUIDType
import polars as pl
import uuid

catalog = RestCatalog(
    "default",
    **{
        "uri": "http://localhost:8181",
        "warehouse": "s3://test-bucket/",
        "s3.endpoint": "http://localhost:9020",
    },
)

catalog.create_namespace("default")
id_to_write = uuid.uuid4()

iceberg_schema = Schema(
    NestedField(1, "id", UUIDType(), required=True),
)
catalog.create_table(
    "default.aaa",
    schema=iceberg_schema,
)
df = pl.DataFrame({}).with_columns([pl.lit(id_to_write.bytes).alias("id")])

df = df.to_arrow()

df = df.cast(target_schema=iceberg_schema.as_arrow())

table = catalog.load_table("default.aaa")
table.append(df)

[screenshot of the error]

But if I comment out the call to _check_schema_compatible, the write to the table succeeds

https://github.com/apache/iceberg-python/blob/a6cd0cf325b87b360077bad1d79262611ea64424/pyiceberg/table/__init__.py#L485

and I can read the data with Trino.

Screenshot from 2024-06-25 13-43-09

kevinjqliu commented 1 week ago

Thanks for reporting this issue!

_check_schema_compatible is currently stricter than it should be, and #829 relaxes it. Would #829 fix your issue above?

raphaelauv commented 6 days ago

Hey @kevinjqliu, I tried your PR, but it does not fix the UUID insert.

kevinjqliu commented 6 days ago

I see. I also verified that _check_schema_compatible raises an error.

Here's an example to reproduce it:

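import uuid

import polars as pl

from pyiceberg.schema import Schema
# Importable from pyiceberg.table at the commit linked above; other versions may differ.
from pyiceberg.table import _check_schema_compatible
from pyiceberg.types import NestedField, UUIDType


def test_schema_uuid() -> None:
    iceberg_schema = Schema(
        NestedField(1, "id", UUIDType(), required=True),
    )

    # Build an Arrow table whose "id" column holds the raw UUID bytes.
    id_to_write = uuid.uuid4()
    df = pl.DataFrame({}).with_columns([pl.lit(id_to_write.bytes).alias("id")])
    df = df.to_arrow()
    df = df.cast(target_schema=iceberg_schema.as_arrow())

    # Errors at the time of this report, although df was cast to the table's Arrow schema.
    _check_schema_compatible(iceberg_schema, df.schema)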

Looks like @Fokko opened an issue regarding UUID support in Arrow: https://github.com/apache/arrow/issues/15058

@Fokko can you chime in here on writing UUID data type?

Fokko commented 5 days ago

Thanks for pinging me here. So there is some progress on the Arrow side. There has been a vote to adopt the UUID type, and it has been added to the format.

Thanks for the example code @kevinjqliu:

[screenshot of the example's output]

I would say that the two types are equivalent. If we know that the field in the Iceberg table is a UUID, writing a Fixed[16] is okay and should pass the compatibility check.
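To make the suggestion concrete, here is a minimal standard-library sketch of what such a relaxed rule could look like. It is not PyIceberg's implementation: UuidT, FixedT, and types_compatible are hypothetical names invented for illustration, and the only facts it relies on are that a UUID is exactly 16 bytes and round-trips losslessly through its byte form.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class UuidT:
    """Hypothetical stand-in for a table column of UUID type."""


@dataclass(frozen=True)
class FixedT:
    """Hypothetical stand-in for a fixed-length binary type."""

    length: int


def types_compatible(table_type: object, data_type: object) -> bool:
    """Return True if data of `data_type` may be written to a `table_type` column."""
    if table_type == data_type:
        return True
    # Relaxation discussed in the thread: a UUID column accepts Fixed[16] data.
    return isinstance(table_type, UuidT) and data_type == FixedT(16)


# A UUID is exactly 16 bytes and survives a round trip through its byte form,
# which is why treating Fixed[16] as a valid carrier for UUID values is safe.
value = uuid.uuid4()
raw = value.bytes
assert len(raw) == 16
assert uuid.UUID(bytes=raw) == value

print(types_compatible(UuidT(), FixedT(16)))  # accepted under the relaxed rule
print(types_compatible(UuidT(), FixedT(8)))   # wrong width, still rejected
```

The rule is deliberately one-directional in this sketch: a UUID column accepts Fixed[16] input, but a Fixed[16] column does not silently accept a UUID-typed input. Whether the real check should also be symmetric is a design decision for the maintainers.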