apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Writing an arrow table with date64 unsupported #830

Open vtk9 opened 1 week ago

vtk9 commented 1 week ago

Apache Iceberg version

0.6.0 (latest release)

Please describe the bug 🐞

TypeError: Unsupported type: date64[ms]

Reproduction:

from decimal import Decimal
from pyiceberg.catalog.sql import SqlCatalog
import pyarrow as pa

pylist = [{'decimal_col': 1234}]
arrow_schema = pa.schema(
    [
        pa.field('decimal_col', pa.date64()),
    ],
)
arrow_table = pa.Table.from_pylist(pylist, schema=arrow_schema)

catalog = SqlCatalog(
    'test_catalog',
    **{
        'type': 'sql',
        'uri': 'sqlite:///pyiceberg.db',
    },
)

namespace = 'test_ns'
table_name = 'test_table'

catalog.create_namespace(namespace=namespace)
new_table = catalog.create_table(
    identifier=f'{namespace}.{table_name}',
    schema=arrow_schema,
    location='.',
)

new_table.append(arrow_table)

kevinjqliu commented 1 week ago

date32 is supported here: https://github.com/apache/iceberg-python/blob/a29491af52dc4aff46a325bbaac4a11c2f2bfabc/pyiceberg/io/pyarrow.py#L915-L916

We likely need to add a new if-statement for date64.

vtk9 commented 1 week ago

@kevinjqliu Thanks! There might be other types that are not supported. For example, uint16 is not supported, while all of the other integer types are.

I also created https://github.com/apache/iceberg-python/issues/837, another bug I found today when using pyiceberg to write.

vtk9 commented 6 days ago

@kevinjqliu As part of this fix, would it be possible to also include in the exception message which column is causing the problem, i.e. 'decimal_col'?

Should I create a new issue to track this feature request?

Alternatively, raise a more specific exception such as UnsupportedPyArrowType and include the pyarrow.Field (column_name, column_type) in the exception?
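
To make the suggestion concrete, here is a minimal sketch of what such an exception could look like. Everything in it is hypothetical and for illustration only: `FieldInfo` is a plain stand-in for `pyarrow.Field` so the snippet runs without pyarrow, `SUPPORTED_TYPES` is a made-up list, and `UnsupportedPyArrowType` is the name proposed above, not an existing PyIceberg class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldInfo:
    """Hypothetical stand-in for pyarrow.Field: a column name and a type name."""

    name: str
    type_name: str


class UnsupportedPyArrowType(TypeError):
    """Hypothetical exception that records which column could not be converted."""

    def __init__(self, field: FieldInfo) -> None:
        self.field = field
        super().__init__(
            f"Column '{field.name}' has an unsupported type: {field.type_name}"
        )


# Made-up set of supported type names, for illustration only.
SUPPORTED_TYPES = {"int32", "int64", "date32[day]", "string"}


def check_fields(fields: list[FieldInfo]) -> None:
    """Raise UnsupportedPyArrowType for the first field with an unsupported type."""
    for field in fields:
        if field.type_name not in SUPPORTED_TYPES:
            raise UnsupportedPyArrowType(field)


if __name__ == "__main__":
    try:
        check_fields([FieldInfo("decimal_col", "date64[ms]")])
    except UnsupportedPyArrowType as exc:
        # The message names the offending column, and the field is attached.
        print(exc, "| field:", exc.field)
```

Subclassing TypeError in this sketch means existing `except TypeError` handlers would keep working, while callers who want the column can read `exc.field`.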

kevinjqliu commented 6 days ago

As part of this fix, would it be possible to also include in the exception message which column is causing the problem, i.e. 'decimal_col'? Should I create a new issue to track this feature request?

Yeah, that's a great idea. I'm in favor of opening a new issue to track the quality-of-life improvement for the error message.

Fokko commented 5 days ago

The problem is that Parquet will encode a date as an int32. Adding the if-statement would probably just push the issue down into the Parquet writer. I'm surprised to see this, since a date stored as an int32 has quite a bit of range.
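
The range argument can be checked with a little arithmetic. The sketch below assumes the usual Arrow conventions (date32 counts days since 1970-01-01 in an int32, date64 counts milliseconds since the same epoch in an int64). The helper `date64_ms_to_date32_days` is a hypothetical name used for illustration, not a PyIceberg or PyArrow function, and rejecting values that are not whole days is a design choice of this sketch.

```python
from datetime import date, timedelta

MS_PER_DAY = 86_400_000  # milliseconds in one day
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1

# How many years of days fit in the positive half of an int32.
APPROX_YEARS = INT32_MAX / 365.2425  # several million years


def date64_ms_to_date32_days(ms: int) -> int:
    """Convert milliseconds since the epoch to whole days since the epoch.

    Hypothetical helper: rejects values that are not a whole number of days
    and values whose day count would not fit in an int32.
    """
    if ms % MS_PER_DAY != 0:
        raise ValueError(f"{ms} ms is not a whole number of days")
    days = ms // MS_PER_DAY
    if not INT32_MIN <= days <= INT32_MAX:
        raise OverflowError(f"{days} days does not fit in an int32")
    return days


if __name__ == "__main__":
    epoch = date(1970, 1, 1)
    some_day = date(2024, 6, 1)
    ms = (some_day - epoch).days * MS_PER_DAY
    days = date64_ms_to_date32_days(ms)
    # Round trip: the day count maps back to the same calendar date.
    print(days, epoch + timedelta(days=days), round(APPROX_YEARS))
```

Note that the value 1234 used in the reproduction is not a whole number of days when read as milliseconds, so this strict helper would reject it. A real fix would have to decide whether to truncate or raise in that case.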

As part of this fix, would it be possible to also include in the exception message which column is causing the problem, i.e. 'decimal_col'?

That's a great idea! 🙌