
`add_files` raises `KeyError` if parquet file does not have column stats #1353

binayakd opened this issue 3 days ago

Apache Iceberg version

0.8.0 (latest release)

Please describe the bug 🐞

Using the NYC taxi data set found here, if I follow the standard way of creating the catalog and table, but call `add_files` instead of doing an `append`:

```python
from pyiceberg.catalog.sql import SqlCatalog
import pyarrow.parquet as pq

warehouse_path = "/tmp/warehouse"
data_file_path = "/tmp/test-data"

catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    }
)

df = pq.read_table(f"{data_file_path}/yellow_tripdata_2024-01.parquet")

catalog.create_namespace("default")

table = catalog.create_table(
    "default.taxi_dataset",
    schema=df.schema,
)

# Register the existing parquet file instead of appending the Arrow table
table.add_files([f"{data_file_path}/yellow_tripdata_2024-01.parquet"])
```
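
For context, one way to confirm the file really ships without column-level statistics is to inspect the parquet metadata directly (a sketch using pyarrow's `ParquetFile` metadata API; the path is the same one used in the reproduction above):

```python
import pyarrow.parquet as pq

data_file_path = "/tmp/test-data"
metadata = pq.ParquetFile(f"{data_file_path}/yellow_tripdata_2024-01.parquet").metadata

for rg in range(metadata.num_row_groups):
    for col in range(metadata.num_columns):
        chunk = metadata.row_group(rg).column(col)
        # is_stats_set is False when the writer recorded no min/max/null counts
        print(chunk.path_in_schema, chunk.is_stats_set)
```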

I get a `KeyError`:

```
Traceback (most recent call last):
  File "/home/binayak/Dropbox/dev/tests/iceberg-test/main.py", line 42, in <module>
    main()
  File "/home/binayak/Dropbox/dev/tests/iceberg-test/main.py", line 29, in main
    table.add_files([f"{data_file_path}/yellow_tripdata_2024-01.parquet"])
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 1036, in add_files
    tx.add_files(
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 594, in add_files
    for data_file in data_files:
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 1537, in _parquet_files_to_data_files
    yield from parquet_files_to_data_files(io=io, table_metadata=table_metadata, file_paths=iter(file_paths))
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/io/pyarrow.py", line 2535, in parquet_files_to_data_files
    statistics = data_file_statistics_from_parquet_metadata(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/io/pyarrow.py", line 2400, in data_file_statistics_from_parquet_metadata
    del col_aggs[field_id]
        ~~~~~~~~^^^^^^^^^^
KeyError: 1
```

This is because this parquet file does not have column-level stats set, so in the source code it goes into the else block here. As a result, `col_aggs` and `null_value_counts` are not updated, but `invalidate_col` is. So when the `del` statement is run here, the `KeyError` is thrown.
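
The failure reduces to ordinary dict behaviour; a minimal sketch of the mismatch (the dict names mirror the ones in `data_file_statistics_from_parquet_metadata`, but the loop shape and values are assumed from the description above):

```python
# col_aggs is only populated when a column chunk carries statistics; for this
# file it stays empty, while invalidate_col still collects the field id.
col_aggs = {}
invalidate_col = {1}

for field_id in invalidate_col:
    del col_aggs[field_id]  # raises KeyError: 1, matching the traceback above
```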

As discussed on Slack, @kevinjqliu proposed to replace `del col_aggs[field_id]` with `col_aggs.pop(field_id, None)`.
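
Applied to the sketch above, the proposed change tolerates field ids that never received an aggregate (again a sketch of the shape of the fix, not the exact code in `pyiceberg/io/pyarrow.py`):

```python
col_aggs = {}          # still empty: the file carried no column statistics
invalidate_col = {1}

for field_id in invalidate_col:
    # pop() with a default is a no-op on a missing key, so files without
    # column-level stats no longer crash here.
    col_aggs.pop(field_id, None)

print(col_aggs)  # {} -- no KeyError
```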

I will be raising a PR soon.