Traceback (most recent call last):
  File "/home/binayak/Dropbox/dev/tests/iceberg-test/main.py", line 42, in <module>
    main()
  File "/home/binayak/Dropbox/dev/tests/iceberg-test/main.py", line 29, in main
    table.add_files([f"{data_file_path}/yellow_tripdata_2024-01.parquet"])
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 1036, in add_files
    tx.add_files(
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 594, in add_files
    for data_file in data_files:
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 1537, in _parquet_files_to_data_files
    yield from parquet_files_to_data_files(io=io, table_metadata=table_metadata, file_paths=iter(file_paths))
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/io/pyarrow.py", line 2535, in parquet_files_to_data_files
    statistics = data_file_statistics_from_parquet_metadata(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/io/pyarrow.py", line 2400, in data_file_statistics_from_parquet_metadata
    del col_aggs[field_id]
        ~~~~~~~~^^^^^^^^^^
KeyError: 1
Apache Iceberg version
0.8.0 (latest release)
Please describe the bug 🐞
Using the NYC taxi data set found here, if I follow the standard way of creating a catalog and a table, but call add_files instead of append, I get the KeyError shown in the traceback above.
This happens because the Parquet file does not have column-level statistics set, so the source code goes into the else block here. As a result, col_aggs and null_value_counts are not updated, but invalidate_col is. When the del statement runs here, the KeyError is thrown because the key was never added to col_aggs.
As discussed on Slack, @kevinjqliu proposed replacing del col_aggs[field_id] with col_aggs.pop(field_id, None).
I will be raising a PR soon.
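To illustrate the failure mode and the proposed fix in isolation, here is a minimal sketch in plain Python. It is not the pyiceberg source: the two helper functions are hypothetical stand-ins that only mimic the final cleanup step, and the empty col_aggs dict plus invalidate_col = {1} are assumed values chosen to match the "KeyError: 1" in the traceback.

```python
# Simplified stand-in for the bookkeeping described in this issue.
# The helper names below are hypothetical; the real pyiceberg function
# is more involved and is not reproduced here.


def drop_invalidated_strict(col_aggs, invalidate_col):
    """Mimics `del col_aggs[field_id]`: raises if the key was never added."""
    for field_id in invalidate_col:
        del col_aggs[field_id]
    return col_aggs


def drop_invalidated_tolerant(col_aggs, invalidate_col):
    """Mimics `col_aggs.pop(field_id, None)`: missing keys are ignored."""
    for field_id in invalidate_col:
        col_aggs.pop(field_id, None)
    return col_aggs


# Assumed state for a file without column-level statistics: nothing was
# aggregated for field 1, yet field 1 was still marked as invalidated.
col_aggs = {}
invalidate_col = {1}

try:
    drop_invalidated_strict(dict(col_aggs), invalidate_col)
    strict_error = None
except KeyError as exc:
    strict_error = exc

tolerant_result = drop_invalidated_tolerant(dict(col_aggs), invalidate_col)

print(repr(strict_error))  # KeyError(1)
print(tolerant_result)     # {}
```

The tolerant variant still removes the entry when it does exist, so behaviour for files that do carry statistics should be unchanged under this sketch's assumptions.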