v1 table data file spec id is None

puchengy commented 1 year ago

Apache Iceberg version

None

Please describe the bug 🐞

For v1 tables the data file spec_id is optional. Spark seems able to recognize the spec_id, but pyiceberg returns spec_id=None. Any idea why?

spark

spark-sql> select * from pyang.test_ray_iceberg_read.files;
content file_path   file_format spec_id partition   record_count    file_size_in_bytes  column_sizes    value_counts    null_value_counts   nan_value_counts    lower_bounds    upper_bounds    key_metadata    split_offsets   equality_ids    sort_order_id   readable_metrics
0   s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-02/userid_bucket_16=4/00000-2-72876d76-7f6a-4b82-812e-5390351917ef-00001.parquet PARQUET 1   {"dt":"2022-01-02","userid_bucket_16":4}    1   871 {1:36,2:37,3:46}    {1:1,2:1,3:1}   {1:0,2:0,3:0}   {}  {1:,2:2,3:2022-01-02}   {1:,2:2,3:2022-01-02}   NULL    [4] NULL    0   {"col":{"column_size":37,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2","upper_bound":"2"},"dt":{"column_size":46,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2022-01-02","upper_bound":"2022-01-02"},"userid":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":2,"upper_bound":2}}
0   s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-01/00000-1-f2b3a0c1-a3e3-482a-bf24-9831626c5be7-00001.parquet    PARQUET 0   {"dt":"2022-01-01","userid_bucket_16":null} 1   870 {1:36,2:36,3:46}    {1:1,2:1,3:1}   {1:0,2:0,3:0}   {}  {1:,2:1,3:2022-01-01}   {1:,2:1,3:2022-01-01}   NULL    [4] NULL    0   {"col":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"1","upper_bound":"1"},"dt":{"column_size":46,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2022-01-01","upper_bound":"2022-01-01"},"userid":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":1,"upper_bound":1}}
Time taken: 0.494 seconds, Fetched 2 row(s)

pyiceberg (0.4.0)

>>> tasks2[0]
FileScanTask(file=DataFile[file_path='s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-02/userid_bucket_16=4/00000-2-72876d76-7f6a-4b82-812e-5390351917ef-00001.parquet', file_format=FileFormat.PARQUET, partition=Record[dt='2022-01-02', userid_bucket_16=4], record_count=1, file_size_in_bytes=871, column_sizes={1: 36, 2: 37, 3: 46}, value_counts={1: 1, 2: 1, 3: 1}, null_value_counts={1: 0, 2: 0, 3: 0}, nan_value_counts={}, lower_bounds={1: b'\x02\x00\x00\x00', 2: b'2', 3: b'2022-01-02'}, upper_bounds={1: b'\x02\x00\x00\x00', 2: b'2', 3: b'2022-01-02'}, key_metadata=None, split_offsets=[4], sort_order_id=0, content=DataFileContent.DATA, equality_ids=None, spec_id=None], delete_files=set(), start=0, length=871)
>>> tasks2[1]
FileScanTask(file=DataFile[file_path='s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-01/00000-1-f2b3a0c1-a3e3-482a-bf24-9831626c5be7-00001.parquet', file_format=FileFormat.PARQUET, partition=Record[dt='2022-01-01'], record_count=1, file_size_in_bytes=870, column_sizes={1: 36, 2: 36, 3: 46}, value_counts={1: 1, 2: 1, 3: 1}, null_value_counts={1: 0, 2: 0, 3: 0}, nan_value_counts={}, lower_bounds={1: b'\x01\x00\x00\x00', 2: b'1', 3: b'2022-01-01'}, upper_bounds={1: b'\x01\x00\x00\x00', 2: b'1', 3: b'2022-01-01'}, key_metadata=None, split_offsets=[4], sort_order_id=0, content=DataFileContent.DATA, equality_ids=None, spec_id=None], delete_files=set(), start=0, length=870)

Fokko commented 1 year ago

Hey @puchengy thanks for raising this!

I was unsure about this because field 141 (spec-id) is not mentioned in the spec, but it looks like we can add it: https://github.com/apache/iceberg/pull/8730

puchengy commented 1 year ago

@Fokko Hi, I thought we already had that: https://github.com/apache/iceberg/blob/pyiceberg-0.4.0rc2/python/pyiceberg/manifest.py#L162. Or is this not what you meant?

puchengy commented 1 year ago

@Fokko And based on https://github.com/apache/iceberg/pull/8730, it seems that we would like to inherit the spec id from the manifest file as well? https://github.com/apache/iceberg-python/blob/ce8535851653b7c0290b8222f40a4c3e507ba39e/pyiceberg/manifest.py#L497

puchengy commented 1 year ago

@Fokko do you know?

Fokko commented 1 year ago

@puchengy Sorry for not replying. I think we can include this in the next release. It shouldn't be too hard to carry this information over from the manifest-list.
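
A minimal sketch of what carrying the spec id over from the manifest-list could look like. It assumes each manifest-list entry exposes the spec id its manifest was written with, and that a data file entry read from a v1 manifest may have no spec id of its own. `DataFileEntry`, `ManifestInfo`, and `inherit_spec_id` are hypothetical names for illustration, not pyiceberg's actual classes or functions, and the file paths are made up.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical stand-in for a data file entry read from a manifest."""

    file_path: str
    spec_id: Optional[int] = None  # assumption: v1 entries may omit the spec id


@dataclass(frozen=True)
class ManifestInfo:
    """Hypothetical stand-in for one entry of the manifest-list."""

    manifest_path: str
    partition_spec_id: int  # spec id the manifest was written with


def inherit_spec_id(
    manifest: ManifestInfo, entries: List[DataFileEntry]
) -> List[DataFileEntry]:
    """Fill in a missing spec_id from the manifest that lists the file.

    Entries that already carry a spec_id are returned unchanged.
    """
    return [
        entry
        if entry.spec_id is not None
        else replace(entry, spec_id=manifest.partition_spec_id)
        for entry in entries
    ]


# Usage: one entry without a spec id, one that already has its own.
manifest = ManifestInfo(manifest_path="example-manifest.avro", partition_spec_id=1)
entries = [
    DataFileEntry(file_path="example-a.parquet"),
    DataFileEntry(file_path="example-b.parquet", spec_id=0),
]
resolved = inherit_spec_id(manifest, entries)
```

Here the first entry picks up the manifest's spec id and the second keeps the value it already had. Whether an explicit value should win over the manifest's value is a design choice made for this sketch only.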

apache / iceberg-python

v1 table data file spec id is None #46