apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Are invalidated columns tracked properly during data file creation? #862

Open cgbur opened 4 days ago

cgbur commented 4 days ago

Apache Iceberg version

None

Please describe the bug 🐞

In the data_file_statistics_from_parquet_metadata function, we track the invalidate_col set, which exists largely to mark when a column's row group does not have statistics set and the overall statistics therefore cannot be known accurately. I have a concern with the current implementation that I would like to hear the opinion of those with more experience in this field.

The current implementation resets the invalidated-column set on every row group, while the invalid columns are only deleted after all row groups have been processed. I can imagine a scenario where the first row group has no statistics but the last row group does. In that scenario, I believe invalid Iceberg metadata would be generated that includes metrics for columns that do not have statistics in every row group. I think the fix is to hoist the invalidated-column set out of the row-group loop so it is not reset on every iteration; a standalone simulation of the difference follows the proposal below.

currently:

def data_file_statistics_from_parquet_metadata():
    for r in range(parquet_metadata.num_row_groups):
        invalidate_col: Set[int] = set()  # reset on every row group
        for pos in range(parquet_metadata.num_columns):
            column = parquet_metadata.row_group(r).column(pos)
            if column.is_stats_set:
                ...  # accumulate col_aggs / null_value_counts for this field_id
            else:
                invalidate_col.add(field_id)

    # only the set from the last row group survives to this point
    for field_id in invalidate_col:
        del col_aggs[field_id]
        del null_value_counts[field_id]

proposal:

def data_file_statistics_from_parquet_metadata():
    invalidate_col: Set[int] = set()  # hoisted, so it now tracks the whole file
    for r in range(parquet_metadata.num_row_groups):
        for pos in range(parquet_metadata.num_columns):
            column = parquet_metadata.row_group(r).column(pos)
            if column.is_stats_set:
                ...  # accumulate col_aggs / null_value_counts for this field_id
            else:
                invalidate_col.add(field_id)

    for field_id in invalidate_col:
        del col_aggs[field_id]
        del null_value_counts[field_id]
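
To make the difference concrete, here is a minimal, self-contained simulation (the per-row-group flags are hypothetical; this is not the real PyIceberg code):

# Each inner list holds per-column is_stats_set flags for one row group.
row_groups = [
    [False, True],  # row group 0: column 0 has no stats
    [True, True],   # row group 1: both columns have stats
]

def invalidated_current(groups):
    for flags in groups:
        invalidate_col = set()  # reset on every row group (current behavior)
        for col, has_stats in enumerate(flags):
            if not has_stats:
                invalidate_col.add(col)
    return invalidate_col  # reflects only the last row group

def invalidated_proposed(groups):
    invalidate_col = set()  # tracks the whole file (proposed behavior)
    for flags in groups:
        for col, has_stats in enumerate(flags):
            if not has_stats:
                invalidate_col.add(col)
    return invalidate_col

print(invalidated_current(row_groups))   # set(): column 0 wrongly keeps its metrics
print(invalidated_proposed(row_groups))  # {0}: column 0 correctly invalidated
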
Fokko commented 4 days ago

Hey @cgbur, that's a great point you're making here. I don't think this is taken into consideration, and it might even lead to invalid data. Did you encounter this in practice? If so, do you know how to reproduce it?

Statistics are fundamental to Iceberg, and missing statistics introduce a lot of inefficiency in query planning and in the actual data processing (a file without statistics will be included in every query plan).

cgbur commented 4 days ago

I have not come up with a reproducible example, but I discovered this when I found a bug in my code: pyarrow does not write stats for columns where all the values are null. I suspected that it might do the same when only one row group is all null, but in that case it still writes out stats. Whether stats should be written for all-null columns is an issue for that library.
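
A minimal sketch of how both cases can be inspected with pyarrow (file names are illustrative; the printed values reflect the behavior reported above, which may vary by pyarrow version):

import pyarrow as pa
import pyarrow.parquet as pq

# Case 1: the whole column is null -- stats are reportedly not written.
pq.write_table(
    pa.table({"a": pa.array([None, None], type=pa.int64())}), "all_null.parquet"
)
meta = pq.ParquetFile("all_null.parquet").metadata
print(meta.row_group(0).column(0).is_stats_set)  # reportedly False

# Case 2: only one row group is all null -- stats are reportedly still written.
schema = pa.schema([("a", pa.int64())])
with pq.ParquetWriter("mixed.parquet", schema) as writer:
    writer.write_table(pa.table({"a": pa.array([None, None], type=pa.int64())}))
    writer.write_table(pa.table({"a": pa.array([1, 2], type=pa.int64())}))
meta = pq.ParquetFile("mixed.parquet").metadata
for r in range(meta.num_row_groups):
    print(r, meta.row_group(r).column(0).is_stats_set)  # reportedly True for both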

I do think it's valid per the Parquet spec to write stats for column A in row group 0 but not in row group 1, and so on. I cannot find this in the spec's documentation, but I have no reason to believe it is invalid. In all likelihood, none of the current major writers do this. However, I do think the current behavior of resetting on every row group, and effectively using only the results from the last row group, is incorrect and should be changed. Currently a column's stats are invalidated only if the last row group has no statistics, when I think the intention was to invalidate a column's stats if any of the row groups for that column lacks statistics.
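
Put differently, the intended rule seems to be: keep metrics for a field only if statistics are set in every row group. A sketch of that rule as a predicate (a hypothetical helper, not existing PyIceberg API):

def stats_valid_for_field(parquet_metadata, pos):
    # True only if every row group has statistics set for this column position.
    return all(
        parquet_metadata.row_group(r).column(pos).is_stats_set
        for r in range(parquet_metadata.num_row_groups)
    )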