apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Add table statistics #1285

Open ndrluis opened 3 weeks ago

ndrluis commented 3 weeks ago

The Java expire snapshot process also expires table statistics and partition statistics. I am implementing table statistics so that our expire snapshot is compatible with the Java implementation.
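To illustrate the behavior described above, here is a minimal sketch of snapshot expiration that also drops the statistics entries tied to the expired snapshots. It is not PyIceberg's or the Java library's code: `StatisticsFile`, `TableMetadata`, and `expire_snapshots` are hypothetical, simplified stand-ins, and the file paths are made up.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class StatisticsFile:
    # Hypothetical, simplified stand-in for a statistics entry in table metadata.
    snapshot_id: int
    statistics_path: str


@dataclass
class TableMetadata:
    # Hypothetical, minimal model: only the fields this sketch needs.
    snapshot_ids: List[int]
    statistics: List[StatisticsFile] = field(default_factory=list)


def expire_snapshots(metadata: TableMetadata, expired: Set[int]) -> TableMetadata:
    """Return new metadata without the expired snapshots or their statistics."""
    return TableMetadata(
        snapshot_ids=[s for s in metadata.snapshot_ids if s not in expired],
        # Statistics that point at an expired snapshot are dropped as well.
        statistics=[s for s in metadata.statistics if s.snapshot_id not in expired],
    )


metadata = TableMetadata(
    snapshot_ids=[1, 2, 3],
    statistics=[
        StatisticsFile(1, "s3://example-bucket/stats-1.puffin"),  # made-up path
        StatisticsFile(3, "s3://example-bucket/stats-3.puffin"),  # made-up path
    ],
)
after = expire_snapshots(metadata, expired={1, 2})
print(after.snapshot_ids)  # [3]
```

Only the statistics entry for snapshot 3 survives in `after.statistics`; partition statistics would presumably follow the same pattern.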

ndrluis commented 3 weeks ago

I plan to move the set/remove statistics methods from the Transaction class to another class, such as ManageSnapshot. In the meantime, I’d like to confirm with everyone whether I’m heading in the right direction with the current implementation.
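For readers following along, the set/remove operations can be pictured as pure metadata updates, roughly as in the sketch below. It assumes at most one statistics file is tracked per snapshot, so setting statistics replaces any existing entry for that snapshot. The names `StatisticsFile`, `set_statistics`, and `remove_statistics` are hypothetical and do not reflect the signatures in this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StatisticsFile:
    # Hypothetical, simplified stand-in for a statistics entry in table metadata.
    snapshot_id: int
    statistics_path: str
    file_size_in_bytes: int


def set_statistics(current: List[StatisticsFile], new: StatisticsFile) -> List[StatisticsFile]:
    """Add `new`, replacing any existing entry for the same snapshot (assumed rule)."""
    kept = [s for s in current if s.snapshot_id != new.snapshot_id]
    return kept + [new]


def remove_statistics(current: List[StatisticsFile], snapshot_id: int) -> List[StatisticsFile]:
    """Drop the statistics entry for `snapshot_id`, if there is one."""
    return [s for s in current if s.snapshot_id != snapshot_id]


stats: List[StatisticsFile] = []
stats = set_statistics(stats, StatisticsFile(10, "stats-10-a.puffin", 1024))
stats = set_statistics(stats, StatisticsFile(10, "stats-10-b.puffin", 2048))  # replaces the first
stats = set_statistics(stats, StatisticsFile(11, "stats-11.puffin", 512))
stats = remove_statistics(stats, 11)
print([s.statistics_path for s in stats])  # ['stats-10-b.puffin']
```

Neither function touches the Puffin files themselves; only the list of references held in the metadata changes.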

@Fokko @sungwy @kevinjqliu

ndrluis commented 2 weeks ago

@kevinjqliu could you please review it once more?

ndrluis commented 1 week ago

> Do you know which engines can currently generate Puffin files? It would be great to add an integration test with a Spark-generated Puffin file.

@kevinjqliu As far as I know, only Trino can generate them. What kind of test would you like to see? I believe this PR already covers all the relevant cases. If PyIceberg could generate or read Puffin files, I agree it would be useful to add tests that check compatibility between engines. However, I think it only makes sense to test Puffin files on the read side, since testing generation would mean verifying the implementation of something that isn’t our responsibility. In this case, it’s just a metadata update.

What do you think?