apache / iceberg-rust

Apache Iceberg
https://rust.iceberg.apache.org/
Apache License 2.0
474 stars 97 forks source link

fix (manifest-list): added serde aliases to support both forms conventions #365

Closed a-agmon closed 1 month ago

a-agmon commented 1 month ago

This PR fixes #338 The PR adds serde aliases to fields that were renamed in https://github.com/apache/iceberg/pull/5338 (merged in 1.5.0), in order to support both conventions/versions - i.e., new and previous naming conventions.

In the current implementation of Iceberg-Rust, deserialization is done using field names rather than field id, and therefore manifests written using newer Iceberg lib are not supported.

sdd commented 1 month ago

Looks good, at least until we move to field IDs. But could we add a test data file to check the aliases work? 👍🏼

a-agmon commented 1 month ago

Looks good, at least until we move to field IDs. But could we add a test data file to check the aliases work? 👍🏼

Absolutely. Thanks

a-agmon commented 1 month ago

Added test files and test. Thanks @sdd

a-agmon commented 1 month ago

I'm okay with this, but before the next release, we need to get field-ID resolution in place. That's a very important part of Iceberg. In this case, we stumbled into a naming difference of Spark, but it might also be that other engines produce slightly different names.

Thanks @Fokko , Typo fixed and Yes, this is indeed aimed to be an interim solution, though the test and test files should be good to help us test a more robust solution.

With respect to the more long term solution that resolves the field name by field-id, I have patched this for my own use case, and will be happy to PR, but would be grateful for some "pre-ruling" on the direction and design of the solution. Mostly because there some open questions about this, like the use of apace_avro crate etc.

Anyways, here is the note about the suggested direction https://github.com/apache/iceberg-rust/issues/338#issuecomment-2095268461

zeodtr commented 1 month ago

@a-agmon I think the PR is simple and clear. BTW, it would be nice to change the field names of the ManifestFileV2 struct to the 'correct' ones (for example, pub added_data_files_count to pub added_files_count but it would require changes in other modules.

Fokko commented 1 month ago

BTW, it would be nice to change the field names of the ManifestFileV2 struct to the 'correct' ones (for example, pub added_data_files_count to pub added_files_count but it would require changes in other modules.

I would like that as well, maybe we can do that in a separate PR so we can get this one in.

With respect to the more long term solution that resolves the field name by field-id, I have patched this for my own use case, and will be happy to PR, but would be grateful for some "pre-ruling" on the direction and design of the solution. Mostly because there some open questions about this, like the use of apace_avro crate etc.

I think the solution that you posted makes sense. In PyIceberg we decided, in the end, to implement our own Avro reader to make these things easier. What we do there is convert an Avro schema directly to an Iceberg schema and use an Iceberg Avro reader (that you would also use for data-files), to read the Avro files into the right struct`s.