Make current-snapshot-id optional while maintaining backwards compatibility

apache / iceberg-rust

Apache Iceberg

https://rust.iceberg.apache.org/

Apache License 2.0

469 stars 95 forks source link

Make current-snapshot-id optional while maintaining backwards compatibility #374

Open s-akhtar-baig opened 1 month ago

s-akhtar-baig commented 1 month ago

Resolves https://github.com/apache/iceberg-rust/issues/352

Problem: Previous versions of Java (<1.4.0) implementations incorrectly assume the optional attribute current-snapshot-id to be a required attribute in TableMetadata.

Solution: Use legacy-current-snapshot-id environment variable to force iceberg-rust to create and load a table with a metadata file compatible with the older versions.

Testing: Added new unit tests.

Future work:

https://github.com/apache/iceberg-rust/issues/375: Implement a config file for iceberg-rust and provide users an additional option to set legacy-current-snapshot-id as a configuration value.

s-akhtar-baig commented 1 month ago

@Fokko, these changes make current-snapshot-id optional and uses a flag to maintain backwards compatibility. Let me know what you think.

Please note that, with the flag on we can create and load a table with -1 as the current-snapshot-id. This is different from pyiceberg which I believe only supports creating a table.

s-akhtar-baig commented 1 month ago

Also, do we have a preference on where we want to document the "backwards compatibility" section?

Fokko commented 1 month ago

Sorry for the late reply as I was touching grass.

We're trying to solve two problems:

Don't produce -1 since it is erroneous unless there is a snapshot with the ID -1. You can still create a table with -1 as the current snapshot ID if you set the flag. If you use Java <1.4.0 within your organization, this might be required to be able to read the tables.
When deserializing, and we encounter a -1 as the current snapshot ID, we should convert it into a None.

liurenjie1024 commented 1 month ago

cc @Xuanwo @Fokko @sdd @marvinlanhenke What do you think?

sdd commented 1 month ago

One problem with doing it through an env var is that it applies to every table you hit in your service. I think it would be better if it was a config param so that you can configure it per table.

s-akhtar-baig commented 1 month ago

@liurenjie1024 @sdd, sure, I can work on https://github.com/apache/iceberg-rust/issues/375 and use a config file to set the value. Let me know if you have a different approach in mind, thanks!

liurenjie1024 commented 1 month ago

@liurenjie1024 @sdd, sure, I can work on #375 and use a config file to set the value. Let me know if you have a different approach in mind, thanks!

@s-akhtar-baig I think the first step maybe to add a config for the serializer/deseriazer of table metadata to control this behavior. Config file is just one approach to init the config, we should decouple these two things.

s-akhtar-baig commented 1 week ago

@liurenjie1024, I have pushed some changes to have config values for TableMetadata in one place. Please let me know if the direction is right and/or if you had a different idea in mind. I will handle reviews on the tests once I have your feedback on the config changes.

For now, I am using the environment to load these values but future work involves loading from a config file on top of that.

liurenjie1024 commented 5 days ago

@liurenjie1024, I have pushed some changes to have config values for TableMetadata in one place. Please let me know if the direction is right and/or if you had a different idea in mind. I will handle reviews on the tests once I have your feedback on the config changes.

For now, I am using the environment to load these values but future work involves loading from a config file on top of that.

Hi, @s-akhtar-baig Thanks for the contribution. I have little concern about current approach because it seems not extensible to me. How about this:


pub struct TableMetadataParser {
   use_legacy_id: bool,
}

impl TableMetadataParser {
  pub async fn write(&self, output_file: OutputFile, table_metadata: &TableMetadata) {
   ....
 }

pub async fn read(&self, input_file: InputFile) -> Result<TableMetadata> {
  ....
}
}

Then instead of directly ser/de table metadata, we will control behavior using TableMetadataParser, what do you think?

s-akhtar-baig commented 5 days ago

@liurenjie1024, I see. It makes sense to me and I will make changes accordingly. Thank you for adding your feedback and providing sample code!