apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.51k stars 2.25k forks source link

Make ManifestEntry and ManifestReader.liveEntries() as public #10425

Open pudidic opened 5 months ago

pudidic commented 5 months ago

Hello,

I'm working on Apache Iceberg replication product. It transforms ManifestFiles on the destination, but it needs to keep the status, data file number, sequence number, and other details. To read the details, I need to access the ManifestEntry. Can we make it public?

Sincerely yours, Teddy Choi

ajantha-bhat commented 5 months ago

I think initially it was designed that ManifestFiles will be local to iceberg-core module and we can have a public API's around it that are exposed to other modules/users.

While working on partition stats at iceberg-data module, I had to add some functionality in iceberg-core module to transform manifests and use it in iceberg-data module. But in your case, if it is a custom transformation (not required for other iceberg modules), it makes sense to expose ManifestFiles.

So, I am generally +1 on this idea. Lets see what others think.

Fokko commented 5 months ago

I'm reluctant to make it part of the public API. Mostly because you need to have a good understanding of Iceberg to know what you're doing, and probably it is better to go through the Scan API. The liveEntries() is package private so you can still access it, you have to jump through a hoop.

amogh-jahagirdar commented 5 months ago

I'm also a bit reluctant for this particular case, because this should be achievable through the existing public APIs.

@pudidic If you want to read the entries wouldn't using the ManifestFiles.read() API be sufficient? Then you could iterate over the data files via the iterator API, and when copying preserve all the details you mentioned (like sequence number) by using the data files copy API.

Let me know if I'm missing something but at least for the case you're describing, that should be able to be done without modifying the public APIs.

pudidic commented 5 months ago

I need to rewrite the files with manifest entry status with details. The details include data file numbers, sequence numbers, and snapshot number, which are not in the DataFile or DeleteFile instances. I need to access ManifestEntry interface to get those data.

A positional delete works with existing data files. It requires its data manifest entry with existing status. If the data manifest entry has added status, then the positional delete cannot mark the row as deleted.

ajantha-bhat commented 5 months ago

Another PR that requires it to be public: https://github.com/apache/iceberg/pull/10024/files#diff-2cb5b09259db131b4d85059ab8b42a9f3bdfd233bd565539cd78c7f485b44b5dR27