Open pudidic opened 5 months ago
I think initially it was designed that ManifestFiles
will be local to iceberg-core
module and we can have a public API's around it that are exposed to other modules/users.
While working on partition stats at iceberg-data
module, I had to add some functionality in iceberg-core
module to transform manifests and use it in iceberg-data
module. But in your case, if it is a custom transformation (not required for other iceberg modules), it makes sense to expose ManifestFiles
.
So, I am generally +1 on this idea. Lets see what others think.
I'm reluctant to make it part of the public API. Mostly because you need to have a good understanding of Iceberg to know what you're doing, and probably it is better to go through the Scan API. The liveEntries()
is package private so you can still access it, you have to jump through a hoop.
I'm also a bit reluctant for this particular case, because this should be achievable through the existing public APIs.
@pudidic If you want to read the entries wouldn't using the ManifestFiles.read() API be sufficient? Then you could iterate over the data files via the iterator
API, and when copying preserve all the details you mentioned (like sequence number) by using the data files copy
API.
Let me know if I'm missing something but at least for the case you're describing, that should be able to be done without modifying the public APIs.
I need to rewrite the files with manifest entry status with details. The details include data file numbers, sequence numbers, and snapshot number, which are not in the DataFile or DeleteFile instances. I need to access ManifestEntry interface to get those data.
A positional delete works with existing data files. It requires its data manifest entry with existing status. If the data manifest entry has added status, then the positional delete cannot mark the row as deleted.
Another PR that requires it to be public: https://github.com/apache/iceberg/pull/10024/files#diff-2cb5b09259db131b4d85059ab8b42a9f3bdfd233bd565539cd78c7f485b44b5dR27
Hello,
I'm working on Apache Iceberg replication product. It transforms ManifestFiles on the destination, but it needs to keep the status, data file number, sequence number, and other details. To read the details, I need to access the ManifestEntry. Can we make it public?
Sincerely yours, Teddy Choi