aokolnychyi opened 2 years ago
FYI @nastra and @rymurr also.
I agree with the approach of defining gc.enabled behavior more clearly and adding owned locations. If a file is in an owned location, then the table should always be able to delete it (though it may not choose to based on catalog policy).
cc @openinx @stevenzwu
If a file is in an owned location, then the table should always be able to delete it
I agree with that but I'd phrase it slightly differently. I see it more as a consequence. I'd say gc.enabled should control whether we can delete data and delete files (i.e. expire snapshots). Metadata files are considered always owned and can be deleted during table purge. A list of owned locations indicates what locations can be scanned for orphan files and what locations can be deleted during table purge or removal of reachable files.
That being said, if gc.enabled is true and a data file is in a shared location not owned by the table, I think we should still allow its removal. We can't just drop the entire location, though.
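To make the distinction above concrete, here is a plain-Python sketch of the proposed semantics. This is illustrative only, not Iceberg's actual implementation; the function names and the owned-locations parameter are hypothetical. It models the rule that an individual file can be deleted whenever GC is enabled, while a whole location can only be dropped if the table owns that prefix:

```python
# Illustrative sketch of the proposed gc.enabled / owned-location semantics.
# Names are hypothetical, not real Iceberg APIs.

def can_delete_file(table_props, owned_locations, file_path):
    """A single known file may be deleted whenever GC is enabled,
    even if it lives in a shared location the table does not own."""
    return table_props.get("gc.enabled", "true") == "true"

def can_drop_location(table_props, owned_locations, location):
    """A whole location may only be dropped if GC is enabled AND
    the table owns that location prefix."""
    if table_props.get("gc.enabled", "true") != "true":
        return False
    return any(location.startswith(owned) for owned in owned_locations)

props = {"gc.enabled": "true"}
owned = ["s3://bucket/warehouse/db/tbl/"]

# Deleting one data file in a shared prefix is allowed...
assert can_delete_file(props, owned, "s3://bucket/shared/data-0001.parquet")
# ...but dropping the entire shared prefix is not.
assert not can_drop_location(props, owned, "s3://bucket/shared/")
# The owned prefix can be dropped during table purge.
assert can_drop_location(props, owned, "s3://bucket/warehouse/db/tbl/data/")
```

With gc.enabled set to false, both checks refuse, which matches the "disallow removal of data and delete files" interpretation.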
For expire snapshots, I think it's possible that we delete no manifest or data files at all, in the case where those files are still referenced by later snapshots. If we don't allow expiring snapshots when gc.enabled is false, we would prevent users from performing the metadata update to remove snapshots, which may be limiting.
Should we differentiate between performing the metadata operation to expire snapshots and performing the subsequent deletion of files made unreachable by the expiration? Tbh I can't really think of a solid use case where someone would want this differentiation for expiring snapshots, but just bringing it up while we're having this discussion. Please let me know if I misunderstood something!
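The differentiation in question can be sketched in plain Python (hypothetical model, not Iceberg's API): expiring snapshots is a pure metadata update, and the follow-up cleanup only deletes files that no retained snapshot still references:

```python
# Hypothetical sketch: splitting "expire snapshots" (a metadata-only update)
# from the follow-up deletion of files made unreachable by the expiration.

def expire_snapshots(snapshots, expired_ids):
    """Metadata operation: drop expired snapshots from the table metadata."""
    return {sid: files for sid, files in snapshots.items()
            if sid not in expired_ids}

def files_to_delete(snapshots, retained):
    """Physical cleanup: files referenced only by expired snapshots."""
    expired_files, retained_files = set(), set()
    for sid, files in snapshots.items():
        (retained_files if sid in retained else expired_files).update(files)
    return expired_files - retained_files

snapshots = {
    1: {"data-a.parquet", "data-b.parquet"},
    2: {"data-b.parquet", "data-c.parquet"},  # snapshot 2 still needs data-b
}
retained = expire_snapshots(snapshots, expired_ids={1})
# data-b is still reachable from snapshot 2, so only data-a is deletable.
assert files_to_delete(snapshots, retained) == {"data-a.parquet"}
```

Under gc.enabled=false one could still run the first step (the metadata update) while skipping the second, which is exactly the separation being discussed.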
Thanks for raising this thread! I was also planning to raise the same discussion but had some thoughts to finalize. I have been thinking about this question quite a lot recently, here are my current thoughts:
I think as of today almost all the Iceberg users that tune this parameter interpret gc.enabled=false as "disallow removal of data and delete files", as @aokolnychyi suggested, so it only makes sense for us to throw an exception in every place that deletes data of a table if gc.enabled is false.
I think there are only 2 cases we need to consider:
For 1, we should follow the definition of gc.enabled and throw for all cases.
For 2, see section 3; I think this is not really a table-related operation.
The current behavior in CatalogUtil of keeping data while removing metadata feels quite odd to me. If metadata is removed, maybe the resulting data is still useful and can be reconstructed as a Hive table, but when object storage mode is enabled, it's basically not possible to track down the file locations, making everything just orphan files. @aokolnychyi you said:
However, we may consider allowing removal of metadata when gc.enabled is false. One may argue that metadata files are always owned by the table. We should also make our action configurable so that it can delete only data or metadata files.
Could you provide some use cases where this is useful in addition to the recovery as a Hive table?
At first glance, defining a table prefix location is the best way to move forward, but when I think more, I start to realize that defining ownership for a set of location prefixes in a table is not really needed for the use cases we want to achieve. Here are the 2 big use cases I considered so far:
The fact that remove orphan files needs a root location definition seems to be a circular argument. We expose the action remove_orphan_files(table) with the assumption that table files are under the same root prefix, which works for Hive-like Iceberg tables. But after all, orphan file removal is not a table operation, but a storage operation. Once a file is orphaned, it no longer belongs to a table. We remove orphan files to save storage cost, not to make any aspect of the table better. We just remove them per table based on an assumption that holds for Hive tables, and now we try to make the Iceberg spec work with it.
I think the correct way to run remove orphan files is to do it for the entire warehouse. I talked about my idea a bit in https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1645203709220099. Most storage services can provide a full listing of files, and it can be delivered much more efficiently than a ListFiles API, e.g. the S3 inventory list. And Iceberg provides an efficient way to query Iceberg file metadata through system tables. That means we can perform an efficient distributed join and find the orphan files of the entire warehouse. I think that's all a data warehouse admin needs if we provide the feature.
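The core of that distributed join can be sketched in a few lines of plain Python. This is a stand-in for the real thing (in practice the listing would come from something like an S3 inventory report and the referenced files from each table's files metadata table, joined in a distributed engine); the paths below are made up for illustration:

```python
# Sketch: warehouse-wide orphan detection as an anti-join between a full
# storage listing and the union of files referenced by all known tables.
# In production this would be a distributed join, not an in-memory set op.

def find_orphans(inventory_listing, referenced_files_by_table):
    """Return files present in storage but referenced by no table."""
    referenced = set()
    for table_files in referenced_files_by_table.values():
        referenced.update(table_files)
    return sorted(set(inventory_listing) - referenced)

inventory = [
    "s3://warehouse/ab12/t1/data-001.parquet",
    "s3://warehouse/cd34/t2/data-002.parquet",
    "s3://warehouse/ef56/t1/data-old.parquet",  # referenced by no table
]
referenced = {
    "t1": ["s3://warehouse/ab12/t1/data-001.parquet"],
    "t2": ["s3://warehouse/cd34/t2/data-002.parquet"],
}
print(find_orphans(inventory, referenced))
# → ['s3://warehouse/ef56/t1/data-old.parquet']
```

Note that this works without any per-table root prefix: the join is keyed on full file paths, so tables can share a bucket freely.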
This is basically talking about the VACUUM command without referring to a table. I think that's also what most managed data warehouse products on the market offer for storage cleanup. VACUUM table only makes sense for things like snapshot expiration and index cleanup, which do not rely on table root location mutual exclusion. If it's across multiple storage systems, you just do it for each. At least we will start to propose and provide such a feature for S3; for the other storage systems I think it's also not hard to provide an implementation once the interface is solidified.
Another use case of a table root location definition I can think of is table access control. An admin might configure user access based on the table locations in storage. However, using file path access control to achieve that is just a storage-specific implementation which does not necessarily need to hold to support Iceberg table access management. For example, in S3 people can tag all the files they write and control access to them using an S3 bucket policy. This allows multiple Iceberg tables to store files under the same bucket while access control is still intact.
Therefore, from a security perspective, table-root-path-based file ownership should just be one type of ownership mode.
From S3 object storage mode usage reports from customers, most people still want to store data files in the same shared root bucket to minimize throttling, as long as the above 2 issues I describe can be solved.
If an exclusive root location has to be declared for each table, then files of the table have to share a certain part of the S3 prefix, and the throttling issue comes back: a table not accessed for a year is guaranteed to be in a cold S3 partition and get throttled heavily when new traffic comes.
Maybe S3 can in the future solve the cold partition issue with some keep-warm feature, but that will surely come at a pricing cost, and my biggest concern is that people will start to build tools that only work for tables with declared root locations (we already see this in Trino), which creates differences in table behavior at the spec level that optimize for a certain storage layout. I don't know what the future of storage looks like; maybe a new product will come to market with a completely different access pattern. Currently Iceberg's minimal storage feature requirement makes it very resilient when integrating with any storage; adding the concept of prefix ownership basically prefers a file-system-like storage.
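The throttling argument follows from how object storage mode lays out keys: a hash component is injected into each key so that files of one table spread across many S3 partitions, which an exclusive per-table root prefix would undo. A simplified plain-Python sketch of the idea (not the exact Iceberg key scheme):

```python
import hashlib

# Simplified sketch of the object-store layout idea: a hash component is
# injected into each key so writes spread across many S3 prefixes instead
# of concentrating under one exclusive per-table root. This is NOT the
# exact key scheme Iceberg uses; it only illustrates the access pattern.
def object_store_key(warehouse, table, relative_path):
    digest = hashlib.md5(f"{table}/{relative_path}".encode()).hexdigest()[:8]
    return f"{warehouse}/{digest}/{table}/{relative_path}"

k1 = object_store_key("s3://bucket/wh", "db.tbl", "data-0001.parquet")
k2 = object_store_key("s3://bucket/wh", "db.tbl", "data-0002.parquet")

# Two files of the same table share no common prefix beyond the warehouse
# root, which is exactly the property an exclusive per-table root location
# would take away.
assert k1.split("/")[4] != k2.split("/")[4]
```

Under this layout a cold table's files are scattered across prefixes shared with hot tables, so no single partition goes cold.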
There are also some areas where I have not finalized my thoughts yet, such as: when data is replicated, should the replicated locations, quick-access-layer locations, and archival locations also be declared as owned in Iceberg, and how should we provide tools to continuously track those aspects? This feels to me like too many storage implementation details to handle at the table spec layer.
I think it would be better to delegate those issues to FileIOs, as that is the actual vendor integration layer provided by Iceberg, where people choose which storage fits their bill, and where storage frameworks and products can compete under the same set of rules. At least all the features I described for S3 are planned to be added, and I think there are GCS and HDFS equivalents that people can implement if they feel the need to reach feature parity.
One valuable thing to add to the Iceberg spec is the list (or set?) of all the table locations used. That could be used by a specific storage to do whatever is needed based on the information, such as removing all data in all directories. But in general we should be cautious about saying that all the locations are owned by a table.
Sorry, was distracted. Let me catch up.
The current behavior in CatalogUtil of keeping data while removing metadata feels quite odd to me. If metadata is removed maybe the resulting data is still useful and can be reconstructed as a Hive table, but when object storage mode is enabled, it's basically not possible to track down the file locations, making everything just orphan files.
That's generally true, but there is a valid use case for keeping data files: you may use the SNAPSHOT command in Spark, which creates an Iceberg table pointing to non-Iceberg files. Those referenced data files may belong to other prod Hive tables. We allow creating Iceberg metadata to play with that data, but we can't remove the imported data files, as that may corrupt the original tables.
I think the correct way to run remove orphan files is to do it for the entire warehouse.
I hear your point and I think both use cases are valid. You are right, if a single prefix is shared by many tables and object store locations are enabled, the only way to remove orphans is by getting a list of all files under that prefix and querying system tables for all Iceberg tables. However, I am not sure that's always possible. You have to assume no extra jobs use that location, you know all metastores/catalogs, etc. I'd say having a short per-table prefix or a set of prefixes would also be quite common. I can be convinced otherwise.
I'd be interested to hear more from other folks too.
As @aokolnychyi suggested in 3056, we use DeleteReachableFiles to purge table data, which could provide much better scalability and performance. However, there are still some drawbacks to consider:
HadoopCatalog/HadoopTables delete the whole warehouse directly and ignore the purge argument. In this case, we could not use DeleteReachableFiles. For DeleteReachableFiles, we will ignore those operations.

I think it should match the removal of reachable files and be consistent in all APIs. Once we know locations owned by the table, we may drop them too.
I think this is necessary. We should unify the built-in catalogs' behavior of drop table [purge]. And maybe we need to define an interface to support parallel operations (by leveraging a distributed engine, such as Spark/Flink/more).
Well, I'd also love to see consistent behavior. However, path-based tables are quite special: there is no table pointer in the catalog. Any thoughts on what drop without purge may mean for them? Maybe we should only allow drop with purge = true for path-based tables? That way, we will have consistent behavior across catalogs but path-based tables won't support some functionality?
I think the correct way to run remove orphan files is to do it for the entire warehouse.
I hear your point and I think both use cases are valid. You are right, if a single prefix is shared by many tables and object store locations are enabled, the only way to remove orphans is by getting a list of all files under that prefix and querying system tables for all Iceberg tables. However, I am not sure that's always possible. You have to assume no extra jobs use that location, you know all metastores/catalogs, etc. I'd say having a short per-table prefix or a set of prefixes would also be quite common. I can be convinced otherwise.
This is an interesting idea I also considered at one point: have the option to take the whole listing of an S3-based storage bucket, filter it down to the locations owned by known tables, and run remove orphans on those locations. If you know ahead of time your bucket is mostly Iceberg table files, it might be less expensive overall (though this sounds like a single monster job if we are bottlenecked on the speed of physically deleting the files). Another consideration is that a system like S3 inventory will be stale on the order of hours or days.
Though it still seems to go back to the main question: how do we define the "owned locations" of a table? (Sorry, please let me know if there is already discussion on it that I missed.) That would be great to have. One thing I struggled with in the past is "alter table location": if the previous locations could somehow be saved in a growing list, it would allow running more complete orphan removal jobs.
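A minimal sketch of that idea, assuming a hypothetical metadata field that accumulates every root location the table has ever used (this field does not exist in the spec today; the name is made up):

```python
# Hypothetical sketch: track every root location a table has ever used so
# orphan removal can still cover old locations after "alter table location".
# The "previous-locations" field is illustrative; it is not in the spec.

def set_location(metadata, new_location):
    """Return updated table metadata, archiving the old root location."""
    previous = metadata.get("previous-locations", [])
    old = metadata.get("location")
    if old is not None and old not in previous:
        previous = previous + [old]
    return {**metadata, "location": new_location,
            "previous-locations": previous}

meta = {"location": "s3://bucket/wh/tbl_v1"}
meta = set_location(meta, "s3://bucket/wh/tbl_v2")
meta = set_location(meta, "s3://bucket/wh/tbl_v3")

assert meta["location"] == "s3://bucket/wh/tbl_v3"
assert meta["previous-locations"] == [
    "s3://bucket/wh/tbl_v1", "s3://bucket/wh/tbl_v2"]
```

An orphan removal job could then scan the union of the current location and all previous ones instead of only the latest root.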
Well, I'd also love to see consistent behavior. However, path-based tables are quite special: there is no table pointer in the catalog. Any thoughts on what drop without purge may mean for them? Maybe we should only allow drop with purge = true for path-based tables? That way, we will have consistent behavior across catalogs but path-based tables won't support some functionality?
drop and purge should mean the same thing for path-based tables. In a traditional database, drop moves the table into the recycle bin and the table can be restored with some command, whereas purge cleans all data and cannot be undone. Maybe we should refer to such a design?
Thanks for this thread.
One valuable thing to add in the Iceberg spec is the list (or set?) of all the table locations used.
This makes sense but we should define what this list means. There are some questions that come to my mind.
Incidentally, most of these questions also came up in the dev discussion on relative paths/DR strategy. Single location prefix support will be added as part of Phase 1. If you agree, maybe I can decouple location prefix support from relative paths and create a PR just for that change first.
Supporting multiple roots is mentioned in Phase 2 of this design doc. Implementation is yet to be discussed. It may be a good idea to discuss it here.
Coming back to the problem, assuming we have such a list of owned table locations, how would we make sure all the files in these locations are indeed owned by the Iceberg table? In other words, can we safely delete all files in these locations?
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
The question about location ownership and file removal comes up in a lot of discussions. See here for an example.
Right now, our interpretation of gc.enabled is not consistent:

- HadoopTables ignores it while purging files.
- CatalogUtil keeps data files if gc.enabled is false while purging files and removes all metadata files.
- DeleteReachableFiles and other actions throw an exception if gc.enabled is false.

I think we should make the behavior consistent. My initial thoughts are below.

For now, we should keep using gc.enabled to prohibit dangerous actions. For example, DeleteOrphanFiles should throw an exception if garbage collection is disabled. Once we know what locations are owned by the table, we can reconsider that check.

Here is a list of places that may physically remove files.
Expire snapshots

It should not be possible to expire snapshots if gc.enabled is false.

Delete orphan files

For now, we should continue to throw an exception if gc.enabled is false. Once we know what prefixes are owned by the table, we can allow removal of orphan files in locations that are owned by the table.

Delete reachable files

It shouldn't be possible to delete data and delete files if garbage collection is disabled. However, we may consider allowing removal of metadata when gc.enabled is false. One may argue that metadata files are always owned by the table. We should also make our action configurable so that it can delete only data or metadata files.

Drop and purge tables

I think it should match the removal of reachable files and be consistent in all APIs. Once we know locations owned by the table, we may drop them too.
cc @jackye1995 @rdblue @pvary @RussellSpitzer @flyrain @szehon-ho @danielcweeks @karuppayya @SinghAsDev