
Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Define behavior of gc.enabled and location ownership #4159

Open aokolnychyi opened 2 years ago

aokolnychyi commented 2 years ago

The question about location ownership and file removal comes up in a lot of discussions. See here for an example.

Right now, our interpretation of gc.enabled is not consistent.

I think we should make the behavior consistent. My initial thoughts are below.

  1. One way to interpret disabled garbage collection is to disallow removal of data and delete files. For instance, it should not be possible to expire snapshots.
  2. We should add a list of location prefixes that are owned by the table to our metadata. Until that is done, we can use gc.enabled to prohibit dangerous actions. For example, DeleteOrphanFiles should throw an exception if garbage collection is disabled. Once we know which locations are owned by the table, we can reconsider that check.

Here is a list of places that may physically remove files.

Expire snapshots

It should not be possible to expire snapshots if gc.enabled is false.

Delete orphan files

For now, we should continue to throw an exception if gc.enabled is false. Once we know what prefixes are owned by the table, we can allow removal of orphan files in locations that are owned by the table.

Delete reachable files

It shouldn't be possible to delete data and delete files if garbage collection is disabled. However, we may consider allowing removal of metadata when gc.enabled is false. One may argue that metadata files are always owned by the table. We should also make our action configurable so that it can delete only data or metadata files.

Drop and purge tables

I think it should match the removal of reachable files and be consistent across all APIs. Once we know which locations are owned by the table, we may drop those locations too.
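The operations above could all share a single guard. A minimal sketch in plain Python (not the Iceberg implementation; only the property name `gc.enabled` and its default of true come from the table properties, the helper names are hypothetical):

```python
GC_ENABLED = "gc.enabled"        # real Iceberg table property
GC_ENABLED_DEFAULT = "true"      # garbage collection is enabled by default

def gc_enabled(table_properties):
    # Table properties are stored as strings in table metadata.
    return table_properties.get(GC_ENABLED, GC_ENABLED_DEFAULT).lower() == "true"

def check_gc_enabled(table_properties, operation):
    # Fail fast before any files are physically removed.
    if not gc_enabled(table_properties):
        raise ValueError(
            f"Cannot {operation}: GC is disabled (removing files may "
            "corrupt other tables that share this location)")

# Expire snapshots, delete orphan files, and delete reachable files
# would all call the same guard before deleting anything:
check_gc_enabled({}, "expire snapshots")  # default is enabled, so this passes
```

The point of centralizing the check is that every action that physically removes data or delete files inherits the same interpretation of gc.enabled, instead of each action deciding on its own.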

cc @jackye1995 @rdblue @pvary @RussellSpitzer @flyrain @szehon-ho @danielcweeks @karuppayya @SinghAsDev

rdblue commented 2 years ago

FYI @nastra and @rymurr also.

rdblue commented 2 years ago

I agree with the approach of defining gc.enabled behavior more clearly and adding owned locations. If a file is in an owned location, then the table should always be able to delete it (though it may not choose to based on catalog policy).

aokolnychyi commented 2 years ago

cc @openinx @stevenzwu

aokolnychyi commented 2 years ago

If a file is in an owned location, then the table should always be able to delete it

I agree with that, but I'd phrase it slightly differently. I see it more as a consequence. I'd say gc.enabled should control whether we can delete data and delete files (i.e. expire snapshots). Metadata files are always considered owned and can be deleted during table purge. A list of owned locations indicates which locations can be scanned for orphan files and which locations can be deleted during table purge or removal of reachable files.

That being said, if gc.enabled is true and a data file is in a shared location not owned by the table, I think we should still allow its removal. We can't just drop the entire location, though.
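That distinction could be sketched as two separate checks (hypothetical helpers, names illustrative): removing an individual referenced file only requires GC to be enabled, while bulk-dropping a whole prefix additionally requires ownership.

```python
def is_under(path, prefix):
    # True if path lives under the given location prefix.
    return path.startswith(prefix.rstrip("/") + "/")

def can_delete_file(gc_is_enabled):
    # A data or delete file referenced by table metadata may be removed
    # whenever GC is enabled, even if it sits in a shared location.
    return gc_is_enabled

def can_drop_location(prefix, owned_prefixes):
    # Dropping an entire prefix (e.g. during table purge) is only safe
    # for locations the table owns; a shared location must never be
    # bulk-deleted because other tables may store files there.
    return any(prefix.rstrip("/") == p.rstrip("/") or is_under(prefix, p)
               for p in owned_prefixes)
```

Under this split, a table with data in a shared bucket can still clean up its own files one by one, but purge can only wipe prefixes that appear in the owned list.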

amogh-jahagirdar commented 2 years ago

For expire snapshots, I think it's possible that we do not delete any manifest or data files in the case that those files are still referenced by later snapshots. If we don't allow expiring snapshots when gc.enabled is false, we would prevent users from performing the metadata update to remove snapshots which may be limiting.

Should we differentiate between performing the metadata operation to expire snapshots and performing the subsequent deletion of files that became unreachable due to the expiration? Tbh I can't really think of a solid use case where someone would want this differentiation for expiring snapshots, but I wanted to bring it up while we're having this discussion. Please let me know if I misunderstood something!

jackye1995 commented 2 years ago

Thanks for raising this thread! I was also planning to raise the same discussion but had some thoughts to finalize. I have been thinking about this question quite a lot recently, here are my current thoughts:

1. gc.enabled

I think as of today almost all the Iceberg users who tune this parameter interpret gc.enabled=false as "disallow removal of data and delete files", as @aokolnychyi suggested, so it only makes sense for us to throw an exception in every place that deletes data of a table if gc.enabled is false.

I think there are only 2 cases we need to consider:

  1. remove reachable files, which includes
    1. purge table
    2. expire snapshot
    3. any file deletion based on the Iceberg metadata tree
  2. remove orphan files

For 1, we should follow the definition of gc.enabled and throw in all cases. For 2, see section 3; I think it is not really a table-related operation.

2. Remove metadata files by config

The current behavior in CatalogUtil of keeping data while removing metadata feels quite odd to me. If metadata is removed, maybe the resulting data is still useful and can be reconstructed as a Hive table; but when object storage mode is enabled, it's basically not possible to track down the file locations, making everything just orphan files. @aokolnychyi you said:

However, we may consider allowing removal of metadata when gc.enabled is false. One may argue that metadata files are always owned by the table. We should also make our action configurable so that it can delete only data or metadata files.

Could you provide some use cases where this is useful in addition to the recovery as a Hive table?

3. Table location ownership

At first glance, defining a table prefix location is the best way forward, but the more I think about it, the more I realize that defining ownership for a set of location prefixes in a table is not really needed for the use cases we want to achieve. Here are the two big use cases I have considered so far:

remove orphan files

The fact that remove orphan files needs a root location definition seems to be a circular argument. We expose the action remove_orphan_files(table) with the assumption that table files are under the same root prefix, which works for Hive-like Iceberg tables. But, after all, orphan file removal is not a table operation but a storage operation. Once a file is orphaned, it no longer belongs to a table. We remove orphan files to save storage cost, not to make any aspect of the table better. We only remove by table because of an assumption made for Hive tables, and now we try to make the Iceberg spec work with it.

I think the correct way to run remove orphan files is to do it for the entire warehouse. I talked about my idea a bit in https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1645203709220099. Most storage services can provide a full listing of files, and it can be delivered much more efficiently than a ListFiles API, e.g. the S3 inventory list. And Iceberg provides an efficient way to query Iceberg file metadata through a system table. That means we can perform an efficient distributed join and find the orphan files of the entire warehouse. I think that's all that a data warehouse admin needs if we provide the feature.
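At its core, the warehouse-wide approach is an anti-join between the storage inventory and the union of every table's referenced files. A toy in-memory sketch (in practice this would be a distributed join over, e.g., an S3 inventory report and each table's files metadata; function and parameter names are illustrative):

```python
def find_warehouse_orphans(inventory, referenced_per_table):
    # inventory: full listing of file paths in the warehouse
    #            (e.g. from an S3 inventory report)
    # referenced_per_table: one set of referenced paths per table,
    #            as reported by each table's files metadata
    referenced = set()
    for table_files in referenced_per_table:
        referenced |= table_files
    # Anything in storage that no table references is an orphan candidate.
    return set(inventory) - referenced
```

With a distributed engine, the same shape becomes a left anti-join keyed on file path, which scales to billions of objects without listing per table.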

This is basically the VACUUM command without referring to a table. I think that's also what most managed data warehouse products on the market offer for storage cleanup. VACUUM table only makes sense for things like snapshot expiration and index cleanup, which do not rely on table root location mutual exclusion. If the warehouse spans multiple storages, you just run it for each. At least we plan to propose and provide such a feature for S3; for other storages, I think it's also not hard to provide an implementation once the interface is solidified.

table access permission

Another use case for a table root location definition I can think of is table access control. An admin might configure user access based on table locations in storage. However, file-path-based access control is just one storage-specific implementation, and it does not have to hold in order to support Iceberg table access management. For example, in S3, people can tag all the files a table writes and control access to them using an S3 bucket policy. This allows multiple Iceberg tables to store files under the same bucket while access control stays intact.
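As an illustration of the tag-based approach (the bucket name, role ARN, tag key, and tag value below are all hypothetical; Iceberg's S3FileIO can attach tags at write time via `s3.write.tags.<key>` properties), a bucket policy could grant read access only to objects carrying a given table's tag:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AllowReadOfOneTablesObjects",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::123456789012:role/analysts"},
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::shared-warehouse-bucket/*",
    "Condition": {
      "StringEquals": {"s3:ExistingObjectTag/iceberg-table": "db.events"}
    }
  }]
}
```

Here access follows the tag rather than the key prefix, so tables can freely interleave objects under one bucket.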

Therefore, from a security perspective, table-root-path-based file ownership should just be one type of ownership mode.

The problem of declaring table location ownership

Based on customer usage reports for S3 object storage mode, most people still want to store data files in the same shared root bucket to minimize throttling, as long as the two issues I describe above can be solved.

If an exclusive root location has to be declared for each table, then the files of a table have to share a certain part of the S3 prefix, and the throttling issue comes back: a table not accessed for a year is guaranteed to be in a cold S3 partition and will be throttled heavily when new traffic comes.

Maybe S3 can solve the cold partition issue in the future with some keep-warm feature, but it will surely come with a pricing cost, and my biggest concern is that people will start to build tools that only work for tables with declared root locations (we already see this in Trino), which creates differences in table behavior at the spec level that optimize for a certain storage layout. I don't know what the future of storage is; maybe a new product will come to the market with a completely different access pattern. Currently, Iceberg's minimal storage feature requirements make it very easy to integrate with any storage; adding the concept of prefix ownership basically means preferring file-system-like storage.

There are also some areas where I have not finalized my thoughts yet. For example, when data is replicated, should the replicated locations, quick-access-layer locations, and archival locations also be declared as owned in Iceberg, and how should we provide tools to continuously track those aspects? This feels to me like too many storage implementation details to handle at the table spec layer.

I think it would be better to delegate these issues to FileIOs, as that is the actual vendor integration layer provided by Iceberg, where people choose which storage fits their bill, and storage frameworks and products can compete under the same set of rules. At least all the features I described for S3 are planned to be added, and I think there are GCS and HDFS equivalents that people can implement if they feel the need to reach feature parity.

One valuable thing to add to the Iceberg spec is the list (or set?) of all the table locations used. I think that could be used by a specific storage to do whatever is needed based on that information, such as removing all data in all directories. But in general we should be cautious about saying that all those locations are owned by a table.

aokolnychyi commented 2 years ago

Sorry, was distracted. Let me catch up.

aokolnychyi commented 2 years ago

The current behavior in CatalogUtil of keeping data while removing metadata feels quite odd to me. If metadata is removed, maybe the resulting data is still useful and can be reconstructed as a Hive table; but when object storage mode is enabled, it's basically not possible to track down the file locations, making everything just orphan files.

That's generally true, but there is a valid use case for keeping data files: you may use the SNAPSHOT command in Spark, which creates an Iceberg table pointing to non-Iceberg files. Those referenced data files may belong to other production Hive tables. We allow creating Iceberg metadata to play with that data, but we can't remove the imported data files, as that may corrupt the original tables.

I think the correct way to run remove orphan files is to do it for the entire warehouse.

I hear your point and I think both use cases are valid. You are right: if a single prefix is shared by many tables and object store locations are enabled, the only way to remove orphans is to get a list of all files under that prefix and query the system tables of all Iceberg tables. However, I am not sure that's always possible. You have to assume no extra jobs use that location, that you know all metastores/catalogs, etc. I'd say having a short per-table prefix or a set of prefixes would also be quite common. I can be convinced otherwise.

I'd be interested to hear more from other folks too.

ConeyLiu commented 2 years ago

As @aokolnychyi suggested in 3056, we use DeleteReachableFiles to purge table data, which provides much better scalability and performance. However, there are still some drawbacks to consider:

  1. Different catalogs have different implementations for drop table. For example, HadoopCatalog/HadoopTables deletes the whole warehouse directly and ignores the purge argument. In this case, we could not use DeleteReachableFiles.
  2. A user's own catalog may have customized features, such as sending events/metrics when purging data. With DeleteReachableFiles, those operations would be skipped.

I think it should match the removal of reachable files and be consistent in all APIs. Once we know locations owned by the table, we may drop them too.

I think this is necessary. We should unify the built-in catalogs' behavior for drop table [purge], and we may need to define an interface to support parallel operations (by leveraging a distributed engine such as Spark or Flink).
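One possible shape for that unification, as a hypothetical sketch (not an Iceberg interface; class and method names are illustrative): every built-in catalog drops its table pointer the same way and funnels purge through one shared delete-reachable-files path, which a subclass could delegate to a distributed engine.

```python
class BaseCatalog:
    """Hypothetical base class: the catalog pointer is always removed the
    same way, and purge always goes through one shared deletion path."""

    def __init__(self):
        self._tables = {}   # identifier -> metadata location
        self.purged = []    # record of metadata locations that were purged

    def create_table(self, identifier, metadata_location):
        self._tables[identifier] = metadata_location

    def drop_table(self, identifier, purge=False):
        # Step 1 (identical in all catalogs): remove the catalog pointer.
        metadata_location = self._tables.pop(identifier)
        if purge:
            # Step 2 (shared): same semantics as delete-reachable-files;
            # a subclass could delegate this to Spark/Flink for scale,
            # or emit events/metrics before deleting.
            self._purge_reachable_files(metadata_location)
        return metadata_location

    def _purge_reachable_files(self, metadata_location):
        # Stand-in for walking the metadata tree and deleting files.
        self.purged.append(metadata_location)
```

A custom catalog would then override only the purge hook, which keeps drop semantics consistent while still allowing catalog-specific behavior.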

aokolnychyi commented 2 years ago

Well, I'd also love to see consistent behavior. However, path-based tables are quite special: there is no table pointer in the catalog. Any thoughts on what drop without purge may mean for them? Maybe we should only allow drop with purge = true for path-based tables? That way, we will have consistent behavior across catalogs but path-based tables won't support some functionality?

szehon-ho commented 2 years ago

I think the correct way to run remove orphan files is to do it for the entire warehouse.

I hear your point and I think both use cases are valid. You are right: if a single prefix is shared by many tables and object store locations are enabled, the only way to remove orphans is to get a list of all files under that prefix and query the system tables of all Iceberg tables. However, I am not sure that's always possible. You have to assume no extra jobs use that location, that you know all metastores/catalogs, etc. I'd say having a short per-table prefix or a set of prefixes would also be quite common. I can be convinced otherwise.

This is an interesting idea I also considered at one point: have the option to take the whole listing of an S3-based storage bucket, filter for the locations owned by known tables, and run remove orphans on those locations. If you know ahead of time that your bucket holds mostly Iceberg table files, it might be less expensive overall (though this sounds like a single monster job if we are bottlenecked by the speed of physically deleting the files). Another consideration is that a system like S3 inventory will be stale on the order of hours or days.
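The staleness concern suggests any warehouse-wide cleanup would need the same older-than safety margin the per-table orphan action uses: only files last modified safely before the inventory snapshot should be deletion candidates. A toy sketch (function and parameter names hypothetical):

```python
def safe_to_delete(orphans, last_modified, cutoff):
    # orphans: paths the inventory anti-join flagged as unreferenced
    # last_modified: path -> modification time (epoch seconds)
    # cutoff: keep anything newer than this, so files written after the
    # (possibly days-old) inventory report was produced are never deleted
    return {p for p in orphans if last_modified[p] < cutoff}
```

The cutoff should trail the inventory timestamp by at least the inventory's staleness window, plus a buffer for in-flight commits.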

Though it still seems to come back to the main question: how do we define the "owned locations" of a table? (Sorry, please let me know if there is already a discussion on it that I missed.) That would be great to have. One thing I struggled with in the past is "alter table location"; if the previous locations could somehow be saved in a growing list, it would allow running more complete orphan removal jobs.

ConeyLiu commented 2 years ago

Well, I'd also love to see consistent behavior. However, path-based tables are quite special: there is no table pointer in the catalog. Any thoughts on what drop without purge may mean for them? Maybe we should only allow drop with purge = true for path-based tables? That way, we will have consistent behavior across catalogs but path-based tables won't support some functionality?

I think drop and purge should mean the same thing for path-based tables. In a traditional database, drop moves the table into a recycle bin from which it can be restored with a command, while purge cleans all data and cannot be undone. Maybe we should refer to such a design?

anuragmantri commented 2 years ago

Thanks for this thread.

One valuable thing to add in the Iceberg spec is the list (or set?) of all the table locations used.

This makes sense but we should define what this list means. There are some questions that come to my mind.

Incidentally, most of these questions also came up in the dev discussion on relative paths/DR strategy. Support for a single location prefix will be added as part of Phase 1. If you agree, maybe I can decouple location prefix support from relative paths and create a PR just for that change first.

Supporting multiple roots is mentioned in Phase 2 of this design doc. The implementation is yet to be discussed. It may be a good idea to discuss it here.

Coming back to the problem: assuming we have such a list of owned table locations, how would we make sure all the files in these locations are indeed owned by the Iceberg table? In other words, can we safely delete all files in these locations?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 1 year ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.