Support client-side purge in REST catalog

flyrain commented 3 months ago

Proposed Change

The current REST client relies on the REST server to delete table files when a table is dropped with purge. There are two concerns with this approach:

The REST server isn't necessarily able to access users' storage. It can't delete table files if it doesn't have permission.
The REST server may take a performance hit when purging a table with a large number of files.

I propose supporting client-side purging while still allowing server-side deletion, to stay compatible with the current behavior.

Option 1: put the purge state in the delete table response.

DeleteTableResponse:
  type: object
  properties:
    purged:
      type: boolean

Clients can decide whether to delete files based on the response. If the files were deleted on the server side, do nothing; otherwise, delete them on the client side.
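A minimal sketch of the client logic under Option 1, written in Python for brevity. Only the `purged` field comes from the proposed schema; `drop_table`, `send_delete`, and `delete_files` are hypothetical names, not part of any existing client API. Treating a missing `purged` field as "the server handled it" is an assumption made here to stay compatible with current servers.

```python
from typing import Callable, Dict, Iterable, List


def drop_table(
    send_delete: Callable[[bool], Dict[str, object]],
    table_files: Iterable[str],
    delete_files: Callable[[List[str]], None],
    purge: bool = True,
) -> bool:
    """Drop a table and purge its files on the client only if the server did not.

    Returns True when the client performed the purge itself.
    """
    response = send_delete(purge)
    if not purge:
        # The caller did not ask for a purge, so leave the files alone.
        return False
    # Assumption: an absent `purged` field means an older server that purges itself.
    if response.get("purged", True):
        return False
    delete_files(list(table_files))
    return True
```

In this sketch the file list is passed in by the caller, since it would likely have to be collected before the delete request is sent.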

Option 2: check for the existence of table files on the client side.

The client can check whether the files still exist and then decide whether to delete them. This needs no spec changes, but clients would rely on a convention instead of the spec, which is a bit ambiguous.
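A rough sketch of what Option 2 could look like on the client, in Python. The function name and the `exists` / `delete` callbacks are hypothetical stand-ins for whatever file access the client has; nothing here is an existing API.

```python
from typing import Callable, Iterable, List


def purge_remaining_files(
    table_files: Iterable[str],
    exists: Callable[[str], bool],
    delete: Callable[[str], None],
) -> List[str]:
    """Delete whichever table files still exist after the drop request returned."""
    removed = []
    for path in table_files:
        # Convention, not spec: a file that still exists was not purged by the server.
        if exists(path):
            delete(path)
            removed.append(path)
    return removed
```

The ambiguity shows up in the comment inside the loop: a file that is still present only tells the client that it has not been deleted yet, not whether the server intends to delete it.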

WDYT? Please share your feedback.

cc @RussellSpitzer @aokolnychyi @rdblue @danielcweeks @Fokko

Proposal document

No response

Specifications

[ ] Table
[ ] View
[X] REST
[ ] Puffin
[ ] Encryption
[ ] Other

danielcweeks commented 3 months ago

@flyrain I'm a little confused: how can the REST server not have access to the files? Currently the server needs access to at least the metadata files. Are you considering a situation where data files and metadata files are protected separately?

The way we've been thinking about REST puts the responsibility of the delete on the server (the client shouldn't be responsible for how or when the delete happens).

flyrain commented 3 months ago

That's right. In our case, the REST server cannot access every table file, for the following reasons:

The REST catalog, like any other catalog, isn't allowed to access users' data due to the security policy; metadata access is fine.
Some Iceberg tables are in HDFS with Kerberos, which makes them pretty hard to access from a centralized server.

We still write metadata.json files, but they are located in server-side storage instead of users' table storage. I understand this use case is a bit different from the one the REST catalog was introduced for, but I believe it is a valid use case, and we can extend the scope of the REST catalog a bit more to support it. cc @RussellSpitzer

rdblue commented 2 weeks ago

@flyrain, I think your use case makes sense and that we should support some version of client-side purge. That said, I don't think that either option proposed here is the right solution. The problem with both is that they assume the purge needs to happen immediately, which isn't necessarily the case.

There's a lot of confusion about purging because in Hive there was no background process to clean up tables and file ownership wasn't clear. As a result, purge has conflicting meanings. It could be either that the table data is sensitive and needs to be deleted immediately, or it could be used to indicate that the data is owned by the table and should be cleaned up rather than left sitting in storage indefinitely. To make this worse, defaults are based on the second and more common interpretation: Iceberg's dropTable(Identifier) calls dropTable(identifier, true /* purge */) in the default implementation.
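To illustrate why the defaulted flag is a problem, here is a toy Python sketch. `RecordingCatalog` is a hypothetical class that only mirrors the default described above; it is not Iceberg's implementation.

```python
from typing import List, Tuple


class RecordingCatalog:
    """Toy catalog that records the purge flag it receives for each drop."""

    def __init__(self) -> None:
        self.received: List[Tuple[str, bool]] = []

    def drop_table(self, identifier: str, purge: bool = True) -> None:
        # Mirrors the default described above: omitting `purge` means purge=True.
        self.received.append((identifier, purge))


catalog = RecordingCatalog()
catalog.drop_table("db.events")        # caller relied on the default
catalog.drop_table("db.events", True)  # caller asked for purge explicitly
```

Both calls record the same `("db.events", True)` entry, so whatever receives the request cannot tell a deliberate purge from a defaulted one.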

I want to avoid a case where we have purge-by-default trigger client behavior to actually delete files because catalogs can have much better handling now. For instance, our catalog will keep tables around for a few days that can be restored in case of accidental deletes. In that case, purge uses the first definition and if a client deleted all of the files immediately it would be a problem. We also have to ignore the client-side purge flag because we don't know whether it was defaulted or not.

To solve this, what about adding a config default property that can be sent back by the service? Then all you'd need to do is send a config to the client to tell it to purge tables itself because the service can't. Would that work for your case?
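A sketch of how that could look on the client, in Python. It assumes the usual REST catalog config merge order (server defaults, then client properties, then server overrides). The property name `client-side-purge-enabled` and both function names are hypothetical; the thread does not settle on a name.

```python
from typing import Dict, Mapping

# Hypothetical property name; the thread does not settle on one.
CLIENT_PURGE_PROPERTY = "client-side-purge-enabled"


def merge_config(
    client_props: Mapping[str, str],
    server_defaults: Mapping[str, str],
    server_overrides: Mapping[str, str],
) -> Dict[str, str]:
    """Merge catalog config: server defaults < client properties < server overrides."""
    merged = dict(server_defaults)
    merged.update(client_props)
    merged.update(server_overrides)
    return merged


def client_should_purge(config: Mapping[str, str]) -> bool:
    """Purge on the client only when the merged config explicitly says so."""
    return config.get(CLIENT_PURGE_PROPERTY, "false").lower() == "true"
```

With this shape, a service that cannot reach table storage would send the property as a default, and a service that purges on its own would simply not send it, leaving client behavior unchanged.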

flyrain commented 6 days ago

Having a config to describe the server's capability sounds like a good idea, although I think this use case could be resolved in a different way.

> our catalog will keep tables around for a few days that can be restored in case of accidental deletes.

Can we distinguish immediate deletion from soft deletion (putting a table in a trash can) more explicitly? Users may need to be aware of the difference. The current solution seems a bit ambiguous, because users don't know whether the server actually deletes immediately or not (it depends entirely on the implementation). This is not OK when users have to delete a table together with its data immediately for compliance. I understand the default dropTable(Identifier) purges. Does it make sense to introduce a new method for soft deletion, so that users can invoke it explicitly?
