Open flyrain opened 3 months ago
@flyrain I'm a little confused: how can the REST server not have access to the files? Currently the server needs access to at least the metadata files. Are you considering a situation where data files and metadata files are protected separately?
The way we've been thinking about REST puts the responsibility of the delete on the server (the client shouldn't be responsible for how or when the delete happens).
That's right. In our case, the REST server cannot access every table file for the following reasons:
We still write metadata.json files, but they are located in server-side storage instead of the users' table storage. I understand this use case is a bit different from the one the REST catalog was introduced for, but I believe it is a valid use case, and we can extend the scope of the REST catalog a bit to support it. cc @RussellSpitzer
@flyrain, I think your use case makes sense and that we should support some version of client-side purge. That said, I don't think that either option proposed here is the right solution. The problem is that both assume the purge needs to happen immediately, which isn't necessarily the case.
There's a lot of confusion about purging because in Hive there was no background process to clean up tables and file ownership wasn't clear. As a result, purge has conflicting meanings. It could be either that the table data is sensitive and needs to be deleted immediately, or it could be used to indicate that the data is owned by the table and should be cleaned up rather than left sitting in storage indefinitely. To make this worse, defaults are based on the second and more common interpretation: Iceberg's dropTable(Identifier) calls dropTable(identifier, true /* purge */) in the default implementation.
I want to avoid a case where we have purge-by-default trigger client behavior to actually delete files because catalogs can have much better handling now. For instance, our catalog will keep tables around for a few days that can be restored in case of accidental deletes. In that case, purge uses the first definition and if a client deleted all of the files immediately it would be a problem. We also have to ignore the client-side purge flag because we don't know whether it was defaulted or not.
To solve this, what about adding a config default property that can be sent back by the service? Then all you'd need to do is send a config to the client to tell it to purge tables itself because the service can't. Would that work for your case?
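As a rough illustration of the config idea above, here is a minimal Java sketch. It assumes the REST catalog's config response carries `defaults` and `overrides` maps that the client merges with its own properties (defaults first, then client properties, then overrides). The property name `client-side-purge-enabled` and the class `PurgeConfigSketch` are hypothetical and exist only for this example; nothing here is an agreed-upon spec change.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: shows how a server-sent config property could tell the
// client to purge table files itself.
final class PurgeConfigSketch {
    // Hypothetical property name, not part of the Iceberg REST spec.
    static final String CLIENT_PURGE = "client-side-purge-enabled";

    // Merge order: server defaults, then client properties, then server overrides.
    // Later maps win on key conflicts.
    static Map<String, String> merge(
            Map<String, String> defaults,
            Map<String, String> clientProps,
            Map<String, String> overrides) {
        Map<String, String> merged = new HashMap<>(defaults);
        merged.putAll(clientProps);
        merged.putAll(overrides);
        return merged;
    }

    // The client purges only when the merged config says so; absent means "server purges".
    static boolean clientShouldPurge(Map<String, String> merged) {
        return Boolean.parseBoolean(merged.getOrDefault(CLIENT_PURGE, "false"));
    }
}
```

With this shape, a service that cannot reach table files could send the property as a default (or as an override, if clients must not be able to turn it off), and services that handle deletion themselves would simply omit it.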
Having a config to describe the server's capability sounds like a good idea. Although, I think this use case could be resolved in a different way.
> our catalog will keep tables around for a few days that can be restored in case of accidental deletes.
Can we distinguish more explicitly between immediate deletion and soft deletion (putting a table in a trash can)? Users might have to be aware of the difference. The current solution seems a bit ambiguous because users don't know whether the server actually deletes immediately or not (it depends entirely on the implementation). This is not OK when users have to delete a table immediately, together with its data, for compliance. I understand that the default dropTable(Identifier) purges. Does it make sense to introduce a new method for soft deletion, so that users can invoke it explicitly?
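To make the question concrete, here is a small Java sketch of what an explicit soft-deletion method could look like next to the existing drop calls. Table identifiers are plain strings for brevity, and `softDropTable`, `restoreTable`, `SketchCatalog`, and `InMemoryCatalog` are all hypothetical names invented for this example; only the purge-by-default delegation mirrors the behavior described in this thread.

```java
import java.util.HashMap;
import java.util.Map;

final class SoftDropSketch {
    enum State { ACTIVE, TRASHED, PURGED }

    interface SketchCatalog {
        boolean dropTable(String identifier, boolean purge);

        // Mirrors the default discussed above: a plain drop purges.
        default boolean dropTable(String identifier) {
            return dropTable(identifier, true /* purge */);
        }

        // Hypothetical: move the table to a trash can so it can be restored later.
        boolean softDropTable(String identifier);

        // Hypothetical: bring a trashed table back.
        boolean restoreTable(String identifier);
    }

    // Toy in-memory catalog that only tracks table state, not files.
    static final class InMemoryCatalog implements SketchCatalog {
        private final Map<String, State> tables = new HashMap<>();

        void create(String identifier) { tables.put(identifier, State.ACTIVE); }

        State state(String identifier) { return tables.get(identifier); }

        @Override
        public boolean dropTable(String identifier, boolean purge) {
            if (tables.get(identifier) != State.ACTIVE) return false;
            if (purge) {
                tables.put(identifier, State.PURGED); // data is gone for good
            } else {
                tables.remove(identifier); // catalog entry removed, files left alone
            }
            return true;
        }

        @Override
        public boolean softDropTable(String identifier) {
            if (tables.get(identifier) != State.ACTIVE) return false;
            tables.put(identifier, State.TRASHED);
            return true;
        }

        @Override
        public boolean restoreTable(String identifier) {
            if (tables.get(identifier) != State.TRASHED) return false;
            tables.put(identifier, State.ACTIVE);
            return true;
        }
    }
}
```

The point of the separate method is that the caller states the intent: a compliance-driven delete goes through `dropTable` and cannot be restored, while an ordinary cleanup can go through the trash can.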
Proposed Change
Current REST clients rely on the REST server to delete table files when dropping a table with purge. There are two concerns about this approach:
I propose supporting client-side purging, while still allowing server-side deletion to stay compatible with the current behavior.
Option 1: put the purge state in the delete-table response.
The client can decide whether to delete files based on the response. If the files were deleted on the server side, it does nothing; otherwise, it deletes them on the client side.
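A minimal Java sketch of the client-side decision in Option 1 follows. The response shape (`dropped`, `purged`) is an assumption made for this example, since the thread does not define the fields, and `FileDeleter` is a hypothetical stand-in for whatever file API the client uses.

```java
import java.util.List;

final class DropResponseSketch {
    // Hypothetical response shape: "purged" is true when the server
    // already deleted the table files.
    static final class DropTableResponse {
        final boolean dropped;
        final boolean purged;

        DropTableResponse(boolean dropped, boolean purged) {
            this.dropped = dropped;
            this.purged = purged;
        }
    }

    // Hypothetical stand-in for the client's file-deletion API.
    interface FileDeleter {
        void delete(String path);
    }

    // Deletes files client-side only when a purge was requested, the drop
    // succeeded, and the server reports that it did not purge.
    // Returns the number of files the client deleted.
    static int finishDrop(
            DropTableResponse response,
            boolean purgeRequested,
            List<String> tableFiles,
            FileDeleter deleter) {
        if (!response.dropped || !purgeRequested || response.purged) {
            return 0;
        }
        for (String path : tableFiles) {
            deleter.delete(path);
        }
        return tableFiles.size();
    }
}
```

Note that the client has to collect the list of table files before the drop request, because the metadata may no longer be reachable through the catalog afterwards.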
Option 2: check for the existence of table files on the client side.
The client can check whether the files still exist, then decide whether to delete them. This doesn't need spec changes, but clients would rely on a convention instead of the spec, which is a bit ambiguous.
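Option 2 could look roughly like the Java sketch below. `FileStore` is a hypothetical interface standing in for the client's storage API; the example only shows the check-then-delete convention and is not tied to any real Iceberg class.

```java
import java.util.List;

final class ExistenceCheckPurgeSketch {
    // Hypothetical stand-in for the client's storage API.
    interface FileStore {
        boolean exists(String path);
        void delete(String path);
    }

    // After the drop request returns, delete whatever table files are still
    // present. Returns the number of files the client deleted.
    static int purgeRemaining(List<String> tableFiles, FileStore store) {
        int deleted = 0;
        for (String path : tableFiles) {
            if (store.exists(path)) {
                store.delete(path);
                deleted++;
            }
        }
        return deleted;
    }
}
```

The ambiguity mentioned above shows up here: a file that still exists might mean the server cannot delete it, or only that the server has not deleted it yet, and the client cannot tell the two apart from the existence check alone.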
WDYT? Please share your feedback.
cc @RussellSpitzer @aokolnychyi @rdblue @danielcweeks @Fokko
Specifications