broeks-m opened 1 year ago
> This function would update all parquet files/logs containing data on records that match the filter criteria and, in the case of an update, add the new records.
Parquet files are immutable; we can't mutate them to update or delete rows.
Sorry, I didn't use the right terminology earlier. What I meant to say was that this function should replace all parquet files/logs containing records that match the filter criteria with cleaned-up versions of the parquet files/logs. These cleaned-up parquet files/logs should contain all the records that were present before the filter operation, except for the ones that match the filter criteria.
For example, let's say I have a delta table containing my personal information combined with an ID {1, "broeks-m"}. When I execute the function to delete my data, I would like it to:
- Read the parquet files containing my data into memory.
- Remove my personal information from the in-memory dataset.
- Write the filtered in-memory dataset to a new parquet file.
- Update any necessary logs and have them reference the new parquet file.
- Remove the old parquet files containing my personal data.
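As a rough illustration of these steps, here is a minimal sketch using plain Spark. The table path and `id` column are hypothetical, and it deliberately glosses over the transaction-log bookkeeping, which is exactly the part with no public API today:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, input_file_name}

val spark = SparkSession.builder().appName("perm-delete-sketch").getOrCreate()

// 1. Find the parquet files that contain the records to erase.
val affectedFiles = spark.read.parquet("/data/my_table")
  .withColumn("file", input_file_name())
  .filter(col("id") === 1)
  .select("file").distinct()
  .collect().map(_.getString(0))

// 2.-3. Rewrite each affected file without the matching records
// (output path is illustrative only).
affectedFiles.foreach { path =>
  spark.read.parquet(path)
    .filter(col("id") =!= 1)
    .write.parquet(path + ".cleaned")
  // 4.-5. The Delta log entries referencing `path` would then have to be
  // rewritten to point at the cleaned file and the old file removed;
  // that is the step this feature request is asking for.
}
```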
Any other solutions that could help us permanently remove or update records in a delta table are more than welcome. I'm open to suggestions and would be happy to expand on the function I'm describing.
Hello, it's great that you bring this up, as this is one of many important data compliance requirements that a lakehouse storage format like Delta should address. Unfortunately, it does not seem to have drawn much attention until now.
For data compliance requirements in general, I don't think we really need to remove records from all versions in history; rather, we must guarantee that the supposed-to-be-deleted records cannot be brought back to our table with a simple time travel query, since the physical storage files of those records will be removed after the retention period (30 days of log retention by default) plus a VACUUM.
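For reference, those retention windows are controlled by table properties. A sketch with the documented Delta defaults (the table path is hypothetical):

```scala
// Log retention bounds time travel; deleted-file retention bounds how long
// VACUUM keeps data files that are no longer referenced by the table.
spark.sql("""
  ALTER TABLE delta.`/data/my_table` SET TBLPROPERTIES (
    'delta.logRetentionDuration'         = 'interval 30 days',
    'delta.deletedFileRetentionDuration' = 'interval 7 days'
  )
""")
```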
My suggested solution is a new command called PERM-DELETE. This command deletes the target records in the current version of the table and prevents their re-appearance in all active historical versions. To do so, we add the transaction log entries of this PERM-DELETE command not only to the latest version, but to the transaction logs of all active historical versions. There is an issue with checkpoints: Delta has a background process that computes checkpoints as stable snapshots of the table, for better performance. For those checkpoints, we have to modify how a checkpoint is read when time traveling back to a checkpointed version. It is no longer enough to simply read the checkpoint parquet files; when an extra PERM-DELETE transaction log is present, the checkpointed version must be read like a regular version with one extra transaction log entry (the PERM-DELETE entry).
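To make the idea concrete, a PERM-DELETE log action could carry something like the following. The action name and every field here are invented for illustration and are not part of the Delta protocol:

```scala
// Hypothetical shape of a PERM-DELETE log action.
case class PermDelete(
  predicate: String,           // e.g. "id = 1 AND name = 'broeks-m'"
  appliedAt: Long,             // timestamp of the erasure request
  appliesToVersions: Seq[Long] // historical versions the filter must mask
)

// A reader replaying any affected version would parse this action and
// apply `predicate` as a final filter over the reconstructed snapshot.
```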
Regarding the concern of content conflicts, we should keep it simple: the goal of PERM-DELETE is only to avoid bringing back the supposed-to-be-deleted records when time traveling to a specific version. Therefore, all the regular transaction logic should be computed first, and the PERM-DELETE transaction log applied on top of the result. For instance, suppose V5 is the version in which we ingested record Rd, and up to version V10 Rd is still in the table. Then we decide to PERM-DELETE the records Rd, Rd2, and Rd3. If at some point we time travel back to V5, the logic to reconstruct that version should be to apply all of V5's transaction logs (including the one adding Rd), and once all regular transactions are done, apply the PERM-DELETE condition.
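A rough sketch of that replay order, assuming two hypothetical helpers (`loadSnapshot` standing in for Delta's normal log replay, and `permDeletePredicates` for reading the extra PERM-DELETE entries; neither is a real Delta API):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Stand-in for the regular log replay up to `version`.
def loadSnapshot(spark: SparkSession, path: String, version: Long): DataFrame =
  spark.read.format("delta").option("versionAsOf", version).load(path)

// Stand-in for reading the PERM-DELETE entries attached to this version.
def permDeletePredicates(path: String, version: Long): Seq[String] =
  Seq("id = 1 AND name = 'broeks-m'") // illustrative predicate only

// Regular transactions first, then every PERM-DELETE condition applied on
// top, so time travel can never resurrect the erased records.
def reconstructVersion(spark: SparkSession, path: String, version: Long): DataFrame =
  permDeletePredicates(path, version).foldLeft(loadSnapshot(spark, path, version)) {
    (df, predicate) => df.filter(s"NOT ($predicate)")
  }
```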
A final point: there will be edge cases, such as a deletion condition that is not deterministic. However, PERM-DELETE should be a command used with extreme care. We would normally use it for compliance, and those requests always come with very specific deletion conditions (a specific ID, PII, etc.).
Happy to receive your feedback. Cheers!
Feature request
Overview
I would like to have the ability to permanently delete specific records from a delta table while retaining the history of other records. Alternatively, I would like to be able to update all versions of specific records.
Motivation
All European companies/institutions must comply with the General Data Protection Regulation (GDPR). Article 17 of the GDPR states: “The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay”. This means that a citizen of a European country can ask a European-based company/institution to permanently delete their personal data. Companies/institutions have to abide by this rule and need to remove the requested data.
Further details
Compliance with this law currently requires deleting a record from a delta table and performing a VACUUM on the same table. However, this approach removes the version history of all records in that table, which overshoots the intended goal.
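For illustration, the current approach looks roughly like this (the path and predicate are hypothetical; vacuuming below the default 7-day threshold also requires disabling Delta's retention safety check):

```scala
import io.delta.tables.DeltaTable

// Delete the record, then vacuum with a zero retention window. This erases
// the old files, but it also destroys the version history of *every*
// record in the table, not just the one being erased.
val table = DeltaTable.forPath(spark, "/data/my_table")
table.delete("id = 1")

spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM delta.`/data/my_table` RETAIN 0 HOURS")
```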
The desired feature is the ability to permanently delete one or more records from the data table, including the history of those records. Alternatively, it should be possible to permanently update one or more records, which would delete all prior versions of those records and replace the record(s) with a passed parameter.
Ideally, the implementation of this feature would be a specific permanent delete/update function that performs the delete/update action based on a filter. In the case of an update, an additional parameter option should be added that contains the updated value. This function would update all parquet files/logs containing data on records that match the filter criteria and, in the case of an update, add the new records.
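A sketch of what that API surface could look like, modeled on the existing DeltaTable.delete/update methods. Neither permanentDelete nor permanentUpdate exists today; the calls are only what this request describes, so they are shown commented out:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.{col, lit}

val table = DeltaTable.forPath(spark, "/data/my_table")

// Erase matching records from the current version *and* all history:
// table.permanentDelete(col("id") === 1)

// Or rewrite them everywhere, replacing all prior versions with the value
// passed as the extra parameter:
// table.permanentUpdate(col("id") === 1, Map("name" -> lit("REDACTED")))
```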
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
Would love to help, but I don't have any coding experience in Scala/Java.