databrickslabs / discoverx

A Swiss-Army-knife for your Data Intelligence platform administration.

Scan across table versions #48

Open tdikland opened 1 year ago

tdikland commented 1 year ago

I would love to be able to scan across all my active versions (i.e. not vacuumed) of my (Delta) tables to make sure that I classify/discover all data that can theoretically be accessed by users with SELECT permission on the table.

One area where I feel this is a key capability is the GDPR use case. DiscoverX already exposes functionality to remove rows across tables (which is awesome!), but the current documentation rightly mentions that a vacuum is needed to make these rows truly inaccessible. In situations where the responsibility for removing rows for GDPR and the responsibility for auditing GDPR compliance are separated, it would be great for the auditing group to be able to check whether the required vacuum operation has run (i.e. the data is no longer accessible). Another use case is a user who mistakenly adds unwanted data to a table and, after realising it, deletes those rows without running vacuum. This poses similar challenges to the GDPR use case.
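To make the requested audit concrete, here is a minimal sketch of the check described above. It is not a DiscoverX API: `versions_exposing`, `read_version`, and `matches` are hypothetical names, and the table history is replaced by an in-memory dictionary so the logic runs without a cluster. The sketch also assumes that reading a vacuumed version fails with a file-not-found style error.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


def versions_exposing(
    versions: Iterable[int],
    read_version: Callable[[int], Iterable[Row]],
    matches: Callable[[Row], bool],
) -> List[int]:
    """Return the versions in which at least one row satisfies `matches`."""
    exposed = []
    for version in versions:
        try:
            if any(matches(row) for row in read_version(version)):
                exposed.append(version)
        except FileNotFoundError:
            # Assumed convention: the reader raises when the files backing
            # this version have been vacuumed, so nothing is exposed here.
            continue
    return exposed


# Toy history: a row is added in version 1 and deleted again in version 2.
snapshots = {
    0: [{"id": 1, "email": "a@example.com"}],
    1: [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}],
    2: [{"id": 1, "email": "a@example.com"}],
}


def read_version(version: int) -> List[Row]:
    """Hypothetical reader; stands in for a time-travel read of the table."""
    return snapshots[version]


exposed = versions_exposing(
    sorted(snapshots), read_version, lambda row: row["email"] == "b@example.com"
)
print(exposed)
```

On a real Delta table, `read_version` could plausibly wrap a time-travel read such as `spark.read.option("versionAsOf", version).table(name)`, with the version list taken from `DESCRIBE HISTORY`. An empty result would then suggest the deleted rows are no longer reachable through any retained version.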

edurdevic commented 1 year ago

We don't plan to query the table history versions; that would require quite some logic to be efficient (avoiding re-processing the same rows for each version). We are planning, though, to add support for running vacuum over multiple tables, so that it can be scheduled easily with dx.vacuum(from_tables="*.*.*")
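The efficiency concern can be illustrated with a small sketch. It rests on a simplified model of a Delta table, in which each version is a set of data files and unchanged files are shared between versions; it ignores details such as deletion vectors. Under that assumption, scanning each distinct file once and caching the result avoids re-processing the same rows for every version. The names `scan_versions_once`, `files_by_version`, and `scan_file` are hypothetical and not part of DiscoverX.

```python
from typing import Callable, Dict, List, Set


def scan_versions_once(
    files_by_version: Dict[int, Set[str]],
    scan_file: Callable[[str], bool],
) -> Dict[int, bool]:
    """Flag each version that references at least one file with a match.

    Every distinct file is scanned at most once, however many versions
    reference it.
    """
    cache: Dict[str, bool] = {}
    result: Dict[int, bool] = {}
    for version in sorted(files_by_version):
        hit = False
        for path in sorted(files_by_version[version]):
            if path not in cache:
                cache[path] = scan_file(path)  # the expensive step, done once per file
            hit = hit or cache[path]
        result[version] = hit
    return result


# Toy history: version 1 adds part-1, version 2 replaces it with part-2.
files_by_version = {
    0: {"part-0.parquet"},
    1: {"part-0.parquet", "part-1.parquet"},
    2: {"part-0.parquet", "part-2.parquet"},
}

scanned: List[str] = []


def scan_file(path: str) -> bool:
    """Hypothetical scanner; pretends only part-1 contains sensitive data."""
    scanned.append(path)
    return path == "part-1.parquet"


flags = scan_versions_once(files_by_version, scan_file)
print(flags)
print(scanned)
```

Here three versions reference five file entries in total, but only three distinct files are scanned. In a real table the per-version file sets would have to be derived from the transaction log, which is part of the extra logic mentioned above.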