When reading either the whole streaming output folder or a partition sub-folder of it, streaming-folder/_spark_metadata is ignored, and thus potential duplicates from incomplete microbatches might be read. One way to mitigate this problem is to periodically remove any "orphaned" parquet files, i.e. files that are not referenced by the _spark_metadata/ log. Care must be taken not to wrongly classify a parquet file as orphaned when it is not.
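For context, each commit file under _spark_metadata/ lists the files added by one microbatch as one JSON entry per line; the layout below is illustrative (path and values are made up), based on the format written by Spark's file streaming sink:

```
v1
{"path":"streaming-folder/part-00000-uuid.snappy.parquet","size":1024,"isDir":false,"modificationTime":1600000000000,"blockReplication":1,"blockSize":33554432,"action":"add"}
```

Any parquet file in the destination folder that appears in no such entry is a candidate orphan.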
Details
- Tool should be started independently of Hyperdrive (maybe it should even go to a separate repo)
- Compare the metadata log and the destination folder
- Delete parquet files not referenced in the metadata folder
- Print a warning, but don't delete anything, if a parquet file referenced in the metadata cannot be located (prevents accidental deletion due to invalid metadata paths after a folder move)
- Don't delete "recent" parquet files (prevents accidental deletion of parquet files that are part of an active microbatch and haven't been committed yet)
  - For example, don't delete any parquet files that are younger than the last referenced parquet file
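The deletion rules above could be sketched as follows. This is a minimal sketch, not the tool itself: `find_orphans` and its inputs are hypothetical names, and a real implementation would read the referenced paths from the _spark_metadata log and the file listing and timestamps from the (distributed) filesystem.

```python
def find_orphans(parquet_files, referenced, mtimes):
    """Identify parquet files that are safe to delete.

    parquet_files: set of file names found in the destination folder
    referenced:    set of file names listed in the _spark_metadata log
    mtimes:        mapping of file name -> modification time (seconds)

    Returns the set of deletable orphans, or raises if the metadata
    references a file that cannot be located.
    """
    missing = referenced - parquet_files
    if missing:
        # Warn and delete nothing: the metadata paths may be invalid,
        # e.g. after the folder was moved.
        raise RuntimeError(f"referenced files not found: {sorted(missing)}")

    if not referenced:
        return set()  # nothing committed yet; keep everything

    orphans = parquet_files - referenced

    # Don't delete "recent" files: anything younger than the last
    # referenced file may belong to an active, uncommitted microbatch.
    last_commit = max(mtimes[f] for f in referenced)
    return {f for f in orphans if mtimes[f] <= last_commit}
```

Note that the "younger than the last referenced file" cut-off is deliberately conservative: it keeps some genuinely orphaned files around until a later run, which is harmless, whereas deleting an uncommitted file would corrupt an in-flight microbatch.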
Background

Many users will read from a structured streaming folder like this

or from a partition sub-folder like this

In both cases, streaming-folder/_spark_metadata is ignored.