When reading either the whole streaming output folder or a partition sub-folder of it, streaming-folder/_spark_metadata is ignored, and thus potential duplicates from incomplete microbatches might be read. One way to mitigate this problem is to periodically remove any "orphaned" parquet files, i.e. files that are not referenced by the _spark_metadata/ log. Care must be taken not to wrongly classify a parquet file as orphaned when it is not.
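For context, each commit file under _spark_metadata/ lists the files added by one microbatch as one JSON entry per line; the layout below is illustrative (path and values are made up), based on the format written by Spark's file streaming sink:

```
v1
{"path":"streaming-folder/part-00000-uuid.snappy.parquet","size":1024,"isDir":false,"modificationTime":1600000000000,"blockReplication":1,"blockSize":33554432,"action":"add"}
```

Any parquet file in the destination folder that appears in no such entry is a candidate orphan.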
Details
- Tool should be started independently of Hyperdrive (maybe it should even go to a separate repo)
- Compare the metadata log and the destination folder
- Delete parquet files not referenced in the metadata folder
- Print a warning, but don't delete anything, if a parquet file referenced in the metadata cannot be located (prevents accidental deletion due to invalid metadata paths after a folder move)
- Don't delete "recent" parquet files (prevents accidental deletion of parquet files that are part of an active microbatch and haven't been committed yet)
  - For example, don't delete any parquet files that are younger than the last referenced parquet file
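The deletion rules above could be sketched as follows. This is a minimal sketch, not the tool itself: `find_orphans` and its inputs are hypothetical names, and a real implementation would read the referenced paths from the _spark_metadata log and the file listing and timestamps from the (distributed) filesystem.

```python
def find_orphans(parquet_files, referenced, mtimes):
    """Identify parquet files that are safe to delete.

    parquet_files: set of file names found in the destination folder
    referenced:    set of file names listed in the _spark_metadata log
    mtimes:        mapping of file name -> modification time (seconds)

    Returns the set of deletable orphans, or raises if the metadata
    references a file that cannot be located.
    """
    missing = referenced - parquet_files
    if missing:
        # Warn and delete nothing: the metadata paths may be invalid,
        # e.g. after the folder was moved.
        raise RuntimeError(f"referenced files not found: {sorted(missing)}")

    if not referenced:
        return set()  # nothing committed yet; keep everything

    orphans = parquet_files - referenced

    # Don't delete "recent" files: anything younger than the last
    # referenced file may belong to an active, uncommitted microbatch.
    last_commit = max(mtimes[f] for f in referenced)
    return {f for f in orphans if mtimes[f] <= last_commit}
```

Note that the "younger than the last referenced file" cut-off is deliberately conservative: it keeps some genuinely orphaned files around until a later run, which is harmless, whereas deleting an uncommitted file would corrupt an in-flight microbatch.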
Background

Many users will read from a structured streaming folder like this

or from a partition sub-folder like this

In both cases, streaming-folder/_spark_metadata is ignored.