AbsaOSS / hyperdrive

Extensible streaming ingestion pipeline on top of Apache Spark
Apache License 2.0

Write tool that cleans up parquet files not referenced in metadata folder #226

Closed by kevinwallimann 3 years ago

kevinwallimann commented 3 years ago

Background

Many users will read from a structured streaming folder like this:

spark.read.parquet("streaming-folder/*")

or from a partition sub-folder like this:

spark.read.parquet("streaming-folder/partition1=value1/partition2=value2")

In both cases, streaming-folder/_spark_metadata is ignored, so duplicates from incomplete micro-batches might be read. One way to mitigate this problem is to periodically remove "orphaned" parquet files, i.e. files that are not referenced by the _spark_metadata/ log. Care must be taken not to wrongly classify a parquet file as orphaned when it is in fact referenced.
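To make the idea concrete, here is a minimal sketch of the detection step only, not the tool itself. It assumes the folder is on a local filesystem and that each file under _spark_metadata/ holds a version header line followed by one JSON object per data file with a "path" field. The function names `referenced_files` and `find_orphans` and the `min_age_seconds` parameter are hypothetical. To err on the safe side, files are matched by name only, and files younger than the age threshold are never reported, on the assumption that they may belong to a micro-batch that has not been committed yet.

```python
import json
import time
from pathlib import Path
from urllib.parse import unquote, urlparse


def referenced_files(root: Path) -> set:
    """Collect the names of all data files mentioned in root/_spark_metadata.

    Assumption: every log line that starts with "{" is a JSON object
    with a "path" field holding the file URI.
    """
    names = set()
    for log in (root / "_spark_metadata").iterdir():
        if not log.is_file() or log.name.startswith("."):
            continue  # skip hidden files such as checksums or temp files
        for line in log.read_text().splitlines():
            if not line.startswith("{"):
                continue  # skip the version header line
            entry = json.loads(line)
            # Keep only the file name; any entry counts as a reference.
            names.add(Path(unquote(urlparse(entry["path"]).path)).name)
    return names


def find_orphans(root: Path, min_age_seconds: float = 3600.0) -> list:
    """Return parquet files under root that the metadata log never mentions.

    Files modified within the last min_age_seconds are skipped, since
    they might belong to a micro-batch that is still being written.
    """
    referenced = referenced_files(root)
    cutoff = time.time() - min_age_seconds
    orphans = []
    for f in root.rglob("*.parquet"):
        if "_spark_metadata" in f.parts:
            continue  # never touch the log directory itself
        if f.name not in referenced and f.stat().st_mtime < cutoff:
            orphans.append(f)
    return sorted(orphans)
```

A caller would typically print or log the result of `find_orphans(Path("streaming-folder"))` and review it before deleting anything. A real tool would also have to list files through the Hadoop FileSystem API rather than `pathlib` to work on HDFS or S3.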

Details

kevinwallimann commented 3 years ago

Similar functionality was already implemented in #132.