AbsaOSS / hyperdrive

Extensible streaming ingestion pipeline on top of Apache Spark
Apache License 2.0

ParquetStreamWriter shouldn't write if metadata is inconsistent #132

Closed kevinwallimann closed 4 years ago

kevinwallimann commented 4 years ago

Problem description When reading, Spark does not consider the metadata log if you read with a globbed path (e.g. /root-dir/*) or from a partitioned sub-directory (e.g. /root-dir/partition1=value1). Downstream applications are therefore at risk of reading duplicated values in case of application failures and restarts. Two cases of inconsistent metadata logs can be distinguished:

  1. Metadata log contains files that are not on the filesystem: Most likely, parquet files have been deleted / moved manually.
  2. Parquet files are present that are not in the metadata log: Most likely, this is due to a previous partial write. The parquet files should be removed.

In case 1), Spark will throw a FileNotFoundException on the next write. In case 2), however, Spark does not throw any exception, because from Spark's perspective this case is not an error.
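The two cases boil down to a set comparison between the files recorded in the metadata log and the files actually present on the filesystem. A minimal sketch of that classification (names are illustrative, not hyperdrive's actual API):

```scala
// Classify metadata-log inconsistencies by comparing the set of file paths
// recorded in the metadata log against the parquet files found on the
// filesystem. MetadataConsistency and its case objects are hypothetical names.
object MetadataConsistency {
  sealed trait Inconsistency
  // Case 1: the log references files that no longer exist (deleted/moved manually)
  case object MissingOnFilesystem extends Inconsistency
  // Case 2: untracked parquet files, most likely leftovers of a partial write
  case object NotInMetadataLog extends Inconsistency

  def check(metadataFiles: Set[String], filesystemFiles: Set[String]): Set[Inconsistency] = {
    val issues = Set.newBuilder[Inconsistency]
    if ((metadataFiles -- filesystemFiles).nonEmpty) issues += MissingOnFilesystem
    if ((filesystemFiles -- metadataFiles).nonEmpty) issues += NotInMetadataLog
    issues.result()
  }
}
```

Only case 1 surfaces as an exception from Spark itself; case 2 has to be detected explicitly, which is what this issue asks the writer to do before writing.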

Proposed solution

This solution guarantees deduplicated reads for globbed paths and partitioned subdirectory reads, but it does not guarantee atomicity: partial writes will be visible to downstream applications, but they will not be duplicated by subsequent writes.
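Since case 2 files "should be removed", the proposed pre-write check could reconcile the directory before committing a new batch. A hedged sketch, assuming a pluggable delete callback rather than any specific hyperdrive or Hadoop API:

```scala
// Hypothetical pre-write guard: before writing a new batch, remove parquet
// files that are present on the filesystem but absent from the metadata log
// (likely remnants of a previous partial write). `deleteFile` stands in for
// the real filesystem delete call.
object PreWriteGuard {
  def untrackedFiles(metadataFiles: Set[String], filesystemFiles: Set[String]): Set[String] =
    filesystemFiles -- metadataFiles

  def reconcile(metadataFiles: Set[String],
                filesystemFiles: Set[String],
                deleteFile: String => Unit): Unit =
    untrackedFiles(metadataFiles, filesystemFiles).foreach(deleteFile)
}
```

Deleting only the untracked files keeps every committed batch intact, which is why subsequent reads are deduplicated even though a partial write may briefly have been visible.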

Other