AbsaOSS / hyperdrive

Extensible streaming ingestion pipeline on top of Apache Spark
Apache License 2.0

ParquetStreamWriter shouldn't write if metadata is inconsistent #132

Closed kevinwallimann closed 4 years ago

kevinwallimann commented 4 years ago

Problem description When reading, Spark does not consider the metadata log if you read with a globbed path (e.g. /root-dir/*) or from a partitioned sub-directory (e.g. /root-dir/partition1=value1). Downstream applications are therefore at risk of reading duplicated values in case of application failures and restarts. Two cases of inconsistent metadata logs can be distinguished:

  1. Metadata log contains files that are not on the filesystem: Most likely, parquet files have been deleted / moved manually.
  2. Parquet files are present that are not in the metadata log: Most likely, this is due to a previous partial write. The parquet files should be removed.

In case 1), Spark will throw a FileNotFoundException on the next write. In case 2), however, Spark does not throw any exception, because from Spark's perspective this case is not an error.
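The two cases boil down to a set comparison between the files recorded in the metadata log and the files actually present on the filesystem. A minimal sketch of that classification (names are illustrative, not hyperdrive's actual API):

```scala
// Classify metadata-log inconsistencies by comparing the set of file paths
// recorded in the metadata log against the parquet files found on the
// filesystem. MetadataConsistency and its case objects are hypothetical names.
object MetadataConsistency {
  sealed trait Inconsistency
  // Case 1: the log references files that no longer exist (deleted/moved manually)
  case object MissingOnFilesystem extends Inconsistency
  // Case 2: untracked parquet files, most likely leftovers of a partial write
  case object NotInMetadataLog extends Inconsistency

  def check(metadataFiles: Set[String], filesystemFiles: Set[String]): Set[Inconsistency] = {
    val issues = Set.newBuilder[Inconsistency]
    if ((metadataFiles -- filesystemFiles).nonEmpty) issues += MissingOnFilesystem
    if ((filesystemFiles -- metadataFiles).nonEmpty) issues += NotInMetadataLog
    issues.result()
  }
}
```

Only case 1 surfaces as an exception from Spark itself; case 2 has to be detected explicitly, which is what this issue asks the writer to do before writing.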

Proposed solution

This solution guarantees deduplicated reads for globbed paths and partitioned subdirectory reads, but it does not guarantee atomicity: partial writes will be visible to downstream applications, but they will not be duplicated by subsequent writes.
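Since case 2 files "should be removed", the proposed pre-write check could reconcile the directory before committing a new batch. A hedged sketch, assuming a pluggable delete callback rather than any specific hyperdrive or Hadoop API:

```scala
// Hypothetical pre-write guard: before writing a new batch, remove parquet
// files that are present on the filesystem but absent from the metadata log
// (likely remnants of a previous partial write). `deleteFile` stands in for
// the real filesystem delete call.
object PreWriteGuard {
  def untrackedFiles(metadataFiles: Set[String], filesystemFiles: Set[String]): Set[String] =
    filesystemFiles -- metadataFiles

  def reconcile(metadataFiles: Set[String],
                filesystemFiles: Set[String],
                deleteFile: String => Unit): Unit =
    untrackedFiles(metadataFiles, filesystemFiles).foreach(deleteFile)
}
```

Deleting only the untracked files keeps every committed batch intact, which is why subsequent reads are deduplicated even though a partial write may briefly have been visible.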

Other