AbsaOSS / hyperdrive

Extensible streaming ingestion pipeline on top of Apache Spark
Apache License 2.0

Add Trigger.ProcessingTime to Writers #84

Closed: kevinwallimann closed this issue 4 years ago

kevinwallimann commented 4 years ago

Currently, all jobs are ingested with Trigger.Once, i.e. all data is ingested into one Parquet file (per Kafka partition). Certain jobs may produce very large output files, leading to out-of-memory errors.

To prevent this, Trigger.ProcessingTime should be used.
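
For context, Spark Structured Streaming exposes both modes through the Trigger API. A minimal standalone sketch of the difference (topic, paths, and bootstrap servers are placeholders, not hyperdrive configuration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object TriggerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("trigger-example").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "some-topic")
      .load()

    df.writeStream
      .format("parquet")
      .option("path", "/tmp/output")
      .option("checkpointLocation", "/tmp/checkpoint")
      // Trigger.Once processes all available data in a single micro-batch,
      // which yields one output file per Kafka partition:
      // .trigger(Trigger.Once())
      // Trigger.ProcessingTime splits ingestion into micro-batches fired at
      // the given interval, bounding the size of each output file:
      .trigger(Trigger.ProcessingTime(10000L)) // interval in milliseconds
      .start()
      .awaitTermination()
  }
}
```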

New configuration property: writer.parquet.trigger. The expected value is the trigger interval in milliseconds. If the value is not a number, or the property is not present, all data should be ingested at once, as is the case now.

The change should be available for both ParquetStreamWriter and ParquetPartitioningStreamWriter. A sketch of the fallback logic follows.
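
One possible shape of that fallback, as a sketch only: the helper name below is hypothetical, and it assumes the property value arrives as an optional string from the writer's configuration.

```scala
import org.apache.spark.sql.streaming.Trigger
import scala.util.Try

object TriggerConfig {
  // Hypothetical helper: map the optional property value to a trigger.
  // A valid number of milliseconds yields Trigger.ProcessingTime; a missing
  // or non-numeric value falls back to Trigger.Once (the current behaviour).
  def resolveTrigger(value: Option[String]): Trigger =
    value
      .flatMap(v => Try(v.trim.toLong).toOption)
      .map(Trigger.ProcessingTime(_))
      .getOrElse(Trigger.Once())
}
```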

kevinwallimann commented 4 years ago

One point to consider: the interaction of writer.common.trigger.processing.time and ingestor.spark.termination.method.
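
The interaction matters because termination behaviour differs per trigger: with Trigger.Once the query stops on its own after one micro-batch, while with Trigger.ProcessingTime it runs indefinitely unless something stops it. A sketch of that difference using plain Spark APIs (the termination-method values below are illustrative, not hyperdrive's actual configuration values):

```scala
import org.apache.spark.sql.streaming.StreamingQuery

object TerminationExample {
  // Illustrative termination methods; hyperdrive's actual values may differ.
  def awaitQuery(query: StreamingQuery, terminationMethod: String): Unit =
    terminationMethod match {
      // Blocks until the query stops. Safe with Trigger.Once, which ends
      // after one micro-batch, but blocks forever with Trigger.ProcessingTime.
      case "awaitTermination" =>
        query.awaitTermination()
      // Waits until all data available at the time of the call has been
      // processed, then stops the query. This bounds the runtime of a
      // Trigger.ProcessingTime ingestion.
      case "processAllAvailable" =>
        query.processAllAvailable()
        query.stop()
    }
}
```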