AbsaOSS / hyperdrive

Extensible streaming ingestion pipeline on top of Apache Spark
Apache License 2.0

Add Trigger.ProcessingTime to Writers #84

Closed: kevinwallimann closed this issue 4 years ago

kevinwallimann commented 4 years ago

Currently, all jobs are ingested with Trigger.Once, i.e. all data is ingested into one Parquet file (per Kafka partition). Certain jobs may produce very large output files, leading to out-of-memory errors.

To prevent this, Trigger.ProcessingTime should be used.
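
For context, Spark Structured Streaming exposes both modes through the Trigger API. A minimal standalone sketch of the difference (topic, paths, and bootstrap servers are placeholders, not hyperdrive configuration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object TriggerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("trigger-example").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "some-topic")
      .load()

    df.writeStream
      .format("parquet")
      .option("path", "/tmp/output")
      .option("checkpointLocation", "/tmp/checkpoint")
      // Trigger.Once processes all available data in a single micro-batch,
      // which yields one output file per Kafka partition:
      // .trigger(Trigger.Once())
      // Trigger.ProcessingTime splits ingestion into micro-batches fired at
      // the given interval, bounding the size of each output file:
      .trigger(Trigger.ProcessingTime(10000L)) // interval in milliseconds
      .start()
      .awaitTermination()
  }
}
```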

New configuration property: writer.parquet.trigger. The expected value is the trigger interval in milliseconds. If the value is not a number, or the property is not present, all data should be ingested at once, as is the case now.

The change should be available for both ParquetStreamWriter and ParquetPartitioningStreamWriter. A sketch of the fallback logic follows.
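
One possible shape of that fallback, as a sketch only: the helper name below is hypothetical, and it assumes the property value arrives as an optional string from the writer's configuration.

```scala
import org.apache.spark.sql.streaming.Trigger
import scala.util.Try

object TriggerConfig {
  // Hypothetical helper: map the optional property value to a trigger.
  // A valid number of milliseconds yields Trigger.ProcessingTime; a missing
  // or non-numeric value falls back to Trigger.Once (the current behaviour).
  def resolveTrigger(value: Option[String]): Trigger =
    value
      .flatMap(v => Try(v.trim.toLong).toOption)
      .map(Trigger.ProcessingTime(_))
      .getOrElse(Trigger.Once())
}
```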

kevinwallimann commented 4 years ago

One point to consider: the interaction of writer.common.trigger.processing.time and ingestor.spark.termination.method.
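
The interaction matters because termination behaviour differs per trigger: with Trigger.Once the query stops on its own after one micro-batch, while with Trigger.ProcessingTime it runs indefinitely unless something stops it. A sketch of that difference using plain Spark APIs (the termination-method values below are illustrative, not hyperdrive's actual configuration values):

```scala
import org.apache.spark.sql.streaming.StreamingQuery

object TerminationExample {
  // Illustrative termination methods; hyperdrive's actual values may differ.
  def awaitQuery(query: StreamingQuery, terminationMethod: String): Unit =
    terminationMethod match {
      // Blocks until the query stops. Safe with Trigger.Once, which ends
      // after one micro-batch, but blocks forever with Trigger.ProcessingTime.
      case "awaitTermination" =>
        query.awaitTermination()
      // Waits until all data available at the time of the call has been
      // processed, then stops the query. This bounds the runtime of a
      // Trigger.ProcessingTime ingestion.
      case "processAllAvailable" =>
        query.processAllAvailable()
        query.stop()
    }
}
```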