By default, Spark's streaming capabilities follow a "micro-batching" model: data is collected into a batch over a window of time, and at the end of that window a batch job is launched on the cluster to process the records that fell into it. Once the micro-batch completes, Structured Streaming persists information about it in the write-ahead log (WAL).
The new continuous processing mode, however, takes a more reactive approach to processing streaming data, similar to Flink and Storm. Instead of batching data over a windowed time frame and processing the records in batches, the framework starts a set of long-running workers on the cluster that continuously read data from the sources and write it to the sinks.
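For reference, continuous mode is opted into per query via the trigger setting; the rest of the query definition is unchanged. A minimal PySpark configuration sketch (the Kafka source options, checkpoint path, and console sink here are illustrative placeholders, not our actual setup):

```python
# Sketch: enabling continuous processing on a Structured Streaming query.
# Source/sink options and paths below are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-demo").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

query = (stream.writeStream
         .format("console")
         # With a continuous trigger, long-running workers process records
         # as they arrive; the interval controls how often progress is
         # checkpointed, not how often batches are launched.
         .trigger(continuous="1 second")
         .option("checkpointLocation", "/tmp/checkpoints/continuous-demo")
         .start())
```

Note that only a subset of sources, sinks, and operations support continuous execution, which is part of what the compatibility testing below needs to establish.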
We should make sure that we are compatible with this mode of operation and officially document it. This will require integration tests against this run mode. I also anticipate changes in how we flush data to Elasticsearch: the sink implementation for Structured Streaming batches records internally, flushing them to Elasticsearch as they fill up or when the sink is closed at the end of a micro-batch. The whole point of continuous mode is to reduce the latency between accepting a record and processing it, so we should explore ways to stream data directly to ES if possible, or at least bound how long data is allowed to accumulate in the sink implementation.
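To make the flushing concern concrete, here is a minimal sketch (hypothetical illustration, not the actual ES-Hadoop sink code) of a sink that flushes on a size threshold or on close, extended with a linger cap so buffered records cannot wait indefinitely under continuous execution:

```python
import time


class BufferedSink:
    """Sketch of a batching sink. Flushes when the buffer is full, when the
    oldest buffered record has lingered too long, or when the sink closes.
    Names and thresholds are illustrative, not ES-Hadoop's actual values."""

    def __init__(self, flush_fn, max_records=3, max_linger_secs=0.05):
        self.flush_fn = flush_fn                # receives a list of records
        self.max_records = max_records          # size-based flush threshold
        self.max_linger_secs = max_linger_secs  # latency cap for buffered data
        self._buffer = []
        self._oldest_ts = None

    def write(self, record):
        if self._oldest_ts is None:
            self._oldest_ts = time.monotonic()
        self._buffer.append(record)
        # Flush on size, or when the oldest record has waited past the cap.
        if (len(self._buffer) >= self.max_records or
                time.monotonic() - self._oldest_ts >= self.max_linger_secs):
            self._flush()

    def close(self):
        # Flush whatever remains when the task ends.
        self._flush()

    def _flush(self):
        if self._buffer:
            self.flush_fn(list(self._buffer))
            self._buffer.clear()
            self._oldest_ts = None


batches = []
sink = BufferedSink(batches.append, max_records=3)
for record in ["a", "b", "c", "d"]:
    sink.write(record)
sink.close()
# First three records flush on the size threshold; close() flushes the rest.
print(batches)
```

The linger cap is the key addition for continuous mode: a micro-batch sink can rely on close() arriving at the end of each batch, but a long-running continuous worker may never hit the size threshold on a slow stream, so time must also bound accumulation.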