elastic / elasticsearch-hadoop

Elasticsearch real-time search and analytics natively integrated with Hadoop
https://www.elastic.co/products/hadoop
Apache License 2.0

Support Continuous Processing Mode in Spark Streaming #1422

Open jbaiera opened 4 years ago

jbaiera commented 4 years ago

By default, Spark's streaming capabilities follow a "micro-batching" model: data is collected into a batch over a window of time, and at the end of that window a batch job is launched on the cluster to process the records that fell into it. Once the micro-batch completes, Structured Streaming persists information about the completed batch in its write-ahead log (WAL).

The new continuous processing mode, however, takes a more reactive approach to processing streaming data, similar to Flink and Storm. Instead of batching up data over a windowed time frame and processing the records in batches, the framework starts a set of long-running workers on the cluster that continuously read data from the sources and write it to the sinks.
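From the user's side, the difference between the two modes comes down to the trigger on the streaming query. A hedged sketch of what a continuous-mode job writing to the connector's sink might look like once supported (the host address, checkpoint path, and index name here are illustrative; the `"es"` sink format is the elasticsearch-hadoop connector, and the `rate` source is Spark's built-in test source):

```python
# Sketch only: continuous mode currently supports a limited set of
# sources/sinks and map-like operations; the "es" sink accepting a
# continuous trigger is exactly what this issue proposes.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-demo").getOrCreate()

rates = (
    spark.readStream
    .format("rate")                 # built-in test source
    .option("rowsPerSecond", 10)
    .load()
)

query = (
    rates.selectExpr("timestamp", "value")
    .writeStream
    .format("es")                                        # elasticsearch-hadoop sink
    .option("checkpointLocation", "/tmp/continuous-demo")  # illustrative path
    .option("es.nodes", "localhost:9200")                  # assumed cluster address
    .trigger(continuous="1 second")  # continuous mode; omit for default micro-batching
    .start("demo-index/doc")                               # illustrative index
)
```

With the default micro-batch trigger, Spark would instead launch a job per window; here the `continuous="1 second"` argument only controls how often workers checkpoint their progress, not how records are batched.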

We should make sure that we are compatible with this mode of operation and officially document it, which will require integration tests against the continuous run mode. I also anticipate changes in how we flush data to Elasticsearch: the sink implementation for Structured Streaming batches records internally, flushing them to Elasticsearch as the buffers fill up or when the sink is closed at the end of a micro-batch. The whole point of continuous mode is to reduce the latency of accepting a piece of data, so we should explore streaming data directly to ES if possible, or at least bound how long data is allowed to accumulate in the sink implementation.
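The "bound how long data can accumulate" idea can be sketched as a buffer that flushes on either a size limit or an age limit. This is illustrative only, not the connector's actual bulk-writing code (class and field names here are made up):

```python
import time


class BoundedBuffer:
    """Sketch: flush buffered records when the buffer is full OR when the
    oldest buffered record exceeds a maximum age. The real connector's
    bulk-handling internals differ; this only illustrates the policy."""

    def __init__(self, max_records=3, max_age_seconds=5.0, clock=time.monotonic):
        self.max_records = max_records
        self.max_age = max_age_seconds
        self.clock = clock          # injectable for testing
        self.buffer = []
        self.oldest = None          # timestamp of the oldest buffered record
        self.flushed = []           # stand-in for bulk requests sent to Elasticsearch

    def add(self, record):
        if self.oldest is None:
            self.oldest = self.clock()
        self.buffer.append(record)
        # Flush if the size bound or the latency bound is hit.
        full = len(self.buffer) >= self.max_records
        stale = self.clock() - self.oldest >= self.max_age
        if full or stale:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flushed.append(list(self.buffer))  # one "bulk request"
            self.buffer.clear()
            self.oldest = None
```

In micro-batch mode only the size bound matters, because the sink is closed (and flushed) at the end of every batch; with long-running continuous workers, the age bound is what keeps a trickle of records from sitting in the buffer indefinitely.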

sam-h-bean commented 1 year ago

Any plan to do this? It's been a while and I'd like this!

masseyke commented 1 year ago

We would certainly like to do this, but it is not on our near-term roadmap unfortunately.