apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0
7.9k stars 4.27k forks source link

Add Kafka Streams runner #18479

Open kennknowles opened 2 years ago

kennknowles commented 2 years ago

Kafka Streams (https://kafka.apache.org/documentation/streams) has more and more features that could make it a viable candidate for a streaming runner. It uses DataFlow-like model.

 

Please look at the Design Document and add comments.

Imported from Jira BEAM-2466. Original Jira may contain additional context. Reported by: klorand.

wilsonwang371 commented 1 year ago

I am actually very interested in this topic. Generally, it is possible to have beam running on a generic faas system that with MQ support?

je-ik commented 1 year ago

There are several requirements that must be met. Essentially: a) preserving per-partition order of records (i.e. records emitted in order from one distributed producer must not overtake each other when consumed) b) producer must be able to enqueue output records for a specific consumer (e.g. assigning a key of a output record, all records with same key must then be consumed by the same instance of downstream consumer) c) producer must be able to send record to all downstream consumers (i.e. producer must know how many consumers there - possibly - is) d) there must be some kind of support of state commit, either at the end of bundle, during bundle commit (dataflow model), or as a flowing checkpoint barrier (flink model), there must be a way to safely store state in a distributed fault tolerant storage and be able to possibly restore the complete state from that committed state

Having these conditions met I think it should be possible (though quite hard) to implement Beam runner on top of it. Kafka definitely has all four (even without Kafka streams).

wilsonwang371 commented 1 year ago

There are several requirements that must be met. Essentially: a) preserving per-partition order of records (i.e. records emitted in order from one distributed producer must not overtake each other when consumed) b) producer must be able to enqueue output records for a specific consumer (e.g. assigning a key of a output record, all records with same key must then be consumed by the same instance of downstream consumer) c) producer must be able to send record to all downstream consumers (i.e. producer must know how many consumers there - possibly - is) d) there must be some kind of support of state commit, either at the end of bundle, during bundle commit (dataflow model), or as a flowing checkpoint barrier (flink model), there must be a way to safely store state in a distributed fault tolerant storage and be able to possibly restore the complete state from that committed state

Having these conditions met I think it should be possible (though quite hard) to implement Beam runner on top of it. Kafka definitely has all four (even without Kafka streams).

Thank you so much for the reply.