apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: ReadFromKafkaViaSDF ignores configuration from redistribute and allow duplicates and always enables them #32196

Closed scwhittle closed 1 month ago

scwhittle commented 1 month ago

What happened?

Introduced in https://github.com/apache/beam/commit/9cbdda1b4e52452728cc9da2fa8498d0ace5ed7b, so this affects the Beam 2.58 release. The Dataflow v2 Runner uses this version for Kafka by default.

Since this can introduce duplicates and is unexpected, I am marking it as P1 and considering releasing a patched SDK version to address it.

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

scwhittle commented 1 month ago

Possible work-arounds:

scwhittle commented 1 month ago

Upon further investigation, the incorrect options are set on the ReadSourceDescriptors transform, but they are not used for determining redistribute and allowed duplicates of the read elements, since that uses the original kafkaRead here

The effect of the misconfiguration is that if committing offsets is enabled, the commit is not performed due to the logic here
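
To make the failure mode above concrete, here is a toy model in Python. It is only a sketch and not Beam's code: `ReadConfig`, `DescriptorOptions`, `build_descriptor_options_buggy`, `build_descriptor_options_fixed`, and `should_commit_offsets` are hypothetical names, and the rule that commits are skipped whenever redistribute is set is an assumption inferred from the comment above, not a copy of the actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadConfig:
    """User-facing read settings (hypothetical stand-in for the Kafka read)."""
    commit_offsets: bool = False
    redistribute: bool = False
    allow_duplicates: bool = False


@dataclass(frozen=True)
class DescriptorOptions:
    """Options handed to the inner descriptor-reading transform (hypothetical)."""
    commit_offsets: bool
    redistribute: bool
    allow_duplicates: bool


def build_descriptor_options_buggy(read: ReadConfig) -> DescriptorOptions:
    # Modeled bug: the user's choices are ignored and both flags are forced on.
    return DescriptorOptions(read.commit_offsets, True, True)


def build_descriptor_options_fixed(read: ReadConfig) -> DescriptorOptions:
    # Modeled fix: the user's choices are passed through unchanged.
    return DescriptorOptions(
        read.commit_offsets, read.redistribute, read.allow_duplicates
    )


def should_commit_offsets(opts: DescriptorOptions) -> bool:
    # Assumed gate: offset commits are skipped when redistribute is set.
    return opts.commit_offsets and not opts.redistribute


# A user who asked for offset commits and never asked for redistribute.
user = ReadConfig(commit_offsets=True)
print(should_commit_offsets(build_descriptor_options_buggy(user)))  # False
print(should_commit_offsets(build_descriptor_options_fixed(user)))  # True
```

Under these assumptions, the forced flags silently disable a commit the user explicitly requested, which matches the symptom reported in this issue.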

scwhittle commented 1 month ago

Fixed in 2.58.1