akka / alpakka-kafka

Alpakka Kafka connector - Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.
https://doc.akka.io/libraries/alpakka-kafka/current/

Consumer.atMostOnceSource repeats messages after consumer group rebalance #1081

Open · SmedbergM opened this issue 4 years ago

SmedbergM commented 4 years ago

Versions used

"com.typesafe.akka" %% "akka-actor" % "2.5.23", "com.typesafe.akka" %% "akka-stream-kafka" % "2.0.2",

Akka version: 2.5.23

Expected Behavior

When the members of a consumer group are created using Consumer.atMostOnceSource, no single message should be consumed and processed twice.

Actual Behavior

A single message can be processed twice, meaning that its commit was not completed before the ConsumerRecord was emitted downstream.

Please see the MWE linked below.

Summary: Start consuming a topic, taking a relatively long time per message (to simulate running some kind of batch job per message). Add nodes to the consumer group one at a time, forcing a rebalance on each addition. Observe that a message that is in flight when a rebalance occurs (e.g. message #53 in the MWE transcript) can be repeated by the next node assigned that partition. More than 10% of messages were reprocessed at least once in this run of the MWE.
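
For context, a minimal sketch of the consuming side in the spirit of the MWE (the topic name mwe_topic, the two-second delay, and the use of Kafka's stock IntegerDeserializer in place of the MWE's custom IntDeserializer are assumptions for illustration):

    import akka.actor.ActorSystem
    import akka.kafka.scaladsl.Consumer
    import akka.kafka.{ConsumerSettings, Subscriptions}
    import akka.stream.ActorMaterializer
    import akka.stream.scaladsl.Sink
    import org.apache.kafka.clients.consumer.ConsumerConfig
    import org.apache.kafka.common.serialization.{IntegerDeserializer, StringDeserializer}

    import scala.concurrent.Future
    import scala.concurrent.duration._

    object AtMostOnceMwe extends App {
      implicit val system: ActorSystem = ActorSystem("consumer_system")
      implicit val mat: ActorMaterializer = ActorMaterializer()
      import system.dispatcher

      // Settings matching the ConsumerConfig values logged below.
      val settings = ConsumerSettings(system, new IntegerDeserializer, new StringDeserializer)
        .withBootstrapServers("kafka:9092")
        .withGroupId("mwe_group")
        .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
        .withProperty(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "5")

      // atMostOnceSource is supposed to commit each offset before emitting the
      // record, so a record in flight during a rebalance should never reappear.
      Consumer
        .atMostOnceSource(settings, Subscriptions.topics("mwe_topic"))
        .mapAsync(1) { record =>
          // Take a relatively long time per message to simulate a batch job,
          // so that rebalances can land while a record is in flight.
          akka.pattern.after(2.seconds, system.scheduler) {
            Future(println(s"processed message #${record.key} -> ${record.value}"))
          }
        }
        .runWith(Sink.ignore)
    }

The expectation is that the commit for each record completes before the record is emitted downstream, which is the guarantee the transcript shows being violated.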

Relevant logs

Logged consumer settings:

2020-03-17 18:16:12.057 [consumer_system-akka.kafka.default-dispatcher-6] INFO  o.a.k.c.consumer.ConsumerConfig - ConsumerConfig values: 
    allow.auto.create.topics = true
    auto.commit.interval.ms = 5000
    auto.offset.reset = earliest
    bootstrap.servers = [kafka:9092]
    check.crcs = true
    client.dns.lookup = default
    client.id = 
    client.rack = 
    connections.max.idle.ms = 540000
    default.api.timeout.ms = 60000
    enable.auto.commit = false
    exclude.internal.topics = true
    fetch.max.bytes = 52428800
    fetch.max.wait.ms = 500
    fetch.min.bytes = 1
    group.id = mwe_group
    group.instance.id = null
    heartbeat.interval.ms = 3000
    interceptor.classes = []
    internal.leave.group.on.close = true
    isolation.level = read_uncommitted
    key.deserializer = class AtMostOnceConsumer$IntDeserializer$
    max.partition.fetch.bytes = 1048576
    max.poll.interval.ms = 300000
    max.poll.records = 5
    metadata.max.age.ms = 300000
    metric.reporters = []
    metrics.num.samples = 2
    metrics.recording.level = INFO
    metrics.sample.window.ms = 30000
    partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
    receive.buffer.bytes = 65536
    reconnect.backoff.max.ms = 1000
    reconnect.backoff.ms = 50
    request.timeout.ms = 30000
    retry.backoff.ms = 100
    sasl.client.callback.handler.class = null
    sasl.jaas.config = null
    sasl.kerberos.kinit.cmd = /usr/bin/kinit
    sasl.kerberos.min.time.before.relogin = 60000
    sasl.kerberos.service.name = null
    sasl.kerberos.ticket.renew.jitter = 0.05
    sasl.kerberos.ticket.renew.window.factor = 0.8
    sasl.login.callback.handler.class = null
    sasl.login.class = null
    sasl.login.refresh.buffer.seconds = 300
    sasl.login.refresh.min.period.seconds = 60
    sasl.login.refresh.window.factor = 0.8
    sasl.login.refresh.window.jitter = 0.05
    sasl.mechanism = GSSAPI
    security.protocol = PLAINTEXT
    security.providers = null
    send.buffer.bytes = 131072
    session.timeout.ms = 10000
    ssl.cipher.suites = null
    ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
    ssl.endpoint.identification.algorithm = https
    ssl.key.password = null
    ssl.keymanager.algorithm = SunX509
    ssl.keystore.location = null
    ssl.keystore.password = null
    ssl.keystore.type = JKS
    ssl.protocol = TLS
    ssl.provider = null
    ssl.secure.random.implementation = null
    ssl.trustmanager.algorithm = PKIX
    ssl.truststore.location = null
    ssl.truststore.password = null
    ssl.truststore.type = JKS
    value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer

Reproducible Test Case

Please see the MWE.

Full logs for the consumer nodes are found in this gist

SmedbergM commented 4 years ago

An update: using Consumer.committablePartitionedSource appears to solve this problem. Unfortunately, it requires the user to know the maximum number of partitions that a single consumer might be assigned, so it cannot serve as a drop-in replacement for the existing Consumer.atMostOnceSource API.

See this branch of the MWE project.
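
For illustration, a rough sketch of that workaround (the topic name and the partition bound maxAssignedPartitions are assumptions; commitScaladsl, though deprecated in 2.0, is used here to commit each offset before processing):

    import akka.Done
    import akka.actor.ActorSystem
    import akka.kafka.scaladsl.Consumer
    import akka.kafka.{ConsumerSettings, Subscriptions}
    import akka.stream.ActorMaterializer
    import akka.stream.scaladsl.Sink
    import org.apache.kafka.common.serialization.{IntegerDeserializer, StringDeserializer}

    import scala.concurrent.Future

    object PartitionedWorkaround extends App {
      implicit val system: ActorSystem = ActorSystem("consumer_system")
      implicit val mat: ActorMaterializer = ActorMaterializer()
      import system.dispatcher

      val settings = ConsumerSettings(system, new IntegerDeserializer, new StringDeserializer)
        .withBootstrapServers("kafka:9092")
        .withGroupId("mwe_group")

      // flatMapMerge needs an upper bound on how many partitions one consumer
      // may be assigned at once; this is the value the user must know in advance.
      val maxAssignedPartitions = 16 // assumption for illustration

      Consumer
        .committablePartitionedSource(settings, Subscriptions.topics("mwe_topic"))
        .flatMapMerge(maxAssignedPartitions, { case (_, partitionSource) =>
          partitionSource.mapAsync(1) { msg =>
            // Commit before processing, approximating at-most-once semantics.
            // When the partition is revoked in a rebalance, this sub-source
            // completes instead of re-emitting the in-flight record.
            msg.committableOffset.commitScaladsl().flatMap { _ =>
              Future {
                println(s"processed message #${msg.record.key} -> ${msg.record.value}")
                Done
              }
            }
          }
        })
        .runWith(Sink.ignore)
    }

Because committablePartitionedSource completes the sub-source for each revoked partition on a rebalance, a record that is mid-commit is not handed to the partition's new owner, which would explain why the linked branch no longer sees duplicates.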

ennru commented 4 years ago

Thank you for reporting this. We changed the internals of committing quite a bit in 1.1 and 2.0; something might have slipped in there.