Closed — jainal09 closed this issue 7 months ago
Even modifying your Transactional Consume-Process-Produce example a little generates duplicate messages. Below is a reproducible example:
```python
import asyncio

from aiokafka import TopicPartition, AIOKafkaConsumer, AIOKafkaProducer
from fault_tolerant_aio_kafka.logs import logger

IN_TOPIC = "my_topic"
GROUP_ID = "my_group_id"
OUT_TOPIC = "new_topic"
TRANSACTIONAL_ID = "my-txn-id"
BOOTSTRAP_SERVERS = "localhost:9092"
POLL_TIMEOUT = 60_000


def process_batch(msgs):
    # Group by key, do simple count sampling by a minute window
    # buckets_by_key = defaultdict(Counter)
    # for msg in msgs:
    #     timestamp = (msg.timestamp // 60_000) * 60
    #     buckets_by_key[msg.key][timestamp] += 1

    res = []
    # for key, counts in buckets_by_key.items():
    #     for timestamp, count in counts.items():
    #         value = str(count).encode()
    #         res.append((key, value, timestamp))
    for msg in msgs:
        res.append(msg.value)
    return res


async def transactional_process():
    consumer = AIOKafkaConsumer(
        IN_TOPIC,
        bootstrap_servers=BOOTSTRAP_SERVERS,
        enable_auto_commit=False,
        group_id=GROUP_ID,
        isolation_level="read_committed",  # <-- This will filter aborted txn's
    )
    await consumer.start()
    print("consumer started")

    producer = AIOKafkaProducer(
        bootstrap_servers=BOOTSTRAP_SERVERS,
        transactional_id=TRANSACTIONAL_ID,
    )
    await producer.start()

    try:
        while True:
            msg_batch = await consumer.getmany(timeout_ms=POLL_TIMEOUT)

            async with producer.transaction():
                commit_offsets = {}
                in_msgs = []
                for tp, msgs in msg_batch.items():
                    in_msgs.extend(msgs)
                    commit_offsets[tp] = msgs[-1].offset + 1
                out_msgs = process_batch(in_msgs)
                for msg in out_msgs:
                    logger.info(f"Received message: {msg}")
                    await producer.send(OUT_TOPIC, value=msg)

                # We commit through the producer because we want the commit
                # to only succeed if the whole transaction is done
                # successfully.
                await producer.send_offsets_to_transaction(
                    commit_offsets, GROUP_ID)
    finally:
        await consumer.stop()
        await producer.stop()


if __name__ == "__main__":
    asyncio.run(transactional_process())
```
At this point I am really confused about what I am doing wrong, and if @tvoinarovskyi, @fabregas, @ods, @multani, @selevit could please explain and help me resolve this, it would be really helpful!
@jainal09 How many messages are you producing, and what is the duplicate rate? Your random generator function only produces 456,976 unique strings. With 100 messages you will have around a 1% probability of duplicates; with 1,000 messages, about 66% (an approximation of the "birthday problem"). Even better, you can run the same test but produce a unique incrementing sequence.
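The percentages above follow from the standard birthday-problem approximation p ≈ 1 − exp(−n(n−1)/2N), where N = 456,976 (presumably four random lowercase letters, since 26⁴ = 456,976). A quick sketch to check them:

```python
import math


def collision_probability(n, space=26 ** 4):
    """Approximate birthday-problem probability of at least one
    duplicate when drawing n values uniformly from `space` values."""
    return 1 - math.exp(-n * (n - 1) / (2 * space))


print(f"{collision_probability(100):.2%}")   # ~1%
print(f"{collision_probability(1000):.2%}")  # ~66%
```

So the "duplicates" in the output topic are expected from the payload generator alone, independent of any transaction behavior.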
I believe you are right, and using uuid.uuid4 gave me no duplicate values.
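For comparison, here is a sketch of the two payload styles discussed above; the short-random-string generator is my reconstruction of what the original test presumably did (26⁴ = 456,976 possible values), while the uuid4 variant is the fix that removed the duplicates:

```python
import random
import string
import uuid


def random_string_payload(length=4):
    # Only 26**length possible values, so birthday collisions
    # appear after a few hundred messages.
    return "".join(random.choices(string.ascii_lowercase, k=length)).encode()


def uuid_payload():
    # uuid4 carries 122 random bits; collisions are practically impossible.
    return str(uuid.uuid4()).encode()
```

With uuid4 payloads, any remaining duplicates in the output topic would point at the pipeline rather than the generator.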
Describe the bug I have a script which fetches a message, produces it to a new topic, and commits the offset in a transaction. The problem is that I am receiving duplicate messages in the new topic: the offset is committed, but when I fetch the next message with the incremented offsets, I receive the previously committed message again.
Expected behaviour A message should be consumed and produced to the new topic, then the offsets should be committed, and then a new message should be consumed with the updated offsets.
Environment (please complete the following information):
- `python -c "import aiokafka; print(aiokafka.__version__)"`: 0.8.1
- `python -c "import kafka; print(kafka.__version__)"`: 2.0.2
- `kafka-topics.sh --version`: 3.3.1 (Commit:e23c59d00e687ff5)

Reproducible example
Producer Code
Consumer which checks for duplicate messages
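The duplicate-checking consumer itself is not shown in the thread. The core of such a checker, independent of Kafka, is just set-based deduplication over the payloads read back; a minimal sketch (the function name and in-memory seen-set are my assumptions, and in the issue's setup `values` would be the message values read from OUT_TOPIC with a `read_committed` consumer):

```python
def find_duplicates(values):
    """Return the set of payloads that appear more than once in `values`."""
    seen, dupes = set(), set()
    for v in values:
        if v in seen:
            dupes.add(v)  # second (or later) sighting -> duplicate
        seen.add(v)
    return dupes


print(find_duplicates([b"a", b"b", b"a"]))  # {b'a'}
```

An in-memory set is only viable while the topic fits in memory; a real checker over a large topic would need a bounded structure such as a Bloom filter.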