Open tigerinus opened 2 years ago
If I call flush() on every publish instead of every 1000 messages, memory still leaks, just much more slowly. The process still gets OOMKilled eventually, in about 30 minutes.
I've added some verbose logging to capture the count of remaining unpublished messages in the Kafka internal queue every second (updated the snippet above):
2022-06-03 13:24:29,463 MainProcess(9) INFO kafka_consumer::__count_consumed - 1 messages consumed - last offset: 62597, last timestamp: 2022-05-29 18:02:19.732000 (1653847339732)
2022-06-03 13:24:35,550 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 999 messages published (0 messages pending for delivery)
2022-06-03 13:24:42,028 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 1001 messages published (0 messages pending for delivery)
2022-06-03 13:24:48,462 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 1001 messages published (0 messages pending for delivery)
2022-06-03 13:24:54,916 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 1001 messages published (0 messages pending for delivery)
2022-06-03 13:24:54,983 MainProcess(9) INFO kafka_consumer::__count_consumed - 1 messages consumed - last offset: 62598, last timestamp: 2022-05-29 18:02:27.617000 (1653847347617)
2022-06-03 13:25:01,379 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 998 messages published (0 messages pending for delivery)
2022-06-03 13:25:07,885 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 1001 messages published (0 messages pending for delivery)
2022-06-03 13:25:14,369 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 1001 messages published (0 messages pending for delivery)
2022-06-03 13:25:20,849 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 1001 messages published (0 messages pending for delivery)
2022-06-03 13:25:27,665 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 1001 messages published (0 messages pending for delivery)
2022-06-03 13:25:34,190 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 1001 messages published (0 messages pending for delivery)
2022-06-03 13:25:40,643 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 1001 messages published (0 messages pending for delivery)
2022-06-03 13:25:46,975 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 996 messages published (0 messages pending for delivery)
2022-06-03 13:25:53,397 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 1001 messages published (0 messages pending for delivery)
2022-06-03 13:25:53,610 MainProcess(9) INFO kafka_consumer::__count_consumed - 1 messages consumed - last offset: 62599, last timestamp: 2022-05-29 18:02:29.433000 (1653847349433)
2022-06-03 13:25:59,878 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 996 messages published (0 messages pending for delivery)
2022-06-03 13:26:06,369 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 999 messages published (0 messages pending for delivery)
2022-06-03 13:26:12,721 KafkaProducerWorker(10) INFO kafka_producer::__count_published - 1001 messages published (0 messages pending for delivery)
Most of the time there are 0 messages pending for delivery, i.e. all messages are published in time. So the high memory usage is unlikely to come from messages remaining in the queue.
@tigerinus is this also present in 1.9.0?
worth trying 1.9.0, but I don't recall this coming up.
it's unusual to call flush except on producer shutdown. perhaps try a poll-based solution instead (something along the lines of https://github.com/confluentinc/confluent-kafka-python/blob/master/examples/asyncio_example.py )
with that said, i'm going to preemptively label this a bug, even though I haven't looked into it: I don't see why this should leak, and I believe you that it does.
Note from one of our customers: this issue is also present on 1.9.0.
Although, it's worth pointing out that I still get the leak without calling flush.
does the issue persist if you specify a delivery callback method?
i'm having the same issue with my fastapi application using the latest version of the lib. just creating a producer without sending any messages causes it to leak from the same spot.
Description
I have a microservice that consumes messages from Kafka, does some work with them, and publishes the result back to Kafka.
However, it quickly gets OOMKilled after starting.
With the help of a memory profiler, I managed to figure out that it's rdk:broker0 that contributes the biggest memory usage (in my example it's a 384MiB pod in Kubernetes). As seen in this report, there is no Python object that holds anything larger than 2MB from the GC's perspective; it's rdk:broker0 holding 4460 allocations and 165MiB of memory unreleased.
Here is the KafkaProducerService code that calls Producer:
How to reproduce
Checklist
Please provide the following information:
- confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()): {...}
- Client configuration: default, except bootstrap.servers
- Operating system: reproduced on both Alpine and Debian (Bullseye)