EmanueleAlbero opened this issue 3 months ago (status: Open)
Hey @EmanueleAlbero,
By default, Streamiz (and Kafka Streams JAVA as well) uses one RocksDb instance per store per partition; for a windowed store it's at least 3 RocksDb instances per store per partition.
For each RocksDb instance there is dedicated unmanaged memory: write buffers (memtables), index and filter blocks, and the block cache.
So the more partitions you have, the more unmanaged memory you need, especially if you have stream-stream join operations.
In JAVA, you can configure a RocksDb config setter to override the default behavior: https://docs.confluent.io/platform/current/streams/developer-guide/memory-mgmt.html#rocksdb
In Streamiz, you can do more or less the same thing, except that you can't yet keep the index and filter blocks inside the block cache, which would avoid a lot of unmanaged memory consumption. Bounding the unmanaged memory could be a good enhancement.
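To make this concrete, here is a minimal sketch of bounding some of that per-instance memory through the RocksDbConfigHandler hook. The values are just examples, and the option setter names follow the RocksDbSharp options API, so they may differ slightly depending on the version:

```csharp
using Streamiz.Kafka.Net;
using Streamiz.Kafka.Net.SerDes;

var config = new StreamConfig<StringSerDes, StringSerDes>
{
    ApplicationId = "memory-bounded-app" // placeholder
};

// The handler is invoked for every RocksDb instance (per store, per partition),
// so any limit set here applies to each instance individually.
config.RocksDbConfigHandler = (storeName, options) =>
{
    // Example values only: cap the memtable footprint of each instance.
    options.SetWriteBufferSize(16 * 1024 * 1024); // 16 MiB per write buffer
    options.SetMaxWriteBufferNumber(2);           // at most 2 buffers in memory

    // Note: pinning index/filter blocks inside a shared, bounded block cache
    // is the part that is not configurable from here yet (see above).
};
```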
Let me fill the gap in the next release.
Hey @LGouellec, thanks, very informative! Can I also ask you for some more information about what I've experienced?
I saw that even if retention kicks in, or there is no data on the KStreams for a long time (hours), the memory doesn't get freed (or, when it does, the freed memory is marginal). Here is an example from my last run.
I see the unmanaged memory growing even when there is nothing on the feed (less than when the joins are active, but still...). In this case most of it is due to the metrics, which by the way also had an issue: if I check my app's /metrics endpoint, I can only see the Streamiz metrics (where it also exposes the topology) for a few minutes, then it stops exposing them. I don't have a picture for this, but in my last test I ran the application in an isolated environment with no data on the feed; it started at 300MB and after 5 days the memory was at 1.8GB, most of it unmanaged (~1.5GB).
Hey @EmanueleAlbero ,
1- RocksDb uses indexes and filters to look up data quickly. These indexes and filters are stored in memory, but let me try to reproduce to rule out a memory leak.
2- Which metrics package do you use? Streamiz.Kafka.Net.Metrics.Prometheus or Streamiz.Kafka.Net.Metrics.OpenTelemetry?
Btw, I'm currently conducting a satisfaction survey to understand how I can serve you better, and I would love to get your feedback on the product. Your insights are invaluable and will help shape the future of the product to better meet your needs. The survey will only take a few minutes, and your responses will be completely confidential.
Thank you for your time and feedback! Best regards,
Hi @LGouellec, I'm using OpenTelemetry.
Here is the result of a test with metrics disabled entirely:
I've participated in the survey and I want to thank you once again for the amazing job you are doing. Let me know if I can be any more helpful.
Hey @EmanueleAlbero ,
So it seems that the OpenTelemetry exporter has a memory leak. I'll fix it. Can you test the Prometheus exporter and tell me if the problem is still there or not?
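Switching reporters should be roughly a one-line change; a sketch assuming the UsePrometheusReporter / UseOpenTelemetryReporter extension methods from the two metrics packages (the port is just an example):

```csharp
using Streamiz.Kafka.Net;
using Streamiz.Kafka.Net.Metrics.Prometheus; // from the Streamiz.Kafka.Net.Metrics.Prometheus package
using Streamiz.Kafka.Net.SerDes;

var config = new StreamConfig<StringSerDes, StringSerDes>
{
    ApplicationId = "metrics-exporter-test" // placeholder
};

// Previously: config.UseOpenTelemetryReporter();
// For the test, expose the same metrics through the Prometheus reporter instead
// (9090 is an arbitrary port for this example).
config.UsePrometheusReporter(9090);
```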
Hi @LGouellec, we are experiencing a similar memory leak when using the Prometheus exporter instead of OpenTelemetry. We are using Streamiz 1.6.
Hey @hedmavx ,
You mean that if you disable the Prometheus exporter, you no longer have the memory leak?
@hedmavx ,
Can you reproduce the memory leak with the Prometheus exporter and provide a thread dump, please?
Best regards,
@EmanueleAlbero
I have found the memory leak in the OpenTelemetry reporter. I'll try to fix it ASAP.
Best regards,
@hedmavx ,
Can you reproduce the memory leak with the Prometheus exporter and provide a thread dump, please?
Best regards,
Hi, sadly we can't get a thread dump in the environment where we are running the application.
Best regards
Description
Hi, I'll start by saying that I don't know whether this is an actual issue or just a misconfiguration problem. However, I have a topology with two KStreams and an outer join between them. KStream1 receives data every 100 ms, KStream2 receives data roughly every 3-4 s, and the join uses a RocksDb store with default settings. Everything works fine, but I can see the unmanaged memory keep growing indefinitely.
This is a picture from dotMemory of the application after several hours of work (on the very same partition). To add more context, I'm also applying a 20-minute grace period, a 10-minute retention time, and 10 minutes for WindowStoreChangelogAdditionalRetentionMs.
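For context, the topology looks roughly like this (a simplified sketch; topic names, serdes, and the output step are placeholders, and the grace-period configuration is omitted here):

```csharp
using System;
using Streamiz.Kafka.Net;
using Streamiz.Kafka.Net.SerDes;
using Streamiz.Kafka.Net.Stream;

var config = new StreamConfig<StringSerDes, StringSerDes>
{
    ApplicationId = "outer-join-memory-test", // placeholder
    BootstrapServers = "localhost:9092"       // placeholder
};
// 10 minutes of additional changelog retention, as described above.
config.WindowStoreChangelogAdditionalRetentionMs =
    (long)TimeSpan.FromMinutes(10).TotalMilliseconds;

var builder = new StreamBuilder();

// KStream1 receives a record roughly every 100 ms, KStream2 roughly every 3-4 s.
var stream1 = builder.Stream<string, string>("input-topic-1"); // placeholder
var stream2 = builder.Stream<string, string>("input-topic-2"); // placeholder

// Outer join over a 10-minute window, backed by the default RocksDb window stores.
stream1
    .OuterJoin(stream2,
        (v1, v2) => $"{v1}|{v2}",
        JoinWindowOptions.Of(TimeSpan.FromMinutes(10)))
    .To("joined-output"); // placeholder

var kafkaStream = new KafkaStream(builder.Build(), config);
await kafkaStream.StartAsync();
```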
Running the application in either a Linux or a Windows environment shows similar behavior.
Is there something I can check/verify in the configuration to avoid this issue?