LGouellec / streamiz

.NET Stream Processing Library for Apache Kafka 🚀
https://lgouellec.github.io/streamiz/
MIT License

Struggling with unmanaged memory #361

Open EmanueleAlbero opened 2 months ago

EmanueleAlbero commented 2 months ago

Description

Hi, I'll start by saying that I don't know whether this is an actual issue or just a misconfiguration on my side. I have a topology with 2 KStreams and an outer join between them. KStream1 receives a record every 100 ms, KStream2 receives a record roughly every 3-4 s, and the join uses a RocksDB store with default settings. Everything works fine, but I can see the unmanaged memory growing indefinitely.

[Screenshot 2024-08-07: dotMemory snapshot]

This is a picture from dotMemory of the application after several hours of work (on the very same partition). To add more context, I'm also applying a 20-minute grace period, a 10-minute retention time, and 10 minutes for WindowStoreChangelogAdditionalRetentionMs.
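
(For reference, the shape of such a topology in Streamiz would look roughly like the sketch below. Topic names, serdes, the value joiner and the window size are placeholders, and the operator/option names are assumed from the Streamiz API rather than copied from the actual application.)

```csharp
// Hedged sketch only, not the reporter's real code.
using System;
using Streamiz.Kafka.Net;
using Streamiz.Kafka.Net.SerDes;
using Streamiz.Kafka.Net.Stream;

var config = new StreamConfig<StringSerDes, StringSerDes>
{
    ApplicationId = "outer-join-sample",       // hypothetical application id
    BootstrapServers = "localhost:9092"
};

var builder = new StreamBuilder();
var fast = builder.Stream<string, string>("fast-topic");   // ~1 record every 100 ms
var slow = builder.Stream<string, string>("slow-topic");   // ~1 record every 3-4 s

// Outer join backed by RocksDB window stores with default store settings.
// Grace period / retention tuning is omitted here; the exact helper names are not
// shown in this thread, so treat the window configuration as an assumption.
fast.OuterJoin(
        slow,
        (f, s) => $"{f}|{s}",
        JoinWindowOptions.Of(TimeSpan.FromMinutes(10)))
    .To("joined-topic");

var stream = new KafkaStream(builder.Build(), config);
await stream.StartAsync();
```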

Running the application on either a Linux or a Windows environment shows similar behavior.

Is there something I can check or verify in the configuration to avoid this issue?

LGouellec commented 2 months ago

Hey @EmanueleAlbero,

By default, Streamiz (and Kafka Streams in Java as well) uses one RocksDB instance per store per partition; for a windowed store, it's at least 3 RocksDB instances per store per partition.

Each RocksDB instance allocates its own unmanaged memory (write buffers/memtables, block cache, index and filter blocks).

So the more partitions you have, the more unmanaged memory you need, especially if you have stream-stream join operations.

In Java, you can configure a RocksDB config setter to override the default behavior: https://docs.confluent.io/platform/current/streams/developer-guide/memory-mgmt.html#rocksdb

In Streamiz, you can do more or less the same thing, except for pinning the index and filter blocks in the block cache, which would avoid a lot of unmanaged memory consumption. Adding that would be a good enhancement to bound the unmanaged memory.
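
As a rough sketch (the `RocksDbConfigHandler` hook and the setter names below are assumptions about the Streamiz/RocksDbSharp-style API, and the sizes are only examples), shrinking the per-instance write buffers looks something like this:

```csharp
using Streamiz.Kafka.Net;
using Streamiz.Kafka.Net.SerDes;

var config = new StreamConfig<StringSerDes, StringSerDes>
{
    ApplicationId = "bounded-rocksdb-app",     // hypothetical application id
    BootstrapServers = "localhost:9092"
};

// Assumed hook, analogous to the Java RocksDBConfigSetter: invoked for every
// RocksDB instance (per store, per partition) that Streamiz creates.
config.RocksDbConfigHandler = (storeName, options) =>
{
    // Cap each memtable at 16 MB and keep at most two of them per instance
    // (setter names are assumptions and may differ in the released API).
    options.SetWriteBufferSize(16 * 1024 * 1024);
    options.SetMaxWriteBufferNumber(2);
};
```

Since the handler applies to every store/partition instance, any savings here multiply with the partition count.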

Let me fill the gap in the next release.

EmanueleAlbero commented 2 months ago

Hey @LGouellec, thanks, very informative! Can I also ask you for some more information about what I've experienced?

  1. I saw that even when retention kicks in, or there is no data on the KStreams for a long time (hours), the memory doesn't get freed (or, when it does, the amount freed is marginal). Here is an example from my last run: [screenshot]

  2. I see the unmanaged memory growing even when there is nothing on the feed (more slowly than when the joins are active, but still...). In this case most of it is due to the metrics, which by the way also have an issue: if I check my app's /metrics endpoint, I can only see the Streamiz metrics (where it also exposes the topology) for a few minutes, then it stops exposing them. I don't have a picture of this, but in my last test I ran the application in an isolated environment with no data on the feed; it started at 300 MB and after 5 days the memory was at 1.8 GB, most of it unmanaged (~1.5 GB).

LGouellec commented 2 months ago

Hey @EmanueleAlbero ,

1- RocksDB uses indexes and filters to fetch data quickly. These indexes and filters are stored in memory, but let me try to reproduce to rule out a memory leak.

2- Which metrics package do you use? Streamiz.Kafka.Net.Metrics.Prometheus or Streamiz.Kafka.Net.Metrics.OpenTelemetry?

By the way, I'm currently conducting a satisfaction survey to understand how I can serve you better, and I would love to get your feedback on the product. Your insights are invaluable and will help shape the future of the product to better meet your needs. The survey will only take a few minutes, and your responses will be completely confidential.

Survey

Thank you for your time and feedback! Best regards,

EmanueleAlbero commented 2 months ago

Hi @LGouellec, I'm using OpenTelemetry.

Here is the result of a test with metrics disabled entirely: [screenshot]

I've taken the survey, and I want to thank you once again for the amazing job you are doing. Let me know if I can help any further.

LGouellec commented 2 months ago

Hey @EmanueleAlbero ,

So it seems that the OpenTelemetry exporter has a memory leak. I'll fix it. Can you test the Prometheus exporter and tell me whether the problem is still there?
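
(For reference, switching reporters should roughly amount to swapping the extension method on the config; the method names below are assumptions based on the two metrics packages:)

```csharp
using Streamiz.Kafka.Net;
using Streamiz.Kafka.Net.Metrics.Prometheus;  // instead of Streamiz.Kafka.Net.Metrics.OpenTelemetry
using Streamiz.Kafka.Net.SerDes;

var config = new StreamConfig<StringSerDes, StringSerDes>
{
    ApplicationId = "metrics-test-app",        // hypothetical application id
    BootstrapServers = "localhost:9092"
};

// Assumed extension method from the Prometheus metrics package: exposes the
// Streamiz metrics over HTTP on the given port instead of exporting them
// through the OpenTelemetry reporter.
config.UsePrometheusReporter(9090);
// previously: config.UseOpenTelemetryReporter();
```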

hedmavx commented 1 month ago

Hi @LGouellec, we are experiencing a similar memory leak when using the Prometheus exporter instead of OpenTelemetry. We are using Streamiz 1.6.

LGouellec commented 1 month ago

Hey @hedmavx ,

You mean that if you disable the Prometheus exporter, you no longer have a memory leak?

LGouellec commented 2 weeks ago

@hedmavx ,

Can you reproduce the memory leak with the Prometheus exporter and provide a thread dump, please?

Best regards,

LGouellec commented 2 weeks ago

@EmanueleAlbero

I have found the memory leak in the OpenTelemetry reporter. I'll try to fix it ASAP.

Best regards,