factorhouse / kpow

Kpow for Apache Kafka
https://factorhouse.io/kpow

Potential memory leak? #5

Closed: AntoineDuComptoirDesPharmacies closed this issue 2 years ago

AntoineDuComptoirDesPharmacies commented 2 years ago

Hi,

We are currently using kPow in production to monitor our Kafka usage. Every day, kPow crashes with an out-of-memory error. We can see memory consumption grow at a steady rate until it reaches a critical level and the container restarts.

Is there something we can do to reduce this?

Here is a graph of kPow's CPU/RAM usage over time: [image: CPU/RAM usage graph]

Consumer lag by group (oprtr.compute.metrics.v2 is purple): [image: consumer lag by group graph]

Global lag: [image: global lag graph]

kPow version: 88.3
kPow usage: our team rarely connects to the UI (once per month)

Thanks in advance for your help. Yours faithfully, LCDP

d-t-w commented 2 years ago

Hi @AntoineDuComptoirDesPharmacies - thanks for the information.

If you are using our Docker container I suspect you might be running into the issue we recently identified in the Amazon Corretto 11 base image: https://kpow.io/articles/corretto-memory-issues/

This will be resolved by Amazon on July 19th (and we will release a version of Kpow with that update shortly after).

In the meantime you can update to 88.7-jdk17, which uses the Amazon Corretto 17 base image and doesn't have this memory issue.
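For anyone running Kpow in ECS, switching tags is just a change to the image field in the container definition. A minimal sketch, with the image reference as a placeholder for whichever registry you currently pull Kpow from:

```json
{
  "containerDefinitions": [
    {
      "name": "kpow",
      "image": "<your-kpow-image>:88.7-jdk17"
    }
  ]
}
```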

AntoineDuComptoirDesPharmacies commented 2 years ago

Hi @d-t-w,

Thank you for this fast answer! We updated to 88.7-jdk17 on the 3rd of June 2022, but since that date we have seen no real change. You can see the shutdown that occurred mid-day on the 3rd of June to restart the container with the new version; after that, memory usage continues with the same pattern.

[image: memory usage graph after updating to 88.7-jdk17]

Note: we did not set any hard/soft memory limit on the kPow task in the ECS task definition. We are going to add one and will let you know if it changes anything.

Yours faithfully, LCDP

d-t-w commented 2 years ago

Thanks for the info @AntoineDuComptoirDesPharmacies

It's our understanding that 88.7-jdk17 should fix the container memory issue, but we've had reports that it might have the same issue as jdk11, which is puzzling.

While we look into this further, I would suggest switching to our normal 88.7 container and using the short-term fix described in the blog post, where you set the following environment variable:

```
JVM_OPTS=-server -Dclojure.core.async.pool-size=8 -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xms1638M -Xmx1638M
```

That will override the JVM_OPTS defined in our Dockerfile with options that specifically constrain the JVM heap to 1638M (assuming you have a 2GB memory allocation for your container in ECS, this represents 80% of that allocation).
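For reference, in an ECS task definition that environment variable can be passed through the container definition's environment list. A minimal sketch; the container name and image reference are placeholders, and the JVM_OPTS value is the one quoted above:

```json
{
  "containerDefinitions": [
    {
      "name": "kpow",
      "image": "<your-kpow-image>:88.7",
      "environment": [
        {
          "name": "JVM_OPTS",
          "value": "-server -Dclojure.core.async.pool-size=8 -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xms1638M -Xmx1638M"
        }
      ]
    }
  ]
}
```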

d-t-w commented 2 years ago

Hi @AntoineDuComptoirDesPharmacies I think perhaps there's a simpler solution here:

Note : We did not set any hard/soft limit on the kpow task in the ECS task definition.

I missed that originally. It is important to set container memory limits as that's how the Kpow JVM process inside the container knows how much memory it can use and when to garbage collect.

I think simply setting a reasonable Kpow container memory limit (~2GB is our normal recommendation) will resolve this issue.
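A minimal sketch of what that limit looks like in an ECS container definition, assuming the ~2GB recommendation above (memory is the hard limit and memoryReservation the soft limit, both in MiB; the container name, image reference, and soft-limit value are illustrative):

```json
{
  "containerDefinitions": [
    {
      "name": "kpow",
      "image": "<your-kpow-image>:88.7",
      "memory": 2048,
      "memoryReservation": 1024
    }
  ]
}
```

With a hard limit in place, the JVM inside the container can see how much memory it is allowed to use and size its heap and garbage collection accordingly.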

AntoineDuComptoirDesPharmacies commented 2 years ago

Hi @d-t-w,

To follow up on our discussion by email, and for future readers, here is what we tried:

Here is the new memory graph, which looks good: [image: memory usage graph after the changes]

Yours faithfully, LCDP