ant-media / Ant-Media-Server

Ant Media Server is a live streaming engine software that provides adaptive, ultra low latency streaming by using WebRTC technology with ~0.5 seconds latency. Ant Media Server is auto-scalable and it can run on-premise or on-cloud.
https://antmedia.io

System memory and JVM heap memory gradually increasing to 100% in Kubernetes cluster #5399

Closed: timantmedia closed this issue 11 months ago

timantmedia commented 1 year ago

Short description

After gradually ramping up to 500 RTMP streams published from a simulator, the Kubernetes nodes' memory usage gradually climbs to 100%.

The load also does not appear to be distributed evenly across the available nodes: roughly 90% of the resources are consumed by only 1 of the 2 nodes until, eventually (after 21+ hours), that node reaches 100% memory utilisation.

System memory and JVM heap memory reach 100% and crash the node. Another node then launches, and the same thing happens again.

At the point where 490 streams are being broadcast on a single node, resource utilisation is only at 60% and JVM heap memory at 35%; it is only after many hours that resources reach 100% and crash the node.

After 20 hours, resource usage for the respective nodes is:

Node 1: System memory 91%, JVM heap memory 92%
Node 2: System memory 13%, JVM heap memory 5%
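For reference, a minimal way to spot-check these per-node and per-pod numbers while the test runs, assuming metrics-server is installed and that the Java process runs as PID 1 inside the AMS container (the namespace and pod name below are placeholders):

```bash
# Node-level CPU/memory as reported by metrics-server
kubectl top nodes

# Per-pod usage in the Ant Media namespace (namespace is a placeholder)
kubectl top pods -n antmedia

# JVM heap summary inside one AMS pod (pod name is a placeholder);
# jcmd ships with the JDK that runs Ant Media Server
kubectl exec -n antmedia ant-media-server-0 -- jcmd 1 GC.heap_info
```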

Environment

Steps to reproduce

  1. Launch AMS on Kubernetes and set the resource limit to 60% utilisation (a hedged config sketch follows this list).
  2. Over a few hours, increase RTMP publishing (up to 500 streams) and observe the memory usage exceed the 60% resource limit.
  3. Another node is created with around 10% load, while the first node keeps increasing its memory usage until it reaches 100% and crashes.
  4. Wait for the second node to reach 60% utilisation; another node is then created again.
  5. The cycle repeats.
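As a rough sketch of the 60% target in step 1, this is one way to express it with the standard HorizontalPodAutoscaler v2 API, assuming metrics-server is available and resource requests are set on the AMS containers; the deployment name, namespace, and replica bounds are placeholders, and the customer's actual VKE node-autoscaling setup may be configured differently:

```bash
kubectl apply -n antmedia -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ant-media-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ant-media-server   # placeholder deployment name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 60
EOF
```

Note that utilisation targets are computed against the pods' resource requests, and a scale-out only adds pods; already-established RTMP connections stay pinned to the pod they first landed on, which would be consistent with the uneven distribution described above.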

Expected behavior

Resource usage should be distributed evenly across the two nodes, and memory usage should not keep increasing.

Actual behavior

Explained in detail in the meeting https://drive.google.com/file/d/12qhiSNnzpQMBWNfNICU8jaiQcj1hVxbZ/view?usp=sharing

From the customer

We have upgraded our AMS version to 2.6.2 and also upgraded the resource limits:

CPU utilization: 60%, system memory: 60%

We observed that system memory increases over time. After around 24 hours it reaches 100% and AMS becomes unresponsive.

Is there any way to reduce the system memory usage? Or at what point will the system memory decrease or settle down?

Yesterday, when AMS crashed with the JVM heap memory exhausted, I collected a heap dump.

I have uploaded the heap dump to the following location. If possible, please check it for suspected memory leaks: https://drive.google.com/file/d/1SiUZcRGXPEjQ7WjCWIJ8aw8GvW77sdUk/view?usp=sharing
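For reference, a heap dump like the one shared above can typically be captured with standard JDK tooling from inside the pod, assuming the Java process is PID 1 in the container (namespace, pod name, and paths are placeholders):

```bash
# Trigger a heap dump of live objects inside the AMS container
kubectl exec -n antmedia ant-media-server-0 -- \
  jmap -dump:live,format=b,file=/tmp/ams-heap.hprof 1

# Copy the dump out of the pod for analysis (e.g. Eclipse MAT or VisualVM)
kubectl cp antmedia/ant-media-server-0:/tmp/ams-heap.hprof ./ams-heap.hprof
```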


Logs

The customer disabled logs because disk usage was causing performance issues.


timantmedia commented 1 year ago

@burak-58 @muratugureminoglu Another customer is also experiencing JVM heap memory increasing to 100% and then crashing the node. They are using AWS and have tried both a manual cluster and auto-scaling on AWS; each time, the JVM heap memory increases to 100% on the nodes, which then become unresponsive.

FD 99457.

timantmedia commented 1 year ago

@muratugureminoglu As I understand, you tried reproducing this issue and found that the JVM heap memory did not reach 100%. This was using Vultr Kubernetes, right?

mekya commented 1 year ago

Murat cannot reproduce the problem.

In this issue, the missing points are the following:

timantmedia commented 1 year ago

@mekya @muratugureminoglu

Here are the details:

System configuration

8 vCPUs, 16 GB RAM

In Vultr: load balancer with auto-scaling enabled (algorithm: round robin). Also in Vultr VKE, the metrics service is configured to scale up or down based on the resource limits above.

Ant Media Server 2.6.0, WebRTC disabled, MongoDB cluster. Around 500 streams are pushed from a cloud VM simulator (via an FFmpeg command) to the AMS RTMP endpoint at 1 Mbps.
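A minimal sketch of the simulator side described above, assuming a local source file encoded at roughly 1 Mbps and the default LiveApp application on the server (the host name, file name, and ramp-up delay are placeholders):

```bash
#!/usr/bin/env bash
# Push N looping RTMP streams to the AMS RTMP endpoint.
AMS_HOST="ams.example.com"   # placeholder
STREAM_COUNT=500

for i in $(seq 1 "$STREAM_COUNT"); do
  ffmpeg -re -stream_loop -1 -i test_1mbps.mp4 \
    -c copy -f flv \
    "rtmp://${AMS_HOST}/LiveApp/stream${i}" \
    >/dev/null 2>&1 &
  sleep 1   # ramp up gradually, as in the report
done
wait
```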

The issue as explained by the customer, for reference:

Scenario / Problem Statement

Questions

How can the load be distributed to the second node once memory utilisation passes 80%, so that the distribution across nodes is even?

mekya commented 11 months ago

I'm closing this issue because we've provided the fix. Please feel free to re-open it if you encounter the same issue.