ant-media / Ant-Media-Server

Ant Media Server is a live streaming engine software that provides adaptive, ultra low latency streaming by using WebRTC technology with ~0.5 seconds latency. Ant Media Server is auto-scalable and it can run on-premise or on-cloud.
https://antmedia.io

System memory and JVM heap memory gradually increasing to 100% in Kubernetes cluster #5399

Closed: timantmedia closed this issue 11 months ago

timantmedia commented 1 year ago

Short description

After gradually ramping up to 500 RTMP streams published from a simulator, the Kubernetes nodes' memory usage gradually climbs to 100%.

The load also does not appear to be distributed evenly across the available nodes: roughly 90% of the resources are consumed by only 1 of the 2 nodes until, eventually (after 21+ hours), that node reaches 100% memory utilisation.

System memory and JVM heap memory reach 100% and crash the node. Another node then launches, and the same thing happens again.

At the point where 490 streams are being broadcast on a single node, resource utilisation is only at 60% and JVM heap memory at 35%; it is only after many hours that resources reach 100% and crash the node.

After 20 hours, resource usage for the respective nodes is:

Node 1: System memory 91%, JVM heap memory 92%
Node 2: System memory 13%, JVM heap memory 5%
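For reference, a minimal way to spot-check these per-node and per-pod numbers while the test runs, assuming metrics-server is installed and that the Java process runs as PID 1 inside the AMS container (the namespace and pod name below are placeholders):

```bash
# Node-level CPU/memory as reported by metrics-server
kubectl top nodes

# Per-pod usage in the Ant Media namespace (namespace is a placeholder)
kubectl top pods -n antmedia

# JVM heap summary inside one AMS pod (pod name is a placeholder);
# jcmd ships with the JDK that runs Ant Media Server
kubectl exec -n antmedia ant-media-server-0 -- jcmd 1 GC.heap_info
```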

Environment

Steps to reproduce

  1. Launch AMS on Kubernetes and set the resource limit to 60% utilisation (a hedged config sketch follows this list).
  2. Over a few hours, increase RTMP publishing (up to 500 streams) and observe the memory usage exceed the 60% resource limit.
  3. Another node is created with around 10% load, while the first node keeps increasing its memory usage until it reaches 100% and crashes.
  4. Wait for the second node to reach 60% utilisation; another node is then created again.
  5. The cycle repeats.
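As a rough sketch of the 60% target in step 1, this is one way to express it with the standard HorizontalPodAutoscaler v2 API, assuming metrics-server is available and resource requests are set on the AMS containers; the deployment name, namespace, and replica bounds are placeholders, and the customer's actual VKE node-autoscaling setup may be configured differently:

```bash
kubectl apply -n antmedia -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ant-media-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ant-media-server   # placeholder deployment name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 60
EOF
```

Note that utilisation targets are computed against the pods' resource requests, and a scale-out only adds pods; already-established RTMP connections stay pinned to the pod they first landed on, which would be consistent with the uneven distribution described above.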

Expected behavior

Resource usage should be distributed evenly across the two nodes, and memory usage should not keep increasing.

Actual behavior

Explained in detail in the meeting https://drive.google.com/file/d/12qhiSNnzpQMBWNfNICU8jaiQcj1hVxbZ/view?usp=sharing

From the customer

We have upgraded our AMS version to 2.6.2 and also upgraded the resource limits:

CPU utilization: 60%, system memory: 60%

We observed that system memory increases over time. After around 24 hours it reaches 100% and AMS becomes unresponsive.

Is there any way to reduce the system memory usage? Or at what point will the system memory decrease or settle down?

Yesterday, when AMS crashed with the JVM heap memory exhausted, I collected a heap dump.

I have uploaded the heap dump to the following location. If possible, please check it for suspected memory leaks: https://drive.google.com/file/d/1SiUZcRGXPEjQ7WjCWIJ8aw8GvW77sdUk/view?usp=sharing
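For reference, a heap dump like the one shared above can typically be captured with standard JDK tooling from inside the pod, assuming the Java process is PID 1 in the container (namespace, pod name, and paths are placeholders):

```bash
# Trigger a heap dump of live objects inside the AMS container
kubectl exec -n antmedia ant-media-server-0 -- \
  jmap -dump:live,format=b,file=/tmp/ams-heap.hprof 1

# Copy the dump out of the pod for analysis (e.g. Eclipse MAT or VisualVM)
kubectl cp antmedia/ant-media-server-0:/tmp/ams-heap.hprof ./ams-heap.hprof
```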


Logs

The customer disabled logs because disk usage was causing performance issues.


timantmedia commented 1 year ago

@burak-58 @muratugureminoglu Another customer is also experiencing JVM heap memory increasing to 100% and then crashing the node. They are using AWS and have tried both a manual cluster and auto-scaling on AWS; each time, the JVM heap memory increases to 100% on the nodes, which then become unresponsive.

FD 99457.

timantmedia commented 1 year ago

@muratugureminoglu As I understand, you tried reproducing this issue and found that the JVM heap memory did not reach 100%. This was using Vultr Kubernetes, right?

mekya commented 1 year ago

Murat cannot reproduce the problem.

In this issue, the missing points are the following:

timantmedia commented 1 year ago

@mekya @muratugureminoglu

Here are the details:

System configuration

8 vCPUs, 16 GB RAM

In Vultr: load balancer with auto-scaling enabled (algorithm: round robin). Also in Vultr VKE, the metrics service is configured to scale up or down based on the resource limits above.

Ant Media Server 2.6.0, WebRTC disabled, MongoDB cluster. Around 500 streams are pushed from a cloud VM simulator (via an FFmpeg command) to the AMS RTMP endpoint at 1 Mbps.
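A minimal sketch of the simulator side described above, assuming a local source file encoded at roughly 1 Mbps and the default LiveApp application on the server (the host name, file name, and ramp-up delay are placeholders):

```bash
#!/usr/bin/env bash
# Push N looping RTMP streams to the AMS RTMP endpoint.
AMS_HOST="ams.example.com"   # placeholder
STREAM_COUNT=500

for i in $(seq 1 "$STREAM_COUNT"); do
  ffmpeg -re -stream_loop -1 -i test_1mbps.mp4 \
    -c copy -f flv \
    "rtmp://${AMS_HOST}/LiveApp/stream${i}" \
    >/dev/null 2>&1 &
  sleep 1   # ramp up gradually, as in the report
done
wait
```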

The issue as explained by the customer, for reference:

Scenario / Problem Statement

Questions

How can the load be distributed to the second node once memory utilisation passes 80%, so that the distribution across nodes is even?

mekya commented 11 months ago

I'm closing this issue because we've provided the fix. Please feel free to re-open it if you encounter the same issue.