Closed timantmedia closed 11 months ago
@burak-58 @muratugureminoglu Another customer is also experiencing JVM heap memory increasing to 100% and then crashing the node. They are using AWS and have tried manual cluster and also auto-scaling on AWS and each time, the JVM heap memory increases to 100% on nodes, which then results in them becoming unresponsive.
FD 99457.
@muratugureminoglu As I understand, you tried reproducing this issue and found that the JVM heap memory did not reach 100%. This was using Vultr Kubernetes, right?
Murat cannot reproduce the problem.
In this issue, the missing points are these:
@mekya @muratugureminoglu
Here are the details:
8 vCPUs 16 GB RAM
In Vultr: load balancer with auto-scaling enabled (algorithm: Round Robin). Also, in Vultr VKE, the metrics service is configured to scale up or down based on the above resource limits.
Ant Media Server 2.6.0
WebRTC disabled
MongoDB cluster
Streaming around 500 streams from a Cloud VM simulator (pushed with an FFmpeg command) to the AMS RTMP endpoint at 1 Mbps
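For anyone trying to reproduce the load, the publishing setup described above can be sketched roughly as follows. This is only an illustration, not the customer's actual simulator: the source file `sample.mp4`, the host `ams.example.com`, the `LiveApp` application name, and the `stream<N>` naming are all assumptions. The ffmpeg options used (`-re`, `-stream_loop`, `-b:v`, `-f flv`) are standard ones.

```python
def build_publish_commands(source_file, rtmp_base_url, stream_count, video_bitrate="1M"):
    """Return one ffmpeg argument list per simulated RTMP publisher."""
    commands = []
    for index in range(stream_count):
        # Hypothetical stream naming; any unique stream id would do.
        stream_url = f"{rtmp_base_url}/stream{index}"
        commands.append([
            "ffmpeg",
            "-re",                  # read the input at its native frame rate
            "-stream_loop", "-1",   # loop the input file indefinitely
            "-i", source_file,
            "-c:v", "libx264",
            "-b:v", video_bitrate,  # target video bitrate (about 1 Mbps)
            "-c:a", "aac",
            "-f", "flv",            # RTMP publishing uses the FLV muxer
            stream_url,
        ])
    return commands


# Assumed endpoint and input file, for illustration only.
commands = build_publish_commands("sample.mp4", "rtmp://ams.example.com/LiveApp", 500)
print(len(commands))
print(" ".join(commands[0]))
```

Each argument list could then be started with `subprocess.Popen`, a few at a time, to mimic the gradual ramp-up to 500 streams.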
Scenario / Problem Statement
Questions
How can the load after 80% memory utilisation be distributed to the second node so there is even distribution?
I'm closing this issue because we've provided the fix. Please feel free to re-open it if you encounter the same issue.
Short description
After gradually deploying 500 streams publishing RTMP using a simulator, memory usage on the Kubernetes nodes gradually climbs to 100%.
The load also does not appear to be distributed evenly across the available nodes: 90% of the resources are used by only 1 of the 2 nodes until eventually (after 21+ hours) that node reaches 100% memory utilisation.
The system memory and JVM heap memory are getting to 100% and crashing the node. Then another node launches and then the same happens again.
At the time 490 streams are being broadcast on a single node, resource utilisation is only at 60% and JVM heap memory at 35%. Only after many hours do resources reach 100%, crashing the node.
After 20 hours, resource usage on the respective nodes is as follows:
Node 1: System Memory 91%, JVM heap memory 92%, Node 2: System Memory 13%, JVM heap memory 5%.
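As a rough sanity check on the figures above, the heap growth rate can be estimated with simple arithmetic. The sketch below assumes growth is linear and that the 35% reading and the 92% reading were taken 20 hours apart on the same node. Neither assumption is confirmed in the report, so treat the result as an order-of-magnitude estimate only.

```python
def hours_until_full(start_pct, later_pct, elapsed_hours, limit_pct=100.0):
    """Estimate the growth rate and remaining time to limit_pct, assuming linear growth."""
    growth_per_hour = (later_pct - start_pct) / elapsed_hours
    if growth_per_hour <= 0:
        # No growth observed, so no exhaustion time can be extrapolated.
        return growth_per_hour, None
    remaining_hours = (limit_pct - later_pct) / growth_per_hour
    return growth_per_hour, remaining_hours


# JVM heap on Node 1: 35% with 490 streams connected, 92% after 20 hours.
rate, remaining = hours_until_full(35.0, 92.0, 20.0)
print(f"{rate:.2f} points/hour, about {remaining:.1f} more hours until 100%")
```

Under these assumptions the heap grows by a little under 3 percentage points per hour and would fill up a few hours after the 20-hour mark, which is in line with the 21+ hours mentioned in the description.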
Environment
Steps to reproduce
Expected behavior
Resource usage should distribute evenly across the two nodes and the memory usage should not keep on increasing.
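One possible explanation for the uneven distribution (an assumption, not something confirmed in this issue) is that RTMP publishers hold long-lived connections, so a round-robin load balancer only spreads streams that connect after the second node exists. The toy simulation below illustrates that effect. The node names and the point at which the second node joins are hypothetical.

```python
from itertools import cycle


def assign_streams(total_streams, second_node_joins_at):
    """Assign long-lived streams round-robin across the nodes available at connect time."""
    counts = {"node1": 0, "node2": 0}
    one_node = cycle(["node1"])
    two_nodes = cycle(["node1", "node2"])
    for stream_index in range(total_streams):
        # Before the second node is up, every new stream lands on node1.
        pool = one_node if stream_index < second_node_joins_at else two_nodes
        counts[next(pool)] += 1
    return counts


late = assign_streams(500, second_node_joins_at=480)  # node2 scaled up late
early = assign_streams(500, second_node_joins_at=0)   # both nodes up from the start
print(late)
print(early)
```

If this is what is happening, already-connected streams stay where they are, and only reconnecting publishers or a lower scale-up threshold would shift load to the new node.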
Actual behavior
Explained in detail in the meeting https://drive.google.com/file/d/12qhiSNnzpQMBWNfNICU8jaiQcj1hVxbZ/view?usp=sharing
From the customer
We have upgraded our AMS version to 2.6.2 and also upgraded the resource limits.
CPU Utilization : 60% System memory : 60%
We observed that system memory keeps increasing over time. After around 24 hours it reaches 100% and AMS becomes unresponsive.
Is there any way to reduce the size of system memory? Or when will the system memory decrease or calm down?
Yesterday, when AMS crashed because of JVM heap memory, I collected the heap dump.
I have uploaded the heap dump to the following location. If possible, please check it for suspected memory leaks: https://drive.google.com/file/d/1SiUZcRGXPEjQ7WjCWIJ8aw8GvW77sdUk/view?usp=sharing
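For anyone who needs to capture a comparable heap dump on another node, here is a minimal sketch using the JDK's `jmap` tool. The issue does not say how the customer collected their dump, so this is just one common way to do it. The PID and output path are placeholders.

```python
import subprocess


def heap_dump_command(pid, output_path, live_only=True):
    """Build the jmap argument list for a binary heap dump of the given JVM process."""
    # "live" restricts the dump to reachable objects (it triggers a full GC first).
    options = "live," if live_only else ""
    return ["jmap", f"-dump:{options}format=b,file={output_path}", str(pid)]


def collect_heap_dump(pid, output_path):
    """Run jmap; needs a JDK on the PATH and the same user as the target JVM."""
    subprocess.run(heap_dump_command(pid, output_path), check=True)


# Placeholder PID and path; this only prints the command, it does not run it.
print(" ".join(heap_dump_command(1234, "/tmp/ams-heap.hprof")))
```

The resulting `.hprof` file can be opened in a heap analyser such as Eclipse MAT to look for leak suspects.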
Logs
Customer disabled logs due to disk usage causing performance issues