ant-media / Ant-Media-Server

Ant Media Server is a live streaming engine software that provides adaptive, ultra low latency streaming by using WebRTC technology with ~0.5 seconds latency. Ant Media Server is auto-scalable and it can run on-premise or on-cloud.
https://antmedia.io

Auto-scaling results in nodes reaching 100% JVM heap memory resource and becoming unresponsive [PoC] #5446

Open timantmedia opened 1 year ago

timantmedia commented 1 year ago

Short description

JVM heap memory always climbs to 100% utilisation, after which the instance crashes and becomes inaccessible.

What version of Ant Media Server are you using?

Enterprise Edition 2.6.0 20230517_0442

How many broadcasts/viewers are you handling with your server?

2 broadcasts and 6000 viewers

What are your server specifications?

c5.9xlarge

How are you publishing live streams e.g RTMP, SRT, WebRTC and what encoder are you using?

WebRTC

It's worth noting that the customer is also experiencing another, possibly related, issue: when more than 9 streams are published on a c5.24xlarge, latency increases significantly.
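To document the heap growth described above before a node becomes unresponsive, heap usage can be sampled from inside the JVM. The following is a minimal sketch, assuming only the standard `java.lang.management` API. The class name `HeapSampler`, the 90% threshold and the one-second interval are hypothetical choices for illustration and are not part of Ant Media Server.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Ant Media Server.
public class HeapSampler {

    // Used heap as a fraction of the maximum heap.
    // Returns -1 when the maximum is undefined (getMax() may report -1).
    public static double heapUtilisation(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0;
        }
        return (double) usedBytes / maxBytes;
    }

    // True when utilisation has reached the (assumed) alert threshold.
    public static boolean isCritical(double utilisation, double threshold) {
        return utilisation >= threshold;
    }

    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        double threshold = 0.90; // assumed alert level

        // Take five samples, one second apart, and print each one.
        for (int i = 0; i < 5; i++) {
            MemoryUsage heap = memory.getHeapMemoryUsage();
            double utilisation = heapUtilisation(heap.getUsed(), heap.getMax());
            System.out.printf("heap used=%d MiB, max=%d MiB, utilisation=%.1f%%%s%n",
                    heap.getUsed() / (1024 * 1024),
                    heap.getMax() / (1024 * 1024),
                    utilisation * 100,
                    isCritical(utilisation, threshold) ? " CRITICAL" : "");
            Thread.sleep(1000);
        }
    }
}
```

As written, this reports on the JVM that runs it. To observe a server node, the same readings would have to be taken inside that process or over a remote JMX connection, which is outside the scope of this sketch.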

mekya commented 1 year ago

Hi @timantmedia,

I think it's expected behaviour to run into problems on a c5.9xlarge with 6000 viewers.

We provide a solution to the user with correct deployment options. I've put this into the Next Sprint and increased the priority.

timantmedia commented 1 year ago

@mekya Ok thank you.

timantmedia commented 1 year ago

@mekya @muratugureminoglu Could you confirm what's happening with this? Is the other JVM heap memory issue being investigated as the same issue? And where is the solution referred to in "We provide a solution to the user with correct deployment options"?

timantmedia commented 1 year ago

@muratugureminoglu @mekya is there any update on this issue? I am not sure what the outcome was.

mekya commented 1 year ago

This is already in the Icebox stage, which means no updates are expected. I think there is some confusion among the issues you are following up on.

timantmedia commented 1 year ago

@mekya apologies, I think I am confused about what the updates on this issue were. The previous response was "We provide a solution to the user with correct deployment options. I've put this into the Next Sprint and increased the priority." Does this mean we have already resolved this issue?

mekya commented 1 year ago

> @mekya apologies, I think I am confused about what the updates on this issue were. The previous response was "We provide a solution to the user with correct deployment options. I've put this into the Next Sprint and increased the priority." Does this mean we have already resolved this issue?

No. After the message above, I set the status to Icebox. As far as I remember, the issue was not reproducible in our test with the user.

Additionally, we've provided solutions to other memory leak issues for users.

So I think we can keep this in Icebox or even close it.