AirenSoft / OvenMediaEngine

OvenMediaEngine (OME) is a Sub-Second Latency Live Streaming Server with Large-Scale and High-Definition. #WebRTC #LLHLS
https://airensoft.com/ome.html
GNU Affero General Public License v3.0

pushes fail when one of the push destination times out #1573

Open getroot opened 7 months ago

getroot commented 7 months ago

Discussed in https://github.com/AirenSoft/OvenMediaEngine/discussions/1572

Originally posted by **vampirefrog** March 31, 2024

I was streaming to about 12 pushes when one of them went down (I mean the target site to which I was streaming, which is an OSP instance), and then all the other pushes started going down. Below you can see all I could find in the log. Is there some kind of timeout setting, or a setting to not take the other pushes down when one is stuck? Or is this a bug in OME? It seems to be similar to this: https://github.com/AirenSoft/OvenMediaEngine/issues/819

```
Mar 31 00:37:05 server3 OvenMediaEngine[2634930]: [2024-03-31 00:37:05.656] W [AW-RTMPPush0:2634947] ManagedQueue | managed_queue.h:313 | [114] mngq:v=#default#live:s=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:p=pub:n=streamworker_rtmppush size has exceeded the threshold: queue: 17430, threshold: 500, peak: 17430
Mar 31 00:37:10 server3 OvenMediaEngine[2634930]: [2024-03-31 00:37:10.668] W [AW-RTMPPush0:2634947] ManagedQueue | managed_queue.h:313 | [114] mngq:v=#default#live:s=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:p=pub:n=streamworker_rtmppush size has exceeded the threshold: queue: 17815, threshold: 500, peak: 17815
[... the same warning repeats roughly every 5 seconds while the queue grows steadily: 18201, 18586, 18971, 19358, 19746, 20133, 20518, 20903, 21289, 21674, 22060, 22446, 22830, 23216, 23604, 23990, 24375, 24761, 25146, 25531, 25918 ...]
Mar 31 00:39:01 server3 OvenMediaEngine[2634930]: [2024-03-31 00:39:01.090] W [AW-RTMPPush0:2634947] ManagedQueue | managed_queue.h:313 | [114] mngq:v=#default#live:s=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:p=pub:n=streamworker_rtmppush size has exceeded the threshold: queue: 26304, threshold: 500, peak: 26304
Mar 31 00:39:07 server3 systemd[1]: ovenmediaengine.service: A process of this unit has been killed by the OOM killer.
```
vampirefrog commented 7 months ago

The destination (OSP) had an internet hiccup, and connections would stall. So I'm assuming that the output queue in OME just got bigger and bigger until it crashed. Perhaps if the queue grows too big, OME should consider the connection ended and try to reconnect. From the logs it looks like it only warns when the threshold is reached but takes no action. Perhaps there could be a warning threshold and a kill threshold. I don't know if it already does this, but that's my two cents.
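To make that suggestion concrete, here is a minimal Python sketch of the two-threshold idea (a warn level and a kill level). This is purely illustrative; the class, names, and values are made up, and it is not how OME's ManagedQueue is actually implemented:

```python
import collections
import logging

class PushQueue:
    """Toy outbound push queue with a warning threshold and a kill threshold."""

    def __init__(self, warn_threshold=500, kill_threshold=5000):
        self._queue = collections.deque()
        self.warn_threshold = warn_threshold
        self.kill_threshold = kill_threshold  # hypothetical upper bound
        self.alive = True

    def enqueue(self, packet) -> bool:
        if not self.alive:
            return False
        self._queue.append(packet)
        size = len(self._queue)
        if size >= self.kill_threshold:
            # The destination is presumed dead: drop the session (and let the
            # caller reconnect) instead of buffering until the OOM killer acts.
            logging.error("queue %d >= kill threshold %d, closing push",
                          size, self.kill_threshold)
            self._queue.clear()
            self.alive = False
            return False
        if size >= self.warn_threshold:
            logging.warning("queue %d exceeded threshold %d",
                            size, self.warn_threshold)
        return True
```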

irlkitcom commented 7 months ago

Hopefully this is resolved soon; this is one of the most important features to us.

irlkitcom commented 7 months ago

Can you post your Server.xml? I'm curious to see your settings.

vampirefrog commented 7 months ago

I use this as a template on multiple servers:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<Server version="8">
    <Name>OvenMediaEngine</Name>
    <Type>origin</Type>
    <IP>*</IP>
    <PrivacyProtection>false</PrivacyProtection>
    <Modules>
        <HTTP2><Enable>true</Enable></HTTP2>
        <LLHLS><Enable>true</Enable></LLHLS>
    </Modules>
    <Bind>
        <Managers>
            <API>
                <Port>60080</Port>
                <TLSPort>60443</TLSPort>
                <WorkerCount>1</WorkerCount>
            </API>
        </Managers>
        <Providers>
            <RTMP>
                <Port>61935</Port>
                <WorkerCount>1</WorkerCount>
            </RTMP>
            <SRT>
                <Port>60999</Port>
                <WorkerCount>1</WorkerCount>
            </SRT>
        </Providers>
        <Publishers>
            <LLHLS>
                <Port>60080</Port>
                <TLSPort>60443</TLSPort>
                <WorkerCount>1</WorkerCount>
            </LLHLS>
        </Publishers>
    </Bind>
    <Managers>
        <Host>
            <Names>
                <Name>localhost:60080</Name>
            </Names>
            <TLS>
                <CertPath>cert.pem</CertPath>
                <KeyPath>key.pem</KeyPath>
                <ChainCertPath>cert.pem</ChainCertPath>
            </TLS>
        </Host>
        <API>
            <AccessToken>poopy</AccessToken>
            <CrossDomains><Url>*</Url></CrossDomains>
        </API>
    </Managers>
    <VirtualHosts>
        <VirtualHost>
            <Name>default</Name>
            <Distribution>LiveJoiner</Distribution>
            <Host>
                <Names><Name>*</Name></Names>
                <TLS>
                    <CertPath>cert.pem</CertPath>
                    <KeyPath>key.pem</KeyPath>
                    <ChainCertPath>cert.pem</ChainCertPath>
                </TLS>
            </Host>
            <Applications>
                <Application>
                    <Name>live</Name>
                    <Type>live</Type>
                    <Providers><RTMP/><SRT/></Providers>
                    <Publishers>
                        <RTMPPush></RTMPPush>
                        <LLHLS>
                            <CrossDomains><Url>*</Url></CrossDomains>
                        </LLHLS>
                    </Publishers>
                    <OutputProfiles>
                        <OutputProfile>
                            <Name>bypass_stream</Name>
                            <OutputStreamName>${OriginStreamName}</OutputStreamName>
                            <Encodes>
                                <Audio>
                                    <Name>bypass_audio</Name>
                                    <Bypass>true</Bypass>
                                </Audio>
                                <Video>
                                    <Name>bypass_video</Name>
                                    <Bypass>true</Bypass>
                                </Video>
                            </Encodes>
                        </OutputProfile>
                    </OutputProfiles>
                </Application>
            </Applications>
            <AdmissionWebhooks>
                <ControlServerUrl>http://localhost:9999/webhook/25</ControlServerUrl>
                <SecretKey>asdf</SecretKey>
                <Timeout>3000</Timeout>
                <Enables>
                    <Providers>rtmp,srt</Providers>
                    <Publishers></Publishers>
                </Enables>
            </AdmissionWebhooks>
            <CrossDomains>
                <Url>*</Url>
            </CrossDomains>
        </VirtualHost>
    </VirtualHosts>
</Server>
```
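For context, pushes against an application configured like this are created at runtime through OME's REST API using the `:startPush` action. Below is a minimal sketch of such a call in Python, reusing the API port (60080) and `<AccessToken>` from the Server.xml above; the destination URL, stream key, and function name are hypothetical, and the payload fields should be double-checked against the API reference for your OME version:

```python
import base64
import requests

API_BASE = "http://localhost:60080/v1"  # API <Port> from the Server.xml above
ACCESS_TOKEN = "poopy"                  # <AccessToken> from the Server.xml above

def start_push(vhost, app, stream, push_id, url, stream_key):
    """Ask OME to start an RTMP push for an existing stream."""
    token = base64.b64encode(ACCESS_TOKEN.encode()).decode()
    resp = requests.post(
        f"{API_BASE}/vhosts/{vhost}/apps/{app}:startPush",
        headers={"Authorization": f"Basic {token}"},
        json={
            "id": push_id,
            "stream": {"name": stream},
            "protocol": "rtmp",
            "url": url,               # hypothetical destination
            "streamKey": stream_key,  # hypothetical key
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# Example (hypothetical destination):
# start_push("default", "live", "mystream", "push-01",
#            "rtmp://osp.example.org/live", "secret")
```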
stale[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Keukhan commented 5 months ago

@vampirefrog

Thank you for reporting this. I think I know the cause of the issue you described. I'll add it to my improvement plan and contact you when it is completed.

vampirefrog commented 5 months ago

Thanks, buddy. I should also mention that the server is a $4 DigitalOcean droplet in SFO3, so you can probably test that way if it helps.

vampirefrog commented 5 months ago

Also, there is another issue here: the log doesn't tell you which push is experiencing the problem, it just says that it's one of the pushes. Could you add the push ID to the log message?

In fact, for me, the server admin, this would be very useful.

stale[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

vampirefrog commented 3 months ago

Y'all still working on this? I've found another issue where a long-running server fills up memory and will not send some pushes, although they appear in the push list, and I'm not sure whether it's the same issue or not. After a restart it works.

Keukhan commented 2 months ago

@vampirefrog

I found a hang in a specific Push session and applied a Timeout to avoid it. I've patched the master branch. Please test it when you have time to see if the issue is fixed.

https://github.com/AirenSoft/OvenMediaEngine/commit/b53080a50fb9e877ee2cd30d26dd706c202d1264
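As a rough illustration of the pattern that commit describes (not the actual C++ change), a blocking send wrapped in a timeout that tears the session down instead of stalling its worker might look like this; the function name and timeout value are hypothetical:

```python
import socket

SEND_TIMEOUT_SEC = 10  # hypothetical value; see the commit for the real one

def send_or_close(sock: socket.socket, payload: bytes) -> bool:
    """Send with a timeout; on timeout, close the socket so the session can
    be reaped and retried instead of blocking its worker thread forever."""
    sock.settimeout(SEND_TIMEOUT_SEC)
    try:
        sock.sendall(payload)
        return True
    except socket.timeout:
        # A stalled destination no longer blocks the worker that also
        # services the other push sessions.
        sock.close()
        return False
```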

stale[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

vampirefrog commented 3 weeks ago

Still testing this. It does seem to be working better lately, but it's a long-term test anyway.