AirenSoft / OvenMediaEngine

OvenMediaEngine (OME) is a Sub-Second Latency Live Streaming Server with Large-Scale and High-Definition. #WebRTC #LLHLS
https://airensoft.com/ome.html
GNU Affero General Public License v3.0
2.55k stars 1.06k forks source link

Can't scale out origin over 64 rtmp streams #999

Closed oleglab closed 1 year ago

oleglab commented 1 year ago

PROBLEM

Origin performes very well until we injest ~64 RTMP streams, but as soon as it gets few more connections the OVT connections get dropped in cascade.

EXPECTATION

The CPU and memory are barely utilized because we don't use transcoding, so it would be great to be able to scale it out more than 64 streams.

ENVIRONMENTAL INFORMATION

The latest docker

Dockerfile:

ARG OME_VERSION=sha256:a765c0743cc3c00ded0acef1903bbcc8fec41564cb749b387c700ecf625a84e1

FROM airensoft/ovenmediaengine@${OME_VERSION}

Just 100 rtmp streams generated by ffmpeg with h264:

ffmpeg -nostdin -hide_banner -loglevel error -f lavfi -r 30 -re -t 60 -i testsrc=size=640x380:rate=30 -vcodec libx264 -profile:v baseline -pix_fmt yuv420p -force_key_frames 'expr:gte(t,n_forced*1)' -f flv rtmp://$ORIGIN_HOSTNAME/$STREAM > output-$STREAM.log 2>&1 < /dev/null &

Oven configuration:

    <!-- Settings for applications -->
    <Applications>
        <Application>
            <Name>dev</Name>
            <!-- Application type (live/vod) -->
            <Type>live</Type>
            <OutputProfiles>
                <OutputProfile>
                    <Name>bypass_stream</Name>
                    <OutputStreamName>${OriginStreamName}</OutputStreamName>
                    <Encodes>
                        <Video>
                            <Name>bypass_video</Name>
                            <Bypass>true</Bypass>
                        </Video>
                    </Encodes>
                </OutputProfile>
            </OutputProfiles>
            <Providers>
                <RTMP>
                    <BlockDuplicateStreamName>true</BlockDuplicateStreamName>
                </RTMP>
                <SRT>
                    <BlockDuplicateStreamName>true</BlockDuplicateStreamName>
                </SRT>
            </Providers>
            <Publishers>
                <AppWorkerCount>4</AppWorkerCount>
                <StreamWorkerCount>4</StreamWorkerCount>
                <OVT />
            </Publishers>
        </Application>
    </Applications>
    <OriginMapStore>
        <!-- In order to use OriginMap, you must enable OVT Publisher in Origin and OVT Provider in Edge. -->
        <RedisServer>
            <Host>${env:REDIS_HOST}:${env:DEV_REDIS_PORT}</Host>
            <Auth>${env:REDIS_SECRET}</Auth>
        </RedisServer>

        <!-- This is only needed for the origin server and used to register the ovt address of the stream.  -->
        <OriginHostName>${env:ORIGIN_HOSTNAME}</OriginHostName>
    </OriginMapStore>

no player is needed

Reproducible on Mac and Ubuntu with Docker above ^^^

SETUP INFORMATION AND LOGS

Please upload Server.xml. config.zip

[2023-01-12 00:16:18.620] W [OutboundWorker:31] MediaRouter | mediarouter_stream.cpp:1039 | [#dev#dev/streamXX] Detected out of order DTS of packet. track_id:0 dts:536201->536201 [2023-01-12 00:16:18.659] W [OutboundWorker:31] MediaRouter | mediarouter_stream.cpp:1039 | [#dev#dev/streamXX] Detected out of order DTS of packet. track_id:0 dts:536202->536201 [2023-01-12 00:16:51.457] I [SPRTMP-T1935:14] Provider | stream.cpp:49 | Unknown/(50) has been started stream [2023-01-12 00:16:51.458] I [SPRTMP-T1935:14] RTMPProvider | rtmp_provider.cpp:152 | A RTMP client has connected from <ClientSocket: 0x7f66d6fe5610, #50, Connected, TCP, Nonblocking, IP1:38518> [2023-01-12 00:16:51.722] I [SPRTMP-T1935:14] AccessController | access_controller.cpp:183 | AdmissionWebhooks queried http://{host}/v1/ac whether client 10.138.0.20:38518 could access rtmp://{host}:1935/dev/streamXX. (Result : Allow Elapsed : 8 ms) [2023-01-12 00:17:04.564] I [SPOvtPub-T9000:9] OVT | ovt_publisher.cpp:225 | OvtProvider is disconnected(1) : <ClientSocket: 0x7f66d9e2c610, #38, Closed, TCP, Nonblocking, IP2:26662> [2023-01-12 00:17:04.564] I [SPOvtPub-T9000:9] OVT | ovt_publisher.cpp:225 | OvtProvider is disconnected(1) : <ClientSocket: 0x7f66d9e2ec10, #41, Closed, TCP, Nonblocking, IP2:26698> [2023-01-12 00:17:05.565] I [SPOvtPub-T9000:9] OVT | ovt_publisher.cpp:225 | OvtProvider is disconnected(1) : <ClientSocket: 0x7f66d9e05810, #32, Closed, TCP, Nonblocking, IP2:10424> [2023-01-12 00:17:05.565] I [SPOvtPub-T9000:9] OVT | ovt_publisher.cpp:225 | OvtProvider is disconnected(1) : <ClientSocket: 0x7f66d9e2fc10, #36, Closed, TCP, Nonblocking, IP2:26642> [2023-01-12 00:17:05.565] I [SPOvtPub-T9000:9] OVT | ovt_publisher.cpp:225 | OvtProvider is disconnected(1) : <ClientSocket: 0x7f66d9e2e610, #35, Closed, TCP, Nonblocking, IP2:26628> [2023-01-12 00:17:05.565] I [SPOvtPub-T9000:9] OVT | ovt_publisher.cpp:225 | OvtProvider is disconnected(1) : <ClientSocket: 0x7f66d9e2d210, #34, Closed, TCP, Nonblocking, IP2:10432> [2023-01-12 00:17:05.565] I [SPOvtPub-T9000:9] OVT | ovt_publisher.cpp:225 | OvtProvider is disconnected(1) : <ClientSocket: 0x7f66d9e05610, #33, Closed, TCP, Nonblocking, IP2:10430> [2023-01-12 00:17:05.565] I [SPOvtPub-T9000:9] OVT | ovt_publisher.cpp:225 | OvtProvider is disconnected(1) : <ClientSocket: 0x7f66d9e2d610, #37, Closed, TCP, Nonblocking, IP2:26648> [2023-01-12 00:17:05.608] I [SPOvtPub-T9000:9] OVT | ovt_publisher.cpp:140 | OvtProvider is connected : <ClientSocket: 0x7f66d9e05610, #32, Connected, TCP, Nonblocking, IP2:24836> [2023-01-12 00:17:05.609] I [SPOvtPub-T9000:9] OVT | ovt_publisher.cpp:225 | OvtProvider is disconnected(1) : <ClientSocket: 0x7f66d9e06410, #39, Closed, TCP, Nonblocking, IP2:26678> [2023-01-12 00:17:05.609] I [SPOvtPub-T9000:9] OVT | ovt_publisher.cpp:225 | OvtProvider is disconnected(1) : <ClientSocket: 0x7f66d9e2be10, #40, Closed, TCP, Nonblocking, IP2:26688> [2023-01-12 00:17:05.610] I [SPOvtPub-T9000:9] OVT | ovt_publisher.cpp:140 | OvtProvider is connected : <ClientSocket: 0x7f66d9e06410, #33, Connected, TCP, Nonblocking, IP2:24840>

OTHER HELPFUL INFORMATION

Please write any information that anyone viewing this issue can reference.

getroot commented 1 year ago

Your ffmepg is encoding with x264. Is your hardware good enough for 100 encodings? Check your CPU stats. And is your network performance good enough?

And check if the kernel option is tuned. Check the socket buffer size and ulimit. https://airensoft.gitbook.io/ovenmediaengine/troubleshooting#5-2.-tuning-your-linux-kernel

oleglab commented 1 year ago

@getroot we could scale out to 300 ffmpeg instances, but without trying to watch the video. As soon as we are trying to pull RTSP stream or watch the streams, the performance get slumped and origin get stuck... And streams might be overridden or pushed out by new ones:

local-stream_1        | [2023-01-12 20:03:23.967] I [SPRTMP-T1935:60] MediaRouter | mediarouter_application.cpp:255  | [#default#push/load73(100)] Trying to create a stream
local-stream_1        | [2023-01-12 20:03:23.967] I [SPRTMP-T1935:60] MediaRouter | mediarouter_stream.cpp:54   | [#default#push/load73(100)] Trying to create a mediarouter stream
local-stream_1        | [2023-01-12 20:03:23.967] I [SPRTMP-T1935:60] Transcoder | transcoder_stream.cpp:626  | [#default#push/load73(100)] Output stream has been created. [#default#push/load73(588761627)]
local-stream_1        | [2023-01-12 20:03:23.967] I [SPRTMP-T1935:60] Transcoder | transcoder_stream.cpp:86   | No decoder generated
local-stream_1        | [2023-01-12 20:03:23.967] I [SPRTMP-T1935:60] MediaRouter | mediarouter_application.cpp:255  | [#default#push/load73(588761627)] Trying to create a stream
local-stream_1        | [2023-01-12 20:03:23.967] I [SPRTMP-T1935:60] MediaRouter | mediarouter_stream.cpp:54   | [#default#push/load73(588761627)] Trying to create a mediarouter stream
local-stream_1        | [2023-01-12 20:03:23.967] I [SPRTMP-T1935:60] Monitor | application_metrics.cpp:57   | Create StreamMetrics(load73/80d32c8a-d2d7-4807-b8bc-7146afb6f075/default/#default#push/load73/o) for monitoring
local-stream_1        | [2023-01-12 20:03:23.967] I [SPRTMP-T1935:60] Transcoder | transcoder_stream.cpp:92   | [#default#push/load73(100)] Transcoder stream has been started.
local-stream_1        | [2023-01-12 20:03:23.967] I [SPRTMP-T1935:60] Transcoder | transcoder_stream.cpp:1371 | [#default#push/load73(100)] -> [#default#push/load73(588761627)] Transcoder output stream has been deleted.
local-stream_1        | [2023-01-12 20:03:23.967] I [SPRTMP-T1935:60] MediaRouter | mediarouter_application.cpp:480  | [#default#push/load73(588761627)] Trying to delete a stream
local-stream_1        | [2023-01-12 20:03:23.967] I [SPRTMP-T1935:60] Monitor | application_metrics.cpp:69   | Delete StreamMetrics(load73/80d32c8a-d2d7-4807-b8bc-7146afb6f075/default/#default#push/load73/o) for monitoring
local-stream_1        | [2023-01-12 20:03:23.967] I [SPRTMP-T1935:60] Monitor | stream_metrics.cpp:31   | 
local-stream_1        | [Stream Info]
local-stream_1        | id(588761627), msid(0), output(load73), SourceType(Transcoder), RepresentationType(Source), Created Time (Thu Jan 12 20:03:23 2023) UUID(80d32c8a-d2d7-4807-b8bc-7146afb6f075/default/#default#push/load73/o)
local-stream_1        |     >> Origin Stream Info
local-stream_1        |     id(100), output(load73), SourceType(Rtmp), Created Time (Thu Jan 12 20:02:59 2023)
local-stream_1        | 
local-stream_1        |     Video Track #0: Name(bypass_video) Bitrate(0b) Codec(1,H264,Passthrough) BSF(H264_AVCC) Resolution(640x380) Framerate(30.00fps) KeyInterval(0) BFrames(0) timebase(1/1000)
local-stream_1        |     Data  Track #1: Name(E9bZFTWC) Codec(0,Unknown,Passthrough) BSF(Unknown) timebase(1/1000)
local-stream_1        | 
local-stream_1        |     >> Statistics
local-stream_1        |     Last update time : Thu Jan 12 20:03:23 2023, Last sent time : Thu Jan 12 20:03:23 2023
local-stream_1        |     Bytes in : 0B, Bytes out : 0B, Concurrent connections : 0, Max connections : 0 (Thu Jan 12 20:03:23 2023)
local-stream_1        | 
local-stream_1        |         >>>> By publisher
local-stream_1        |         - Unknown : Bytes out(0B) Concurrent Connections (0)
local-stream_1        |         - WebRTC : Bytes out(0B) Concurrent Connections (0)
local-stream_1        |         - MPEGTSPush : Bytes out(0B) Concurrent Connections (0)
local-stream_1        |         - RTMPPush : Bytes out(0B) Concurrent Connections (0)
local-stream_1        |         - HLS : Bytes out(0B) Concurrent Connections (0)
local-stream_1        |         - DASH : Bytes out(0B) Concurrent Connections (0)
local-stream_1        |         - LLDASH : Bytes out(0B) Concurrent Connections (0)
local-stream_1        |         - LLHLS : Bytes out(0B) Concurrent Connections (0)
local-stream_1        |         - OVT : Bytes out(0B) Concurrent Connections (0)
local-stream_1        |         - File : Bytes out(0B) Concurrent Connections (0)
local-stream_1        |         - Thumbnail : Bytes out(0B) Concurrent Connections (0)
local-stream_1        | 
local-stream_1        | [2023-01-12 20:03:23.967] I [SPRTMP-T1935:60] MediaRouter | mediarouter_stream.cpp:72   | [#default#push/load73(588761627)] Delete mediarouter stream 
local-stream_1        | [2023-01-12 20:03:23.967] I [SPRTMP-T1935:60] Transcoder | transcoder_stream.cpp:106  | [#default#push/load73(100)] Transcoder stream has been stopped.
local-stream_1        | [2023-01-12 20:03:23.967] I [SPRTMP-T1935:60] MediaRouter | mediarouter_stream.cpp:72   | [#default#push/load73(100)] Delete mediarouter stream 
local-stream_1        | [2023-01-12 20:03:23.968] W [InboundWorker:64] MediaRouter | mediarouter_stream.cpp:799  | [#default#push/ca315(100)] b-frame has been detected in the 0 track of inbound stream
local-stream_1        | [2023-01-12 20:03:23.968] W [InboundWorker:64] MediaRouter | mediarouter_stream.cpp:799  | [#default#push/ca315(100)] b-frame has been detected in the 0 track of inbound stream
local-stream_1        | [2023-01-12 20:03:23.968] W [InboundWorker:64] MediaRouter | mediarouter_stream.cpp:799  | [#default#push/ca315(100)] b-frame has been detected in the 0 track of inbound stream
local-stream_1        | [2023-01-12 20:03:23.968] W [OutboundWorker:66] MediaRouter | mediarouter_stream.cpp:1039 | [#default#push/ca315] Detected out of order DTS of packet. track_id:0 dts:7621553->0
local-stream_1        | [2023-01-12 20:03:23.968] W [SPRTMP-T1935:60] ov.Queue | queue.h:268  | [0x7f20b2aab058] #default#push/ca315-MR-Inbound size has exceeded the threshold: queue: 100, threshold: 100, peak: 100
local-stream_1        | [2023-01-12 20:03:23.968] W [OutboundWorker:66] MediaRouter | mediarouter_stream.cpp:1039 | [#default#push/ca315] Detected out of order DTS of packet. track_id:0 dts:7621554->33
local-stream_1        | [2023-01-12 20:03:23.968] W [OutboundWorker:66] MediaRouter | mediarouter_stream.cpp:1039 | [#default#push/ca315] Detected out of order DTS of packet. track_id:0 dts:7621555->67
local-stream_1        | [2023-01-12 20:03:23.968] W [InboundWorker:64] MediaRouter | mediarouter_stream.cpp:799  | [#default#push/ca315(100)] b-frame has been detected in the 0 track of inbound stream
local-stream_1        | [2023-01-12 20:03:23.968] W [InboundWorker:64] MediaRouter | mediarouter_stream.cpp:799  | [#default#push/ca315(100)] b-frame has been detected in the 0 track of inbound stream
local-stream_1        | [2023-01-12 20:03:23.968] W [OutboundWorker:66] MediaRouter | mediarouter_stream.cpp:1039 | [#default#push/ca315] Detected out of order DTS of packet. track_id:0 dts:7621556->100
local-stream_1        | [2023-01-12 20:03:23.968] W [OutboundWorker:66] MediaRouter | mediarouter_stream.cpp:1039 | [#default#push/ca315] Detected out of order DTS of packet. track_id:0 dts:7621557->133
local-stream_1        | [2023-01-12 20:03:23.968] W [InboundWorker:64] MediaRouter | mediarouter_stream.cpp:799  | [#default#push/ca315(100)] b-frame has been detected in the 0 track of inbound stream
local-stream_1        | [2023-01-12 20:03:23.968] W [InboundWorker:64] MediaRouter | mediarouter_stream.cpp:799  | [#default#push/ca315(100)] b-frame has been detected in the 0 track of inbound stream
local-stream_1        | [2023-01-12 20:03:23.968] W [InboundWorker:64] MediaRouter | mediarouter_stream.cpp:799  | [#default#push/ca315(100)] b-frame has been detected in the 0 track of inbound stream
local-stream_1        | [2023-01-12 20:03:23.968] W [InboundWorker:64] MediaRouter | mediarouter_stream.cpp:799  | [#default#push/ca315(100)] b-frame has been detected in the 0 track of inbound stream
local-stream_1        | [2023-01-12 20:03:23.968] W [OutboundWorker:66] MediaRouter | mediarouter_stream.cpp:1039 | [#default#push/ca315] Detected out of order DTS of packet. track_id:0 dts:7621558->167
local-stream_1        | [2023-01-12 20:03:23.968] W [OutboundWorker:66] MediaRouter | mediarouter_stream.cpp:1039 | [#default#push/ca315] Detected out of order DTS of packet. track_id:0 dts:7621559->200
local-stream_1        | [2023-01-12 20:03:23.968] W [OutboundWorker:66] MediaRouter | mediarouter_stream.cpp:1039 | [#default#push/ca315] Detected out of order DTS of packet. track_id:0 dts:7621560->233
local-stream_1        | [2023-01-12 20:03:23.968] W [OutboundWorker:66] MediaRouter | mediarouter_stream.cpp:1039 | [#default#push/ca315] Detected out of order DTS of packet. track_id:0 dts:7621561->267
local-stream_1        | [2023-01-12 20:03:23.968] W [InboundWorker:64] MediaRouter | mediarouter_stream.cpp:799  | [#default#push/ca315(100)] b-frame has been detected in the 0 track of inbound stream
getroot commented 1 year ago

Have you tuned the network kernel values? Please check this first. And it's hard to analyze the problem with just that part of the log. I often test 200+ rtmp inputs and I haven't encountered the same problem yet. When you encode with ffmpeg, if you set the bitrate to cbr 1 meg, does the result change? And what about the cpu usage per thread of the ome process when the problem occurs?

oleglab commented 1 year ago

@getroot I tried so many different cases, with default kernel, and with setup suggested in this bug. I don't think that traffic is an issue, 100 ffmpeg doesn't create a lot of load actually. We modified our test case to play pre-recorded video instead of encoding it with test stream.

We were able to feed 600 rtmps without traffic from the edges, but the problems start with 64+ rtmp streams:

1) streams get pushed out by new ones. 2) with traffic from edges origin gets stuck. I think access control hook has to be enabled to reproduce it.

I am going to send you a video with our steps and setup to reproduce the issue in email.

oleglab commented 1 year ago

Here is the kernel settings that we apply. The only difference is net.ipv4.tcp_fin_timeout=30 to be safe with GCP NAT:

sysctl -w \
 net.core.somaxconn=65535 \
 net.ipv4.tcp_congestion_control=bbr \
 net.ipv4.tcp_max_tw_buckets=1440000 \
 net.ipv4.ip_local_port_range="1024 65000" \
 net.ipv4.tcp_window_scaling=1 \
 net.ipv4.tcp_max_syn_backlog=3240000 \
 net.core.rmem_max=16777216 \
 net.core.wmem_max=16777216 \
 net.core.rmem_default=16777216 \
 net.core.wmem_default=16777216 \
 net.core.optmem_max=40960 \
 net.ipv4.tcp_rmem="4096 87380 16777216" \
 net.ipv4.tcp_wmem="4096 65536 16777216" \
 net.core.netdev_max_backlog=50000 \
 net.ipv4.tcp_max_syn_backlog=30000 \
 net.ipv4.tcp_max_tw_buckets=2000000 \
 net.ipv4.tcp_tw_reuse=1 \
 net.ipv4.tcp_fin_timeout=30 \
 net.ipv4.tcp_slow_start_after_idle=0 \
 net.ipv4.tcp_keepalive_time=7200 \
 net.ipv4.tcp_keepalive_intvl=75 \
 net.ipv4.tcp_keepalive_probes=9
getroot commented 1 year ago

Thanks for the detailed report. I'll look into this.

getroot commented 1 year ago

I saw the video and configuration file in the email you sent. I'm a bit confused on how to reproduce it because the contents of this article and the email are different.

Could you explain the repro path again step by step? Your email is below.

  1. Set one OvenMediaEngine

    Enable RTMP Provider and RTSP Pull Provider BlockDuplicateStreamName of RTMP is set to false Enable RTMP Push Publisher and WebRTC Publisher Disable SingedPolicy and AdmissionWebhooks

  2. Play one stream through RTSP Pull (stream name is /push/rtsp)

    Origin is rtsp://xxx:8554/dev-xxxxx (not OvenMediaEngine server)

  3. Input RTMP streams to the OME using FFMPEG (stream name is /push/load$1)

  4. /push/rtsp stream is overwritten by one stream from /push/load$1

getroot commented 1 year ago

And the log in the video is a bit incomprehensible. The player is trying to play /push/rtsp, but the log doesn't print anything about it, only a trail of trying and failing to play /push/load500. Is ws://localhost:3333/push/rtsp making a request to another OvenMediaEngine server that isn't getting recorded video? Is it composed of Origin-Edge, and only Origin appears in the video, and Edge has a different configuration? So, is the Server.xml you sent a configuration for Origin? Please clear up my confusion.

oleglab commented 1 year ago

@getroot The idea was to provide you the simplest configuration which helps to reproduce the issue with 64+ streams, as you can see on the video, the actual number somewhere between 70 and 75.

1) Set one OvenMediaEngine:

Enable RTMP Provider and RTSP Pull Provider BlockDuplicateStreamName of RTMP is set to false Enable RTMP Push Publisher and WebRTC Publisher Disable SingedPolicy and AdmissionWebhooks

2) Open the pull stream in demo player and keep it running. Stream name /push/rtsp

3) Injest 64-70 RTMP streams to the OME using FFMPEG into the same application (stream name is /push/load$1)

4) Verify that /push/rtsp is still running

3) Injest 10 more RTMP streams to have 75 total

4) Verify that /push/rtsp was pushed out by rtmp/ffmpeg stream.

The same problem we experience in our cluster with multiple origins and edges: Origin performs very well while we start/injest ~64 RTMP streams, but as soon as it gets few more streams the OVT connections get dropped in cascade, origin get stuck and/or streams get mixed/pushed out.

oleglab commented 1 year ago

Reproducible with both dev and latest docker containers

Step 1:

docker run -v $PWD/test-pull/Server.xml:/opt/ovenmediaengine/bin/origin_conf/Server.xml -p 1935:1935 -p 3333:3333 -p 3478:3478 -p 10006-10010:10006-10010 airensoft/ovenmediaengine:latest

Step 2: Verify that stream is running (Chrome preferably)

Step 3:

for i in {1..70}; do ./ff.sh $i; done

Step 4 (Repeat step 2): Verify that stream is still running

Step 5:

for i in {70..75}; do ./ff.sh $i; done

Step 6: Verify that stream is different

getroot commented 1 year ago

Thanks for the detailed reproduction path. I'll start this as soon as the task I'm doing now is finished.

getroot commented 1 year ago

Thanks for reporting the bug. I patched a problem where stream ids generated by different types of providers could be duplicated and committed it to the master branch. Please check if the image airensoft/ovenmediaengine:dev solves your problem.

oleglab commented 1 year ago

Hi @getroot, thanks for the update. The dev build looks good! We couldn't reproduce the issue!