AirenSoft / OvenMediaEngine

OvenMediaEngine (OME) is a Sub-Second Latency Live Streaming Server with Large-Scale and High-Definition. #WebRTC #LLHLS
https://airensoft.com/ome.html
GNU Affero General Public License v3.0

Edge server can permanently lose connectivity to Origin after Origin server restart #1530

Closed: d-uzlov closed this issue 1 month ago

d-uzlov commented 9 months ago

Describe the bug

The Origin server doesn't verify that its entry in OriginMapStore stays valid.

When I restart the OME Origin server, the Edge sometimes fails to reconnect to the new Origin server, and it doesn't reconnect even if I wait several minutes. I tracked it down to the fact that Redis no longer contains an entry for the stream, and I reproduced the problem without an Edge server.

To Reproduce

  1. Set up Redis
  2. Enable OriginMapStore in the Origin server config:
    <Server version="8">
        <VirtualHosts>
            <VirtualHost>
                <OriginMapStore>
                    <RedisServer>
                        <Host>${env:REDIS_SERVICE:redis}:${env:REDIS_PORT:6379}</Host>
                        <Auth>${env:REDIS_PASS}</Auth>
                    </RedisServer>
                    <OriginHostName>${env:OME_HOST_IP}</OriginHostName>
                </OriginMapStore>
            </VirtualHost>
        </VirtualHosts>
    </Server>
  3. Start the OME Origin server
  4. Start the stream in OBS with automatic reconnect
  5. Restart the OME Origin server (it may matter that I kill the Docker container forcefully after a 1-second timeout, since OME doesn't seem to respond to SIGTERM)
  6. Check the list of Redis keys: redis-cli -a "$redis_pass" keys "*"

You may need to try the restart several times; for me it reproduces on roughly one in three restarts. You may also need to check Redis several times, since in my case there was some delay between the restart and the Redis entry disappearing (a polling sketch follows below).
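A minimal polling sketch for step 6, assuming redis-cli can reach the server and $redis_pass holds the password (--no-auth-warning suppresses the CLI's password warning):

    # Poll Redis once per second to catch the moment the stream entry disappears.
    while true; do
        printf '%s  ' "$(date +%T)"
        redis-cli -a "$redis_pass" --no-auth-warning keys '*'
        sleep 1
    done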

Expected behavior

The Origin server updates OriginMapStore with a valid entry for the re-connected stream.

Actual behavior

The Redis entry somehow gets deleted, either by Redis itself or by the Origin server.

It seems the Origin server tries to register the stream in OriginMapStore only once, when the stream starts, and doesn't update or maintain the entry while the stream is running.
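One way to check this from the outside is to watch the Redis command stream; a sketch, assuming the OriginMapStore entry is keyed by the app/stream name (tc/danil in my setup):

    # MONITOR prints every command the server receives; grep for the key to
    # see whether the Origin ever refreshes it. A single SET at stream start
    # and nothing afterwards would mean the entry is registered once and
    # never maintained.
    redis-cli -a "$redis_pass" --no-auth-warning monitor | grep 'tc/danil'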

Logs

[2024-02-20 04:21:23.451] I [OutboundWorker:32] MediaRouter | mediarouter_application.cpp:477  | [#default#tc/danil(1818817424)] Stream has been prepared 
[Stream Info]
id(1818817424), msid(0), output(danil), SourceType(Transcoder), RepresentationType(Source), Created Time (Tue Feb 20 04:21:22 2024) UUID(72382f59-0bdf-49ae-97d6-e7c401cf3943/default/#default#tc/danil/o)
    >> Origin Stream Info
    id(100), output(danil), SourceType(Rtmp), Created Time (Tue Feb 20 04:21:22 2024)

    Video Track #0: Public Name(Video_0) Variant Name(bypass_video) Bitrate(25.00Mb) Codec(1,H264,Passthrough:0) BSF(AVCC) Resolution(2560x1440) Framerate(60.00) KeyInterval(60/frame) SkipFrames(0) BFrames(0) timebase(1/1000)
    Video Track #1: Public Name(Video_0) Variant Name(720p) Bitrate(2.50Mb) Codec(1,H264,default:0) BSF(H264_ANNEXB) Resolution(1280x720) Framerate(60.00) KeyInterval(60/frame) SkipFrames(0) BFrames(0) timebase(1/90000)
    Video Track #2: Public Name(Video_0) Variant Name(1080p) Bitrate(8.00Mb) Codec(1,H264,default:0) BSF(H264_ANNEXB) Resolution(1920x1080) Framerate(60.00) KeyInterval(60/frame) SkipFrames(0) BFrames(0) timebase(1/90000)
    Video Track #3: Public Name(Video_0) Variant Name(image_0) Bitrate(1.00Mb) Codec(9,JPEG,default:0) BSF(JPEG) Resolution(854x480) Framerate(1.00) KeyInterval(0/frame) SkipFrames(0) BFrames(0) timebase(1/90000)
    Audio Track #4: Public Name(Audio_1) Variant Name(aac) Bitrate(128.00Kb) Codec(6,AAC,Passthrough) BSF(AAC_RAW) Samplerate(48.0K) Format(fltp, 32) Channel(stereo, 2) timebase(1/1000)
    Audio Track #5: Public Name(Audio_1) Variant Name(opus) Bitrate(128.00Kb) Codec(8,OPUS,default) BSF(OPUS) Samplerate(48.0K) Format(s16, 16) Channel(stereo, 2) timebase(1/48000)
    Data  Track #6: Public Name(Data_2) Variant Name(Data) Codec(0,Unknown,Passthrough) BSF(ID3v2) timebase(1/1000)
[2024-02-20 04:21:23.452] I [OutboundWorker:32] WebRTC Publisher | rtc_stream.cpp:163  | RtcStream(#default#tc/danil) - Ignore unsupported codec(JPEG)
[2024-02-20 04:21:23.452] I [OutboundWorker:32] WebRTC Publisher | rtc_stream.cpp:163  | RtcStream(#default#tc/danil) - Ignore unsupported codec(AAC)
[2024-02-20 04:21:23.452] I [OutboundWorker:32] WebRTC Publisher | rtc_stream.cpp:200  | WebRTC Stream has been created : danil/1818817424
Rtx(false) Ulpfec(false) JitterBuffer(false) PlayoutDelay(false min:0 max: 0)
[2024-02-20 04:21:23.452] I [OutboundWorker:32] Publisher | stream.cpp:212  | WebRTC Publisher Application application has started [danil(1818817424)] stream (MSID : 0)
[2024-02-20 04:21:23.452] I [OutboundWorker:32] LLHLS Publisher | llhls_stream.cpp:118  | LLHlsStream(#default#tc/danil) - Ignore unsupported codec(JPEG)
[2024-02-20 04:21:23.452] I [OutboundWorker:32] LLHLS Publisher | llhls_stream.cpp:118  | LLHlsStream(#default#tc/danil) - Ignore unsupported codec(OPUS)
[2024-02-20 04:21:23.453] I [OutboundWorker:32] LLHLS Publisher | llhls_stream.cpp:213  | LLHlsStream has been created : danil/1818817424
OriginMode(true) Chunk Duration(0.50) Segment Duration(6) Segment Count(10) DRM(none)
[2024-02-20 04:21:23.453] I [OutboundWorker:32] Publisher | stream.cpp:212  | LLHLS Publisher Application application has started [danil(1818817424)] stream (MSID : 0)
[2024-02-20 04:21:23.455] E [OutboundWorker:32] OriginMapClient | origin_map_client.cpp:71   | <tc/danil> stream is already registered.
[2024-02-20 04:21:23.455] W [OutboundWorker:32] OVT | ovt_stream.cpp:46   | Failed to register stream to origin map store : #default#tc/danil
[2024-02-20 04:21:23.456] I [OutboundWorker:32] Publisher | stream.cpp:212  | ThumbnailPublisher Application application has started [danil(1818817424)] stream (MSID : 0)
[2024-02-20 04:21:23.456] W [OutboundWorker:32] Publisher | application.cpp:287  | OVTPublisher Application could not start [danil] stream.
[2024-02-20 04:27:24.387] I [SPRTMP-t1935:20] RTMPProvider | rtmp_provider.cpp:216  | The RTMP client has disconnected: [#default#tc/danil], remote: <ClientSocket: 0x7f9b33e01010, #18, Disconnected, TCP, Nonblocking, 10.0.1.1:50064>

Additional context

Here is an example of how I check the issue using an OME deployment in Kubernetes. The OriginMapStore entry seems to be present in Redis for a few more seconds after the Origin server is restarted, but then it's just gone.

$ kl -n ome get pod
NAME                              READY   STATUS    RESTARTS   AGE
ome-origin-cpu-5d774bff5f-64dl6   1/1     Running   0          8s
redis-7557f5946c-678nl            1/1     Running   0          20m
danil@danil-main:/mnt/c/Users/danil/Documents/k8s-public-copy
$ kl -n ome exec deployments/redis -- redis-cli -a "$redis_pass" keys "*"
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
tc/danil
danil@danil-main:/mnt/c/Users/danil/Documents/k8s-public-copy
$ kl -n ome delete pod ome-origin-cpu-5d774bff5f-64dl6 
pod "ome-origin-cpu-5d774bff5f-64dl6" deleted
danil@danil-main:/mnt/c/Users/danil/Documents/k8s-public-copy
$ kl -n ome exec deployments/redis -- redis-cli -a "$redis_pass" keys "*"
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
tc/danil
danil@danil-main:/mnt/c/Users/danil/Documents/k8s-public-copy
$ kl -n ome exec deployments/redis -- redis-cli -a "$redis_pass" keys "*"
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
tc/danil
danil@danil-main:/mnt/c/Users/danil/Documents/k8s-public-copy
$ kl -n ome exec deployments/redis -- redis-cli -a "$redis_pass" keys "*"
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.

danil@danil-main:/mnt/c/Users/danil/Documents/k8s-public-copy
$ kl -n ome exec deployments/redis -- redis-cli -a "$redis_pass" keys "*"
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.

danil@danil-main:/mnt/c/Users/danil/Documents/k8s-public-copy
$ kl -n ome exec deployments/redis -- redis-cli -a "$redis_pass" keys "*"
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.

getroot commented 9 months ago

Yes, based on that scenario, I think this issue will recur. I need to think about how to solve this problem while still resolving conflicts between multiple servers. Thanks for reporting the bug.

stale[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

d-uzlov commented 2 months ago

@getroot could you reopen this, please?

getroot commented 2 months ago

Sorry for the long wait.

The issue was that the app/stream key was still registered in Redis, so when the Origin was turned back on and tried to register it, the registration was rejected. (The Redis key expiry is set to 10 seconds.)
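A quick way to observe that expiry window, a sketch assuming the entry is keyed by tc/danil as in the transcripts above:

    # TTL reports the key's remaining lifetime in seconds. A value counting
    # down from ~10 after the Origin dies matches the 10-second expiry
    # described above; -2 means the key is already gone, -1 means it exists
    # with no expiry set.
    redis-cli -a "$redis_pass" --no-auth-warning ttl 'tc/danil'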

I solved this issue. Please test with the latest master branch and see if it solves your issue. Thanks.

d-uzlov commented 2 months ago

Thank you for the fix.

I tried switching to the container docker.io/airensoft/ovenmediaengine@sha256:a2438890d695522a9c763d14c5fa57626199800e9afdae54ff8877ce4635fe88 from the dev tag, which was created 3 days ago and, if I understand correctly, should correspond to the latest commit on the master branch.

The issue with the Redis entry seems to be fixed. It no longer disappears, and in the container logs I see a repeated "stream is already registered with different origin host" message, eventually followed by "stream is registered with origin host". Also, if I manually delete the Redis entry, it gets re-created.
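A sketch of that manual check, assuming the entry is a plain string key named tc/danil:

    # Delete the entry, wait past the apparent re-registration interval, and
    # confirm the Origin has written it back (the value should be its ovt://
    # address).
    redis-cli -a "$redis_pass" --no-auth-warning del 'tc/danil'
    sleep 15
    redis-cli -a "$redis_pass" --no-auth-warning get 'tc/danil'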

However, I can't really test whether it works with the Edge server, because OVT seems to be broken for me.

When I try to watch the stream via the Edge, the Edge logs look like this:

[2024-09-28 03:06:05.803] I [SPRtcSig-t3333:8] Signalling | rtc_signalling_server.cpp:316  | New client is connected: <ClientSocket: 0x71f72c605a10, #14, Connected, TCP, Nonblocking, 10.201.3.174:38782>
[2024-09-28 03:06:05.807] I [SPRtcSig-t3333:8] Publisher | publisher.cpp:179  | Try to pull stream from local origin map: [#default#tc/danil]
[2024-09-28 03:06:05.807] I [SPRtcSig-t3333:8] Orchestrator | orchestrator.cpp:1247 | Could not find the origin for the location: /tc/danil
[2024-09-28 03:06:05.807] I [SPRtcSig-t3333:8] Publisher | publisher.cpp:183  | Try to pull stream from origin map store: [#default#tc/danil]
[2024-09-28 03:06:05.807] E [SPRtcSig-t3333:8] Orchestrator | orchestrator.cpp:541  | Could not find Origin for the stream: [#default#tc/danil]
[2024-09-28 03:06:05.809] I [SPRtcSig-t3333:8] Orchestrator | orchestrator.cpp:483  | Trying to create dynamic application for the stream: [#default#tc/danil]
[2024-09-28 03:06:05.809] I [SPRtcSig-t3333:8] Orchestrator | orchestrator.cpp:1297 | Trying to create an application: [#default#tc]
[2024-09-28 03:06:05.809] I [SPRtcSig-t3333:8] Monitor | host_metrics.cpp:58   | Create ApplicationMetrics(#default#tc/6b9eb881-a269-40e0-a5c3-4d41aec64cdd/default/#default#tc) for monitoring
[2024-09-28 03:06:05.809] I [SPRtcSig-t3333:8] MediaRouter | mediarouter_application.cpp:52   | [#default#tc(1)] Created Mediarouter application. Worker(1) DelayBufferTime(0)
[2024-09-28 03:06:05.810] I [SPRtcSig-t3333:8] MediaRouter | mediarouter.cpp:83   | Created Mediarouter. app(#default#tc)
[2024-09-28 03:06:05.812] I [SPRtcSig-t3333:8] Publisher | application.cpp:226  | WebRTC Publisher Application has created [#default#tc] application
[2024-09-28 03:06:05.812] I [SPRtcSig-t3333:8] Publisher | application.cpp:226  | LLHLS Publisher Application has created [#default#tc] application
[2024-09-28 03:06:05.813] I [SPRtcSig-t3333:8] Transcoder | transcoder_application.cpp:34   | Created transcoder application. [#default#tc]
[2024-09-28 03:06:05.812] W [SPRtcSig-t3333:8] Publisher | publisher.cpp:68   | File publisher is disabled in #default#tc application, so it was not created
[2024-09-28 03:06:05.812] W [SPRtcSig-t3333:8] Publisher | publisher.cpp:68   | Push publisher is disabled in #default#tc application, so it was not created
[2024-09-28 03:06:05.812] W [SPRtcSig-t3333:8] Publisher | publisher.cpp:68   | Thumbnail publisher is disabled in #default#tc application, so it was not created
[2024-09-28 03:06:05.813] I [SPRtcSig-t3333:8] Transcoder | transcoder.cpp:94   | Transcoder has created [dynamic][#default#tc] application
[2024-09-28 03:06:05.813] I [SPRtcSig-t3333:8] Provider | application.cpp:46   | OVTProvider has created [#default#tc] application
[2024-09-28 03:06:05.813] W [SPRtcSig-t3333:8] Provider | provider.cpp:91   | RTSP Pull provider is disabled in #default#tc application, so it was not created
[2024-09-28 03:06:05.813] W [SPRtcSig-t3333:8] Provider | provider.cpp:91   | Scheduled provider is disabled in #default#tc application, so it was not created
[2024-09-28 03:06:05.813] W [SPRtcSig-t3333:8] Provider | provider.cpp:91   | Multiplex provider is disabled in #default#tc application, so it was not created
[2024-09-28 03:06:05.813] I [SPRtcSig-t3333:8] Orchestrator | orchestrator.cpp:506  | Trying to pull stream [#default#tc/danil] from provider using URL: PullProvider
[2024-09-28 03:06:07.333] E [SPRtcSig-t3333:8] OvtStream | ovt_stream.cpp:633  | [#default#tc/danil] An error occurred while receiving packet: [Socket] Receive timed out: Resource temporarily unavailable (11)
[2024-09-28 03:06:07.333] E [SPRtcSig-t3333:8] OvtStream | ovt_stream.cpp:608  | #default#tc/danil(103) - Could not receive packet : err(0)
[2024-09-28 03:06:08.869] E [SPRtcSig-t3333:8] OvtStream | ovt_stream.cpp:633  | [#default#tc/danil] An error occurred while receiving packet: [Socket] Receive timed out: Resource temporarily unavailable (11)
[2024-09-28 03:06:08.869] E [SPRtcSig-t3333:8] OvtStream | ovt_stream.cpp:608  | #default#tc/danil(103) - Could not receive packet : err(0)
[2024-09-28 03:06:10.405] E [SPRtcSig-t3333:8] OvtStream | ovt_stream.cpp:633  | [#default#tc/danil] An error occurred while receiving packet: [Socket] Receive timed out: Resource temporarily unavailable (11)
[2024-09-28 03:06:10.405] E [SPRtcSig-t3333:8] OvtStream | ovt_stream.cpp:608  | #default#tc/danil(103) - Could not receive packet : err(0)
[2024-09-28 03:06:10.405] I [SPRtcSig-t3333:8] Provider | stream.cpp:65   | #default#tc/danil(103) has been stopped playing stream
[2024-09-28 03:06:10.406] E [SPRtcSig-t3333:8] Provider | provider.cpp:154  | OVTProvider Application could not create [danil] stream.
[2024-09-28 03:06:10.406] E [SPRtcSig-t3333:8] Orchestrator | orchestrator.cpp:519  | Could not pull stream [#default#tc/danil] from provider: PullProvider
[2024-09-28 03:06:10.406] E [SPRtcSig-t3333:8] WebRTC Publisher | webrtc_publisher.cpp:316  | Cannot find stream (#default#tc/danil)
[2024-09-28 03:06:10.406] E [SPRtcSig-t3333:8] Signalling | rtc_signalling_server.cpp:407  | Cannot find stream [#default#tc/danil]
[2024-09-28 03:06:10.408] I [SPRtcSig-t3333:8] Signalling | rtc_signalling_server.cpp:450  | Client is disconnected: HttpConnection(0x71f72c605210) : WebSocket <ClientSocket: 0x71f72c605a10, #14, Connected, TCP, Nonblocking, 10.201.3.174:38782> TLS(Disabled) (#default#tc / danil, ufrag: local: (N/A), remote: (N/A))
[2024-09-28 03:06:10.409] I [SPRtcSig-t3333:8] HTTP.Server | http_server.cpp:216  | Client(<ClientSocket: 0x71f72c605a10, #14, Closed, TCP, Nonblocking, 10.201.3.174:38782>) has been disconnected by *:3333

The Origin logs look like this:

[2024-09-28 03:09:54.085] I [DQOMapC:21] OriginMapClient | origin_map_client.cpp:156  | OriginMapStore: <tc/danil> stream is registered with origin host : ovt://10.201.3.176:9000/tc/danil
[2024-09-28 03:10:02.679] I [SPOvtPub-t9000:14] OVT | ovt_publisher.cpp:181  | OvtProvider is connected : <ClientSocket: 0x71e6cd801010, #22, Connected, TCP, Nonblocking, 10.201.3.55:55910>
[2024-09-28 03:10:03.641] I [SPAPISvr-t8081:37] HTTP.Server | http_server.cpp:149  | Client(<ClientSocket: 0x71e6caa0e010, #23, Connected, TCP, Nonblocking, 10.201.3.169:44444>) is connected on *:8081
[2024-09-28 03:10:03.642] I [SPAPISvr-t8081:37] HTTP.Server | http_server.cpp:223  | Client(<ClientSocket: 0x71e6caa0e010, #23, Disconnected, TCP, Nonblocking, 10.201.3.169:44444>) has disconnected from *:8081
[2024-09-28 03:10:05.689] I [SPOvtPub-t9000:14] OVT | ovt_publisher.cpp:266  | OvtProvider is disconnected(2) : <ClientSocket: 0x71e6cd801010, #22, Disconnected, TCP, Nonblocking, 10.201.3.55:55910>
[2024-09-28 03:10:05.689] I [SPOvtPub-t9000:14] OVT | ovt_publisher.cpp:181  | OvtProvider is connected : <ClientSocket: 0x71e6cd801210, #22, Connected, TCP, Nonblocking, 10.201.3.55:55926>
[2024-09-28 03:10:08.693] I [SPOvtPub-t9000:14] OVT | ovt_publisher.cpp:266  | OvtProvider is disconnected(2) : <ClientSocket: 0x71e6cd801210, #22, Disconnected, TCP, Nonblocking, 10.201.3.55:55926>
[2024-09-28 03:10:08.694] I [SPOvtPub-t9000:14] OVT | ovt_publisher.cpp:181  | OvtProvider is connected : <ClientSocket: 0x71e6cd801410, #22, Connected, TCP, Nonblocking, 10.201.3.55:55932>
[2024-09-28 03:10:11.697] I [SPOvtPub-t9000:14] OVT | ovt_publisher.cpp:266  | OvtProvider is disconnected(2) : <ClientSocket: 0x71e6cd801410, #22, Disconnected, TCP, Nonblocking, 10.201.3.55:55932>
[2024-09-28 03:10:13.641] I [SPAPISvr-t8081:37] HTTP.Server | http_server.cpp:149  | Client(<ClientSocket: 0x71e6caa0e010, #22, Connected, TCP, Nonblocking, 10.201.3.169:58440>) is connected on *:8081
[2024-09-28 03:10:13.642] I [SPAPISvr-t8081:37] HTTP.Server | http_server.cpp:223  | Client(<ClientSocket: 0x71e6caa0e010, #22, Disconnected, TCP, Nonblocking, 10.201.3.169:58440>) has disconnected from *:8081

This only happens when I use the current dev tag for the Origin container. If I revert the Origin to the 0.16.5 tag, everything seems to work fine, with both the old and the new Edge.

Should I create a new issue to track this, or is it something related to the old Redis issue?

getroot commented 2 months ago

I can't reproduce your issue.

I'm testing with 1 Edge and 1 Origin, and it works fine when I restart the Origin.

When the Origin restarts, the log below doesn't appear, because the restarted Origin overwrites the same origin host already registered in Redis.

The log message "stream is already registered with different origin host" is printed only when a different OVT address is registered under the same key (app/stream) in Redis; that is, it is printed when two Origins try to register the same app/stream.

Maybe because you use ${env:OME_HOST_IP}, the IP may have changed. If so, that's not the cause of the issue.
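For illustration, a sketch of the conflict described above, with hypothetical addresses (in a real deployment the keys are written by OME itself):

    # First Origin registers the stream with a 10-second expiry.
    redis-cli -a "$redis_pass" --no-auth-warning set 'tc/danil' 'ovt://10.0.0.1:9000/tc/danil' EX 10
    # A second Origin with a different IP attempting the same key hits the
    # "already registered with different origin host" case: the stored value
    # does not match its own address.
    redis-cli -a "$redis_pass" --no-auth-warning get 'tc/danil'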

What should I do to reproduce your issue? Any ideas?

Are you testing with the latest version of both origin and edge? Or are you using a different version of edge?

And to help me see the issue more precisely, could you show me your edge log and origin log from the same time frame? I'd like to see what the origin log looks like when that error occurs on edge.

Also, do you have a problem playing from the Edge in general situations, not only in the special situation where the Origin restarts?

getroot commented 2 months ago

Could it be that the Edge is connected to an Origin that is being shut down?

d-uzlov commented 1 month ago

Sorry for the late reply.

The log message "stream is already registered with different origin host" is printed only when a different OVT address is registered under the same key (app/stream) in Redis; that is, it is printed when two Origins try to register the same app/stream.

Maybe because you use ${env:OME_HOST_IP}, the IP may have changed. If so, that's not the cause of the issue.

Yes, this is almost always the case in Kubernetes: a new pod always gets a new IP. I imagine the same would happen with a bare-metal deployment if a streamer connects to the Origin via a load balancer that can choose between several Origins.
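For reference, a sketch of how to confirm the IP churn, assuming a hypothetical app=ome-origin label on the Origin pods:

    # The IP column changes every time the pod is replaced.
    kubectl -n ome get pod -l app=ome-origin -o wide
    kubectl -n ome delete pod -l app=ome-origin
    kubectl -n ome get pod -l app=ome-origin -o wide   # new pod, new IP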

This isn't a big problem; it sorts itself out eventually. Stream registration may be delayed a bit, but I don't mind that.

Also, do you have a problem playing from the Edge in general situations, not only in the special situation where the Origin restarts?

Yes, this is the issue for me. While I can see in the logs and in Redis that the missing Redis entry is fixed, I also wanted to test it in a real-life scenario, just to be sure. But I can't: playback doesn't work, even without any restarts. The Edge doesn't connect to the Origin at all.

I'm pretty sure it's a different issue. But I see a possibility that this new issue was introduced while fixing this one, so I'm not sure whether I should post a new GitHub issue for it. I haven't had time to debug it and check different commits.

Are you testing with the latest version of both origin and edge? Or are you using a different version of edge?

For my testing I used the docker.io/airensoft/ovenmediaengine:dev image, which resolved to docker.io/airensoft/ovenmediaengine@sha256:a2438890d695522a9c763d14c5fa57626199800e9afdae54ff8877ce4635fe88. Judging by the creation time, it should correspond to this commit: https://github.com/AirenSoft/OvenMediaEngine/commit/900589b1db7e47d9235a04f9b4c0c82fd27875d1
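A sketch of how that correspondence can be checked, assuming Docker is available locally:

    # Pull the exact digest and read its build timestamp, then compare it
    # against the commit date on GitHub.
    docker pull docker.io/airensoft/ovenmediaengine@sha256:a2438890d695522a9c763d14c5fa57626199800e9afdae54ff8877ce4635fe88
    docker inspect --format '{{.Created}}' \
        docker.io/airensoft/ovenmediaengine@sha256:a2438890d695522a9c763d14c5fa57626199800e9afdae54ff8877ce4635fe88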

I used this image for the Origin server to check the Redis issue. For the Edge I tried both this image and the 0.16.5 release. Both refused to work with the Origin from :dev.

d-uzlov commented 1 month ago

I'm sorry to say that I can't reproduce it right now, even when using these tags and the same config. Something in my environment must have changed, which makes it even stranger that when I tested a few days ago, only the new version of OME had issues.

Sorry for bothering you.

getroot commented 1 month ago

If the Origin server dies abnormally, its OVT address remains in Redis. After 10 seconds, this information is deleted from Redis, and the new Origin is able to register its own OVT address. My guess is that the problem occurs when the Edge tries to connect to the dead Origin server while the stale information is still in Redis, after the first Origin dies. Of course, this is only a guess.
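A sketch of how that 10-second window could be observed, reusing the hypothetical app=ome-origin label and the tc/danil key from above:

    # Kill the Origin abruptly, then sample the key's TTL once per second.
    # Per the explanation above, the stale entry should count down from ~10,
    # return -2 (gone) for a moment, and then reappear once the new Origin
    # registers its own address.
    kubectl -n ome delete pod -l app=ome-origin --grace-period=1 &
    for i in $(seq 1 20); do
        printf '%s  ttl=%s\n' "$(date +%T)" \
            "$(redis-cli -a "$redis_pass" --no-auth-warning ttl 'tc/danil')"
        sleep 1
    done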