algora-io / tv

Open source Twitch for developers
https://algora.tv

Reconnect RTMP #113

Closed lastcanal closed 2 weeks ago

lastcanal commented 1 month ago

This PR allows RTMP connections to re-connect to the Pipeline.

- syn gets added as a global registry for pipelines
- membrane_rtmp_plugin gets upgraded to the latest version, which requires Erlang 27.1

I also added a .tool-versions file with the following:

elixir 1.17.3-otp-27
erlang 27.1

Running asdf install in the project directory will install the correct Elixir/Erlang versions.

A new environment variable is required to activate reconnects:

RESUME_RTMP=true

This PR supports:

Screencast from 2024-10-27 13-28-16.webm

While I have attempted to configure HLS.js to smoothly restart the video after it exhausts its buffers, it doesn't always work and a refresh is sometimes required. If the reconnect happens quickly, before the viewer's HLS.js buffer has been exhausted, the transition is fully seamless.

Reconnects are accomplished by keeping the Pipeline in the running state until a timeout is reached, attaching and detaching input pads as required. Pipelines are globally registered via :syn, indexed by stream key, allowing a reconnecting RTMP stream to find its running pipeline; this also has the side effect of allowing only one pipeline per user. Thanks to Erlang and Membrane magic, RTMP streams can reconnect to a different server from the one running their pipeline, allowing round-robin DNS. FWIW I have only tested this with different VMs, not different hosts.
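For illustration, the "one pipeline per stream key" lookup-or-start pattern described above can be sketched like this. The PR uses :syn for the global registry; this sketch substitutes Erlang's built-in :global module, which has a similar register/lookup shape, so it runs without extra dependencies. The module and function names are hypothetical, not taken from the PR.

```elixir
defmodule PipelineRegistry do
  # Find the pipeline already running for this stream key, or start one.
  # A reconnecting RTMP client lands on the existing pipeline, possibly
  # running on a different node in the cluster.
  def find_or_start(stream_key, start_fun) do
    case :global.whereis_name({:pipeline, stream_key}) do
      :undefined ->
        pid = start_fun.()
        # Registration returns :no if another node won the race to register.
        case :global.register_name({:pipeline, stream_key}, pid) do
          :yes -> {:started, pid}
          :no -> {:existing, :global.whereis_name({:pipeline, stream_key})}
        end

      pid when is_pid(pid) ->
        {:existing, pid}
    end
  end
end
```

Registering by stream key is also what enforces the one-pipeline-per-user invariant: a second connection with the same key finds the existing pid instead of starting a fresh pipeline.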

Static pads deep in the muxed_av path of HLS's SinkBin were preventing reconnects. To overcome this, separate_av is required, which was not previously supported with low-latency storage. This PR also adds multiple-manifest support to the low-latency storage adapter. This change seems to fix #93.
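For context, separate_av means the multivariant playlist references distinct audio and video media playlists instead of a single muxed rendition; a minimal playlist of that shape (values illustrative, not taken from the PR) looks roughly like:

```
#EXTM3U
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="audio",NAME="default",DEFAULT=YES,URI="audio.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=3000000,CODECS="avc1.64001f,mp4a.40.2",AUDIO="audio"
video.m3u8
```

This is why the storage adapter now has to track multiple manifests per stream rather than just one.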

Included is a debouncing upload manager for manifests. This fixes a race condition where larger manifests containing LL-HLS parts would finish uploading after the smaller rewritten final manifest, overwriting the final manifest due to Tigris's last-write-wins semantics. Even a small debounce window also saves many unnecessary POST requests to Tigris.

Having separate manifests introduces a number of regressions:

Some of the things I have yet to test:

I've also added a new helper method Algora.Admin.terminate_pipelines!/0 to gracefully terminate any running pipeline, useful when testing, and imported some of the low-latency storage tests from Fishjam.
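A terminate-all helper of that shape might look roughly like the following. This is a self-contained sketch, not the actual Algora.Admin code: in the PR the pids would come from the :syn registry and likely be stopped via Membrane's pipeline termination; here a plain pid list and GenServer.stop/3 stand in so the sketch runs on its own.

```elixir
defmodule AdminSketch do
  # Gracefully stop every pipeline pid in the list, waiting for each to exit.
  # Useful in tests to guarantee no pipeline survives between runs.
  def terminate_all!(pids) do
    Enum.each(pids, fn pid ->
      if Process.alive?(pid), do: GenServer.stop(pid, :normal, 5_000)
    end)
  end
end
```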

Some follow-up PRs that are possible:

/claim #73

zcesur commented 4 weeks ago

This is super cool and works really well @lastcanal! Sometimes the player needs a refresh as you mentioned, but that's totally fine

I really like the debouncing upload manager and the partial segments on final manifest fix. The latter has been bugging me for quite a while haha

RTMP forwarding doesn't seem to work yet -- Terminating with reason: {:membrane_child_crash, :rtmp_sink_0, {%RuntimeError{message: "writing video frame failed with reason: \"Invalid argument\""}, do you think that would be a simple fix?

Thanks so much for your amazing work on this Ty! Would love to meet you over a call with Ioannis if you're free one of these days, sometime evening in Europe https://cal.com/zafercesur

lastcanal commented 4 weeks ago

Thanks @zcesur! I've pushed a fix for forwarding RTMP, hopefully it works for you too! The forwarded streams will disconnect then reconnect when the source returns and the chat websockets will stay open until the pipeline terminates.

I'd love to do a video call with you and Ioannis! Expect a meeting request soon :)

lastcanal commented 3 weeks ago

I've reverted my fix for forwarding RTMP and included it in #116. My branch that adds ABR is ready, and that force-pushed commit would have caused many rebase conflicts, requiring squashing and cherry-picking. I've rebased the #116 branch against main and it includes all commits from this branch; hopefully it's not too big to review.

zcesur commented 3 weeks ago

> I've pushed a fix for forwarding RTMP, hopefully it works for you too! The forwarded streams will disconnect then reconnect when the source returns and the chat websockets will stay open until the pipeline terminates.

Yeah it works!

Occasionally the pipeline crashes due to duplicate sink names, can we handle that somehow?

[error] GenServer #PID<0.3624.0> terminating
** (Membrane.ParentError) Duplicated names in children specification: [:rtmp_sink_2]
    (membrane_core 1.0.1) lib/membrane/core/parent/child_life_controller/startup_utils.ex:27: Membrane.Core.Parent.ChildLifeController.StartupUtils.check_if_children_names_unique/2
    (membrane_core 1.0.1) lib/membrane/core/parent/child_life_controller.ex:240: Membrane.Core.Parent.ChildLifeController.setup_children/3
    (elixir 1.17.3) lib/enum.ex:1305: anonymous fn/3 in Enum.flat_map_reduce/3
    (elixir 1.17.3) lib/enum.ex:4858: Enumerable.List.reduce/3
    (elixir 1.17.3) lib/enum.ex:1304: Enum.flat_map_reduce/3
    (membrane_core 1.0.1) lib/membrane/core/parent/child_life_controller.ex:141: Membrane.Core.Parent.ChildLifeController.handle_spec/2
    (membrane_core 1.0.1) lib/membrane/core/callback_handler.ex:197: anonymous fn/5 in Membrane.Core.CallbackHandler.handle_callback_result/5
    (elixir 1.17.3) lib/enum.ex:2531: Enum."-reduce/3-lists^foldl/2-0-"/3
    (membrane_core 1.0.1) lib/membrane/core/callback_handler.ex:195: Membrane.Core.CallbackHandler.handle_callback_result/5
    (membrane_core 1.0.1) lib/membrane/core/pipeline.ex:155: Membrane.Core.Pipeline.handle_info/2
    (stdlib 6.1) gen_server.erl:2345: :gen_server.try_handle_info/3
    (stdlib 6.1) gen_server.erl:2433: :gen_server.handle_msg/6
    (stdlib 6.1) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
lastcanal commented 3 weeks ago

I've pushed a change to the adaptive bitrate branch that makes the RTMP forwarding child names unique by including the reconnect count in the name: {rtmp_sink_<index>, <reconnect>}. Maybe the outgoing RTMP connections were slow to shut down and weren't removed before the reconnect happened. It's possible this will cause problems with the upstream RTMP server (YouTube, etc.) if a new connection comes in while the old one is still active. If this is an issue we can delay reconnecting until the disconnecting connection has actually closed.
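The naming scheme above can be sketched in one line; the point is that a sink from a previous connection that is still shutting down can never collide with the new connection's sink, because the reconnect count differs. The module name is hypothetical; in the PR the name would be built inside the pipeline's child spec.

```elixir
defmodule SinkNaming do
  # Build a unique Membrane child name for an outgoing RTMP sink:
  # {:rtmp_sink_<index>, <reconnect_count>}. Two connections for the same
  # sink index get distinct names as long as the reconnect count advances.
  def sink_name(index, reconnect_count),
    do: {:"rtmp_sink_#{index}", reconnect_count}
end
```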