algora-io / tv

Open source Twitch for developers
https://algora.tv

Adaptive bitrate livestreams #116

Closed: lastcanal closed this 1 week ago

lastcanal commented 3 weeks ago

Hello! This PR includes all of #113 and adds adaptive bit-rate transcoding to the live streaming pipeline. The wonderful team over at Membrane have merged the changes I needed into the RTMP plugin and made this PR possible. Although not everything in this PR has been tested, anything untested is only available behind a feature flag. I have added issue #115 to track these features as they are tested.

Three transcoding backends are available: Software (ffmpeg), Nvidia, and Xilinx. I have only tested the Software backend. Transcoding in the pipeline can be configured with the TRANSCODE environment variable. The format is <height>p<framerate>@<bitrate>, with renditions separated by a pipe (|). For example: 1440p30@4000000|720p30@2000000.
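As a rough illustration of that format (not the exact parsing code in this PR; the module and function names below are made up):

defmodule Algora.TranscodeConfig do
  # Hypothetical helper, not part of this PR: parses the TRANSCODE value
  # into a list of rendition maps.
  def parse_renditions(nil), do: []

  def parse_renditions(value) when is_binary(value) do
    value
    |> String.split("|", trim: true)
    |> Enum.map(fn spec ->
      [height, framerate, bitrate] =
        Regex.run(~r/^(\d+)p(\d+)@(\d+)$/, spec, capture: :all_but_first)

      %{
        height: String.to_integer(height),
        framerate: String.to_integer(framerate),
        bitrate: String.to_integer(bitrate)
      }
    end)
  end
end

# Algora.TranscodeConfig.parse_renditions("1440p30@4000000|720p30@2000000")
# #=> [%{height: 1440, framerate: 30, bitrate: 4000000},
#      %{height: 720, framerate: 30, bitrate: 2000000}]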

To avoid overloading the application servers, I have set up the pipelines to run inside FLAME, which will in theory use Fly to boot a new server and start the pipeline there. I have only tested the local backend.
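Roughly how the FLAME wiring might look, sketched under the assumption that a pool is sized from the FLAME_* variables listed below (the pool and pipeline module names are illustrative):

# In the application supervision tree: a FLAME pool sized from the FLAME_*
# variables listed further down (the pool name is made up).
children = [
  {FLAME.Pool,
   name: Algora.PipelinePool,
   min: 0,
   max: String.to_integer(System.get_env("FLAME_MAX", "1")),
   max_concurrency: String.to_integer(System.get_env("FLAME_MAX_CONCURRENCY", "10")),
   idle_shutdown_after: :timer.seconds(String.to_integer(System.get_env("FLAME_IDLE_SHUTDOWN_AFTER", "30")))}
]

# When a stream connects, the app server runs the Membrane pipeline on a
# FLAME runner (a fresh machine with the fly backend, the same node with
# the local backend) and keeps forwarding the RTMP data to it over RPC.
FLAME.call(Algora.PipelinePool, fn ->
  Membrane.Pipeline.start_link(Algora.Pipeline, [])
end)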

H265 video is supported, but unfortunately Vidstack offers both H264 and H265 tracks to the viewer if H265 tracks are supported by their browser. H265 can be disabled with SUPPORTS_H265=false.

The following new environment variables have been added:

RESUME_RTMP=true
RESUME_RTMP_ON_UNPUBLUSH=false
RESUME_RTMP_TIMEOUT=3600
SUPPORTS_H265=false
TRANSCODE=4320p60@32000000|2160p60@16000000|1440p60@8000000|1440p30@4000000|720p30@2000000|360p30@1000000|180p30@500000
FLAME_BACKEND=local
FLAME_MAX=1
FLAME_MAX_CONCURRENCY=10
FLAME_IDLE_SHUTDOWN_AFTER=30
#FLAME_MIX_TARGET=nvidia
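Roughly how these might be read in config/runtime.exs (the config key names below are illustrative; the PR may organize them differently):

import Config

# Illustrative mapping of the new flags to app config; key names are made up.
config :algora, :rtmp,
  resume: System.get_env("RESUME_RTMP", "false") == "true",
  resume_on_unpublish: System.get_env("RESUME_RTMP_ON_UNPUBLUSH", "false") == "true",
  resume_timeout: String.to_integer(System.get_env("RESUME_RTMP_TIMEOUT", "3600"))

config :algora, :transcoding,
  supports_h265: System.get_env("SUPPORTS_H265", "true") == "true",
  renditions: System.get_env("TRANSCODE")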

The following configuration should allow the pipeline to operate as it does today, except with reconnectable RTMP:

RESUME_RTMP=true
RESUME_RTMP_ON_UNPUBLUSH=true
RESUME_RTMP_TIMEOUT=3600
SUPPORTS_H265=false
FLAME_BACKEND=local
zcesur commented 3 weeks ago

Amazing!

Love to see you used FLAME for this. With :syn the pipelines can resume from another FLAME runner/node right?

Can we keep the source rendition in the manifest as well?

lastcanal commented 3 weeks ago

Yes, the pipeline should resume even under FLAME. The RTMP stream connects to an already running application server, which will then launch the pipeline on a FLAME node and forward audio and video messages via RPC. This means that the RTMP stream will be connected to a public-facing application server while the pipeline and transcoding will happen on a non-public-facing FLAME server (with the fly backend). If the streamer reconnects to another application server then syn will find the FLAME node running their pipeline and the stream of RPC messages starts again. I still haven't actually tried this with FLAME's Fly backend, only with multiple nodes running locally. I am going to try to get it running on Fly this week, and I also want to try out the hardware transcoding.
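As a loose sketch of that lookup path (the scope, key, and message names below are made up, not necessarily what the PR uses):

# On every node, join the scope once (e.g. during application start):
:syn.add_node_to_scopes([:pipelines])

stream_key = "demo-stream"

# On the FLAME node, once the pipeline process is up, register it under the
# stream key (here self() stands in for the real pipeline pid):
:ok = :syn.register(:pipelines, stream_key, self())

# On whichever app server accepts the reconnecting RTMP socket, find the
# existing pipeline and resume forwarding, or start a fresh one:
case :syn.lookup(:pipelines, stream_key) do
  {pid, _meta} -> send(pid, {:resume_rtmp, self()})
  :undefined -> :start_a_new_pipeline
end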

As for including the source in the manifest, absolutely! I will add an extra configuration option to allow the untranscoded source version to be included in the manifest. It is currently only included if you disable all transcoding by unsetting the TRANSCODE environment variable.

My mornings are much more open this week (I'm UTC-5), so I will set up a meeting and walk you through everything.

zcesur commented 2 weeks ago

This means that the RTMP stream will be connected to a public-facing application server while the pipeline and transcoding will happen on a non-public-facing FLAME server (with the fly backend). If the streamer reconnects to another application server then syn will find the FLAME node running their pipeline and the stream of RPC messages starts again.

That's perfect, can't wait to try this out! Thanks for all the changes :heart_hands:

Should be good to merge once it's tested on prod. Currently I'm getting the error below; not sure where the membrane_h26x_plugin dependency is coming from.

 => CACHED [builder 14/17] RUN mix compile 0.0s
 => CACHED [builder 15/17] COPY config/runtime.exs config/ 0.0s
 => CACHED [builder 16/17] COPY rel rel 0.0s
 => ERROR [builder 17/17] RUN mix release 1.9s
------
 > [builder 17/17] RUN mix release:
1.177 * assembling algora-0.1.0 on MIX_ENV=prod
1.177 * using config/runtime.exs to configure the release at runtime
1.881 ** (Mix) Duplicated modules:
1.881   'Elixir.Membrane.H265.Parser' specified in membrane_h265_plugin and membrane_h26x_plugin

using a modified Dockerfile with

ARG BUILDER_IMAGE="hexpm/elixir:1.17.3-erlang-26.2.5.5-debian-bookworm-20241016-slim"
ARG RUNNER_IMAGE="debian:bookworm-20241016-slim"
lastcanal commented 2 weeks ago

I've removed membrane_h265_plugin in favor of the newer membrane_h26x_plugin. That should fix the problem. I've also removed my fork of membrane_rtmp_plugin because my changes got merged upstream!
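For anyone following along, the deps change is roughly this shape (version requirements are illustrative; check mix.exs for the exact ones):

# Relevant part of deps/0 in mix.exs, illustrative versions only:
# membrane_h265_plugin and the forked membrane_rtmp_plugin are gone, and
# membrane_h26x_plugin now provides Membrane.H264.Parser and Membrane.H265.Parser.
[
  {:membrane_h26x_plugin, "~> 0.10"},
  {:membrane_rtmp_plugin, "~> 0.25"}
  # ...
]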

lastcanal commented 2 weeks ago

I've pushed cdc836c, which changes how low-latency HLS partials are served. Currently partial segments are served from the application server; with this change they are instead uploaded to Tigris and deleted when they are no longer needed. When a partial segment is ready, including after waiting on an X-PRELOAD-HINT, a 302 redirect to the Tigris bucket is served.
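A loose sketch of that request path (the controller, the readiness check, and the Tigris URL layout below are illustrative stand-ins, not the code in cdc836c):

defmodule AlgoraWeb.LLPartialSketchController do
  # Illustrative Phoenix controller action; every name here is made up.
  use Phoenix.Controller

  def partial(conn, %{"stream" => stream, "segment" => segment}) do
    # Block until the partial has been uploaded (possibly after waiting on
    # an X-PRELOAD-HINT target), then 302-redirect the player to Tigris.
    :ok = await_partial_ready(stream, segment)
    redirect(conn, external: tigris_url(stream, segment))
  end

  # Stand-in for the real readiness check (e.g. waiting on an ETS marker).
  defp await_partial_ready(_stream, _segment), do: :ok

  # Hypothetical public URL of the uploaded partial in the Tigris bucket.
  defp tigris_url(stream, segment) do
    "https://fly.storage.tigris.dev/#{bucket()}/#{stream}/#{segment}"
  end

  defp bucket, do: System.get_env("TIGRIS_BUCKET", "algora-hls")
end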

Here is a screencast of the 302 redirect in action

Screencast from 2024-11-11 17-46-42.webm

zcesur commented 2 weeks ago

Just deployed on staging, works great overall!

I think there's a regression with thumbnail generation, can we fix that?

Btw is there anything wrong with triggering toggle_streamer_live on :end_of_stream, :resume_rtmp and init, instead of only on :terminate?

When a partial segment is ready, including after waiting on an X-PRELOAD-HINT, a 302 redirect to the Tigris bucket is served.

That's awesome! Now that we are serving partials from Tigris, can we eliminate the LLController.broadcast! calls to save bandwidth?

At the moment we spawn a new LLController instance per stream per node and they all cache partials independently, but I think we should be able to get away with a single LLController in the same node as the pipeline that blocks playlist requests and redirects clients to Tigris

How about we create a new branch for the LL-HLS stuff and I'll go ahead and merge this

lastcanal commented 1 week ago

Just deployed on staging, works great overall!

That's great!

I think there's a regression with thumbnail generation, can we fix that?

I will look into this. Most likely related to the latest commit cdc836c.

Btw is there anything wrong with triggering toggle_streamer_live on :end_of_stream, :resume_rtmp and init, instead of only on :terminate?

The only issue is that when toggle_streamer_live(false) is called, the manifest URL gets changed to Tigris. That part could get split into another function that gets called on terminate.
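Something like this split, loosely (all names below are made up for illustration):

defmodule Algora.Pipeline.LiveStateSketch do
  # Loose illustration of the split: the live flag can be toggled freely on
  # init, :resume_rtmp and :end_of_stream, while the manifest URL only moves
  # to Tigris on :terminate. Every name here is hypothetical.

  def toggle_streamer_live(video, live?) do
    broadcast_live_status(video, live?)
  end

  def finalize_stream(video) do
    update_manifest_url(video, :tigris)
  end

  # Stand-ins for the real persistence and broadcast calls.
  defp broadcast_live_status(video, _live?), do: video
  defp update_manifest_url(video, _target), do: video
end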

When a partial segment is ready, including after waiting on an X-PRELOAD-HINT, a 302 redirect to the Tigris bucket is served.

That's awesome! Now that we are serving partials from Tigris, can we eliminate the LLController.broadcast! calls to save bandwidth?

We still need to continue broadcasting to every instance, but we no longer broadcast any video or audio over the cluster. I am sure it can be cleaned up, but now it only sends and stores the :ready atom in ETS.
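Roughly what crosses the cluster now, sketched with an illustrative topic, table, and message shape (and assuming the app's PubSub is named Algora.PubSub):

# Illustrative: only a readiness marker crosses the cluster, never media bytes.
stream_key = "demo-stream"
partial = "video_segment_42_part_3.m4s"

# From the node running the pipeline:
Phoenix.PubSub.broadcast!(Algora.PubSub, "ll:#{stream_key}", {:ready, partial})

# In each LLController, on receiving {:ready, partial}, cache just the marker:
table = :ets.new(:ll_partials, [:set, :public])
:ets.insert(table, {{stream_key, partial}, :ready})

# A playlist/partial request blocked on a preload hint checks the marker and
# then 302s the client to Tigris:
case :ets.lookup(table, {stream_key, partial}) do
  [{_key, :ready}] -> :redirect_to_tigris
  [] -> :keep_waiting
end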

At the moment we spawn a new LLController instance per stream per node and they all cache partials independently, but I think we should be able to get away with a single LLController in the same node as the pipeline that blocks playlist requests and redirects clients to Tigris

I think we will still want an LLController per app server because web clients waiting for preload hint messages would cause a lot of inter-cluster messages. We could try using https://github.com/discord/manifold, but we would still need a way to distribute and cache manifests on each node.

How about we create a new branch for the LL-HLS stuff and I'll go ahead and merge this

Sounds good! I'll drop cdc836c and push it to a new branch.

zcesur commented 1 week ago

The only issue is that when toggle_streamer_live(false) is called, the manifest URL gets changed to Tigris. That part could get split into another function that gets called on terminate.

Gotcha, yeah that makes sense

We still need to continue broadcasting to every instance, but we no longer broadcast any video or audio over the cluster. I am sure it can be cleaned up, but now it only sends and stores the :ready atom in ETS.

Oh, I hadn't noticed, that's perfect!

I think we will still want an LLController per app server because web clients waiting for preload hint messages would cause a lot of inter-cluster messages. We could try using https://github.com/discord/manifold, but we would still need a way to distribute and cache manifests on each node.

Agreed, let's keep it that way

I think there's a regression with thumbnail generation, can we fix that?

I will look into this. Most likely related to the latest commit cdc836c

Looks like the pattern never matches because segment_sn is obsolete; we need to match on sequences: %{} instead.