coturn / coturn

coturn TURN server project

CoTURN scalability with multiple instances #1561

Open dsatizabal opened 3 weeks ago

dsatizabal commented 3 weeks ago

Hello all!

We're starting to implement a media project that streams video sources to one or more peers via WebRTC, and we plan to use CoTURN for it. We already have a PoC working, with the server on an Azure VM and some Node.js code for the signaling API and front end.

The production system is likely to have between 100K and 1M sessions, so we want it to scale elastically while keeping costs under control. Our first take is to use AKS to run a single CoTURN instance on each worker node and scale according to the load it handles. From the documentation I know that:

When used as a part of an ICE solution, for VoIP connectivity, this TURN server can handle thousands simultaneous calls per CPU (when TURN protocol is used) or tens of thousands calls when only STUN protocol is used

So we believe that a single node on a medium-range VM is likely to reach 100K. The problem is that each new worker node would run a separate CoTURN instance, and there's no guarantee that, under certain circumstances, peers trying to establish a connection will land on the same server/session, not to mention that we'll have to deal with TCP/UDP support issues.

Now, I also see in the documentation that:

For virtually unlimited scalability a load balancing scheme can be used. The load balancing can be implemented with the following tools (either one or a combination of them):

- DNS SRV based load balancing;
- built-in 300 ALTERNATE-SERVER mechanism (requires 300 response support by the TURN client);
- network load-balancer server.

My particular question is: is there documentation or other resources we can check to implement this unlimited scalability scheme? From my research, it seems CoTURN is not meant to be placed behind a load balancer, since it is more of an additional networking device than an application-layer service, so I'm somewhat confused about how to design my solution for the given requirements.

Any comments or help are much appreciated.

Thanks!

PS: if the solution is to change the AKS approach and use VMs, we may also embrace it, as long as it's a stable, secure, scalable and maintainable solution.

jonesmz commented 3 weeks ago

There is no reason to involve a TURN server in this situation.

So long as your WebRTC endpoint hosting the video is in the cloud and can always either:

  1. provide host candidates
  2. reach Google's STUN servers to get srflx candidates

then involving coturn in your architecture is simply adding a point of failure, extra cost, and extra complexity for literally no benefit.
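One way to check which candidate types an endpoint actually gathers is to look at the `typ` field of its ICE candidate lines. The following is only a minimal sketch: the `SAMPLE` lines use documentation addresses and made-up priorities, and `candidate_type` and `summarize` are hypothetical helper names, not part of any WebRTC library.

```python
from collections import Counter
from typing import Iterable


def candidate_type(candidate_line: str) -> str:
    """Return the token that follows 'typ' in an ICE candidate line."""
    tokens = candidate_line.split()
    idx = tokens.index("typ")  # raises ValueError if 'typ' is missing
    return tokens[idx + 1]


def summarize(candidates: Iterable[str]) -> Counter:
    """Count how many candidates of each type were gathered."""
    return Counter(candidate_type(c) for c in candidates)


# Illustrative candidate lines (addresses and priorities are invented).
SAMPLE = [
    "candidate:1 1 udp 2122260223 10.0.0.2 46154 typ host",
    "candidate:2 1 udp 1686052607 203.0.113.5 46154 typ srflx raddr 10.0.0.2 rport 46154",
    "candidate:3 1 udp 41885439 198.51.100.7 50000 typ relay raddr 203.0.113.5 rport 46154",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

If a session only ever connects over `host` or `srflx` candidates, the `relay` (TURN) path is not being exercised for it.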

dsatizabal commented 3 weeks ago

Sorry @jonesmz, I don't understand why you're saying we don't need a TURN server. The video sources we have aren't exactly in the cloud, and we need NAT traversal to establish communication even if one peer's network does not allow peer-to-peer communication. Also, we don't want to rely on third parties for this, so using Google's service is not an option we're considering. We have decent infrastructure deployed and maintained, which is why we'd rather use our own services.

Thanks anyway for your comment

jonesmz commented 3 weeks ago

Are you transmitting prerecorded video, or are you doing live video?

If live, then never mind.

If prerecorded, why host this behind NAT?

Anyway, for scaling, I recommend using DNS round-robin. When your coturn hosting VM comes online, make it register itself with your DNS as an additional address. When the VM is ready to shut down, deregister the instance, wait for all active connections to terminate (e.g. you can apply this patch: https://github.com/coturn/coturn/pull/1529 ), and then shut down.
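The register / deregister / drain lifecycle could be sketched roughly as follows. Everything here is an assumption for illustration: `RoundRobinRecordSet` is an in-memory stand-in for whatever API your DNS provider exposes, and `active_allocations` and `shutdown` are hypothetical callbacks you would have to wire to your own coturn monitoring and VM tooling.

```python
import time
from typing import Callable, Set


class RoundRobinRecordSet:
    """In-memory stand-in for a DNS A-record set (hypothetical, not a real DNS API)."""

    def __init__(self) -> None:
        self.addresses: Set[str] = set()

    def register(self, ip: str) -> None:
        self.addresses.add(ip)

    def deregister(self, ip: str) -> None:
        self.addresses.discard(ip)


def drain_and_shutdown(
    records: RoundRobinRecordSet,
    ip: str,
    active_allocations: Callable[[], int],  # assumed: returns current session count
    shutdown: Callable[[], None],           # assumed: stops the instance
    poll_seconds: float = 5.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Deregister first, wait for sessions to end, then shut down.

    Returns True if the instance drained fully before the timeout.
    """
    records.deregister(ip)  # stop handing this address to new clients
    waited = 0.0
    remaining = active_allocations()
    while remaining > 0 and waited < timeout_seconds:
        sleep(poll_seconds)
        waited += poll_seconds
        remaining = active_allocations()
    shutdown()
    return remaining == 0
```

Resolvers cache answers for the record's TTL, so a deregistered address can keep receiving new clients for a while; a short TTL on the round-robin record narrows that window.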

I recommend disallowing WebRTC over TCP or TLS; they introduce substantial quality issues into the video and audio.
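On the client side, one way to do this is to have the signaling API hand out only UDP TURN URIs. A minimal sketch, assuming a hypothetical `build_ice_servers` helper whose output is serialized to JSON and passed to the browser as the `iceServers` list; the host names and credentials are placeholders.

```python
from typing import Dict, List


def build_ice_servers(
    hosts: List[str],
    username: str,
    credential: str,
    port: int = 3478,  # default TURN port
) -> List[Dict[str, object]]:
    """Build an iceServers-style list that only advertises TURN over UDP.

    No 'turns:' (TLS) URIs and no '?transport=tcp' URIs are emitted,
    so clients are never offered a TCP or TLS path to the TURN server.
    """
    urls = [f"turn:{host}:{port}?transport=udp" for host in hosts]
    return [{"urls": urls, "username": username, "credential": credential}]


if __name__ == "__main__":
    import json

    servers = build_ice_servers(["turn1.example.com", "turn2.example.com"], "user", "secret")
    print(json.dumps(servers, indent=2))
```

On the server side, coturn's configuration also has listener options such as `no-tcp` and `no-tls`; check the turnserver documentation for your version before relying on them.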

Re: load balancer -- no load balancers understand the TURN protocol, which is inherently not load balancer friendly. You are better off not attempting to funnel incoming traffic over a load balancer, but can use one for the auto scale up/down functionality.

dsatizabal commented 3 weeks ago

Yes, the video is mostly live, and indeed we're exploring using DNS SRV for load balancing.
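For reference, SRV-based selection could look roughly like the sketch below. It assumes the client or signaling service performs the SRV lookup itself (for example on a name like `_turn._udp.example.com`) and already has the records in hand. `SrvRecord` and `pick_server` are hypothetical names, and the weighted choice is a simplified version of the RFC 2782 selection rule, not a full implementation.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SrvRecord:
    priority: int  # lower value is tried first
    weight: int    # relative share within the same priority
    port: int
    target: str


def pick_server(records: List[SrvRecord], rng: Optional[random.Random] = None) -> SrvRecord:
    """Pick one record: lowest priority group, then weighted random within it."""
    if not records:
        raise ValueError("no SRV records")
    rng = rng or random.Random()
    best = min(r.priority for r in records)
    group = [r for r in records if r.priority == best]
    weights = [r.weight for r in group]
    if sum(weights) == 0:
        # All weights zero: fall back to a uniform choice.
        return rng.choice(group)
    return rng.choices(group, weights=weights, k=1)[0]


if __name__ == "__main__":
    records = [
        SrvRecord(10, 60, 3478, "turn1.example.com"),
        SrvRecord(10, 40, 3478, "turn2.example.com"),
        SrvRecord(20, 0, 3478, "backup.example.com"),
    ]
    print(pick_server(records).target)
```

Records with a higher priority value (here `backup.example.com`) are only used if the caller removes the failed lower-priority targets and picks again.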

I'll keep you posted, thanks!