finos / SymphonyMediaBridge

The Symphony Media Bridge (SMB) is a media server application that handles audio, video and screen sharing media streams in an RTC conference system.
Apache License 2.0

MEDIA-2661: ICE reconnects during meeting and fails #377

Closed RicardoMDomingues closed 1 week ago

RicardoMDomingues commented 2 weeks ago

Problem

We've had reports of users who can't hear anyone but are heard by others. After analyzing the logs, we noticed that when SMB tries to nominate a new ICE candidate, the nomination sometimes fails and leaves the ICE state at FAILED on SMB. The clients, meanwhile, hold a different ICE state, so they stay in the meeting without triggering recovery (in Symphony, recovery creates a completely new session that takes over from the old one).

In one of the cases, the client presented one IPv6 candidate and one IPv4 candidate. SMB nominated the IPv6 candidate and discarded the IPv4 PRFLX candidate after reaching the CONNECTED state. Then something happened to the IPv6 path: SMB changed the ICE state to CONNECTING, tried to select a new candidate, and failed after 15 seconds. Although no new candidate was nominated, the client logs show that WebRTC decided to send over the IPv4 candidate because IPv6 was not reachable. The IPv6 problem seems to have been a temporary glitch: the client logs show that the client continued to receive media from SMB over IPv6 after this event (even though SMB was already in the FAILED state). That worked for a few minutes until it stopped (perhaps the client firewall closed the port due to the lack of outbound traffic over IPv6), at which point the client stopped receiving media while it continued to send over IPv4.

This is caused by the following issues:

  1. SMB still responds to client ICE requests when the state is FAILED. This can mean that clients never fail ICE on their side, leaving SMB and the clients with different views of the ICE state.
  2. After the ICE state goes to CONNECTED, SMB discards good PRFLX candidates that could potentially be used as backups if something happens to the nominated one.
  3. After 15 seconds (reflexiveProbeTimeout) in the CONNECTED state, SMB stops accepting new PRFLX candidates. SMB can accept PRFLX candidates again if it goes to the CONNECTING state, but it has only 15 seconds to see them before it fails. WebRTC clients seem to detect ICE problems quickly, so they can start probing all backup candidates again before SMB goes to CONNECTING, and then spend the next 25 seconds without probing them again, causing SMB to fail before it sees the PRFLX candidates.
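
The timing race in item 3 can be checked with simple arithmetic. The sketch below is a simplified model, not SMB code: it assumes the client probes its backup candidates at a strictly fixed 25-second interval, and the function and constant names are made up for illustration. Under that assumption, if SMB enters CONNECTING less than 10 seconds after the client's last backup probe, the next probe arrives only after the 15-second window has expired.

```python
# Values taken from the description above; names are illustrative only.
REFLEXIVE_PROBE_TIMEOUT_S = 15.0       # time SMB spends in CONNECTING before it fails
CLIENT_BACKUP_PROBE_INTERVAL_S = 25.0  # assumed fixed interval between client backup probes


def smb_sees_backup_probe(client_probe_at: float, smb_connecting_at: float) -> bool:
    """Return True if some backup probe lands while SMB is in CONNECTING."""
    smb_fails_at = smb_connecting_at + REFLEXIVE_PROBE_TIMEOUT_S
    probe = client_probe_at
    # Advance through the periodic probes until one is at or after
    # the moment SMB entered CONNECTING.
    while probe < smb_connecting_at:
        probe += CLIENT_BACKUP_PROBE_INTERVAL_S
    return probe < smb_fails_at


# Client probed at t=0 s, SMB noticed the problem at t=2 s:
# the next probe is at t=25 s, but SMB fails at t=17 s.
print(smb_sees_backup_probe(client_probe_at=0.0, smb_connecting_at=2.0))
# SMB noticed at t=12 s: it fails at t=27 s, so the probe at t=25 s is seen.
print(smb_sees_backup_probe(client_probe_at=0.0, smb_connecting_at=12.0))
```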

Solution

With this change we stop responding to ICE STUN requests when the ICE state on SMB is FAILED. SMB has no mechanism to report the ICE state to the components that manage signalling with clients, nor does it have a way to restart ICE (which involves a new SDP offer-answer negotiation with a different ICE username). So the best decision SMB can take is to stop responding to ICE STUN requests, which causes the ICE state on the client to go to FAILED as well, and then clients can trigger recovery on their end.
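
As a rough illustration of this behaviour (not the actual SMB implementation, which is not reproduced here), the idea can be sketched as a guard in the STUN request handler. The class `IceSessionSketch`, its method name, and the placeholder response bytes are all hypothetical.

```python
from enum import Enum, auto
from typing import Optional


class IceState(Enum):
    CONNECTING = auto()
    CONNECTED = auto()
    FAILED = auto()


class IceSessionSketch:
    """Hypothetical stand-in for an ICE session; not the real SMB class."""

    def __init__(self) -> None:
        self.state = IceState.CONNECTING

    def on_stun_binding_request(self, transaction_id: bytes) -> Optional[bytes]:
        """Return a placeholder response, or None to stay silent."""
        if self.state is IceState.FAILED:
            # Staying silent lets the client's own connectivity checks time
            # out, so the client also reaches FAILED and can start recovery.
            return None
        # Placeholder payload; a real handler would build a STUN success response.
        return b"binding-success:" + transaction_id
```

With this guard, a session in CONNECTING or CONNECTED still answers, while a session in FAILED returns nothing, so both sides eventually agree that ICE has failed.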

Also, we no longer discard good PRFLX candidates just because they were not nominated. They are kept as backup candidates in case the nominated one becomes unreachable. This relies on the fact that the main WebRTC implementation (Google's) keeps probing the backup candidates every 25 seconds, which prevents the NAT/firewall from recycling the port while SMB has the candidate pair in the frozen state.
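
A minimal sketch of the fallback idea, assuming a simplified candidate-pair model: `CandidatePair` and `select_pair` are hypothetical names, and real selection would also weigh priority and RTT, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidatePair:
    remote: str             # illustrative label, e.g. "ipv6" or "ipv4-prflx"
    reachable: bool = True
    nominated: bool = False


def select_pair(pairs: List[CandidatePair]) -> Optional[CandidatePair]:
    """Prefer the nominated pair; otherwise fall back to a reachable backup."""
    for pair in pairs:
        if pair.nominated and pair.reachable:
            return pair
    for pair in pairs:
        if pair.reachable:
            return pair
    return None  # no usable pair left: this is where ICE would go to FAILED


pairs = [CandidatePair("ipv6", nominated=True), CandidatePair("ipv4-prflx")]
pairs[0].reachable = False       # the nominated IPv6 path breaks
print(select_pair(pairs).remote)  # the kept IPv4 backup is selected
```

Had the IPv4 PRFLX pair been discarded after CONNECTED, the list would hold only the broken IPv6 pair and the function would return `None`, which matches the failure described in the Problem section.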

We also accept candidates at any time. This is not strictly required, as the initial problem was discarding backup candidates: if we keep the backup candidates, we could stop accepting new candidates after a few seconds, because ICE gathering will complete and, according to the standard, no new candidates can be gathered. However, although it is NOT standardized, Google WebRTC allows configuring "GATHER_CONTINUALLY" (at least via the native library; I am not sure whether it is possible via the browser). GATHER_CONTINUALLY makes gathering never finish, which can be very useful for mobile phones, where it is relatively easy to start a call on mobile data and connect to a WiFi network later on, or vice versa. We are exploring ways to improve the mobile user experience in Symphony, and I am eager to enable this configuration on mobile.

Future improvements

With GATHER_CONTINUALLY enabled, another non-standardized mechanism Google implements is ICE renomination. Unlike GATHER_CONTINUALLY, which is a WebRTC peer connection configuration that seems to be accessible only through native WebRTC, ICE renomination seems to be negotiable via SDP. There is a draft for renomination, https://datatracker.ietf.org/doc/html/draft-thatcher-ice-renomination-00, and it seems very useful together with GATHER_CONTINUALLY.
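
As a starting point for that exploration, the sketch below checks whether a remote SDP advertises renomination support. It assumes the option is signalled as a `renomination` token on an `a=ice-options:` line, which is my reading of the draft and should be verified against it. The function name `offers_renomination` is hypothetical.

```python
def offers_renomination(sdp: str) -> bool:
    """Return True if any a=ice-options line lists the 'renomination' token."""
    prefix = "a=ice-options:"
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            # Options on the line are space-separated tokens.
            if "renomination" in line[len(prefix):].split():
                return True
    return False


sdp = "v=0\r\na=ice-ufrag:abcd\r\na=ice-options:trickle renomination\r\n"
print(offers_renomination(sdp))
```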

Another Google STUN attribute to explore is STUN_ATTR_GOOG_NETWORK_INFO. It gives us information about the network, and it seems possible to infer whether a candidate is on 2G/3G/4G/5G, WIFI, or ETHERNET, and even whether it is using a VPN. Together with GATHER_CONTINUALLY + ICE renomination, this attribute can be used to switch quickly to a better candidate once one is available.