Placeholder-Software / Dissonance

Unity Voice Chat Asset

What is the jitter algorithm based on? [help] #83

Closed sayangel closed 6 years ago

sayangel commented 6 years ago

We recently had a case where someone had really bad audio quality so we investigated why it was sounding so bad. After messing with some of the jitter parameters it seems like we can improve the audio quality at the expense of lag accumulation.

I'm wondering if there's a reference for the jitter algorithm that Dissonance is using, in particular for this line of code in SpeechSession.cs:

//Calculate how much we should be delayed based purely on the jitter measurement
var jitterDelay = _jitter.Jitter * 2.5f * _jitter.Confidence + InitialBufferDelay * (1 - _jitter.Confidence);

I understand the estimated delay starts at 100ms and then adjusts. What exactly does the jitterDelay variable represent? We were seeing values reach about 1.0 and this is when audio quality would get completely butchered. When we hard-limited the jitterDelay value to 0.1 (100ms) the audio stream sounded clear, but it was delayed relative to the other person talking.

Any direction to understand the jitter compensation process would be appreciated. Happy to provide samples of the bad audio via email; just send a message to angel@insitevr.com

martindevans commented 6 years ago

I typed all of this up and then noticed you're using an older version of Dissonance - the code snippet you show was changed in Sept 2017. Those changes included some fixes to playback quality in the face of terrible network conditions which seem like they could be relevant to you. I'd suggest upgrading to the latest Dissonance and seeing if that fixes the problem.

In case that does not fix the problem (or you're just curious) here's my original reply:

Jitter buffering protects against changes in latency between one packet and the next (aka jitter). If we played out a packet as soon as it arrived, the next packet would have to arrive exactly on time; any delay at all and it would be too late, because we must supply audio to the speakers on time. A jitter buffer deliberately delays packets slightly (worsens latency) to ensure there's always a packet available when needed (improves quality). In Dissonance packets are added to a buffer as soon as they arrive and a timer is started; after some delay (the _jitterDelay value) playback begins. Since playback and capture both operate at the same rate, the buffer should then keep approximately the same number of packets in it at all times, only varying due to jitter in packet delivery/playback time. We also have a mechanism in the playback system which slightly changes playback speed (within a few percent) to deliberately expand or shrink the buffer as needed. So if jitter gets worse as playback proceeds, the buffer can slowly be expanded to add more delay and tolerate the extra jitter.
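
To make the buffering idea concrete, here is a minimal C# sketch. It is not the Dissonance implementation: `SimpleJitterBuffer`, its members and the use of plain sequence numbers are hypothetical, sequence wrap-around is ignored, and the clock is passed in explicitly so the behaviour is deterministic.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not the Dissonance implementation.
public sealed class SimpleJitterBuffer
{
    private readonly SortedDictionary<uint, byte[]> _packets = new SortedDictionary<uint, byte[]>();
    private readonly double _delaySeconds;
    private double? _firstArrival;
    private uint _nextSequence;

    public SimpleJitterBuffer(double delaySeconds)
    {
        _delaySeconds = delaySeconds;
    }

    // Number of packets currently waiting to be played.
    public int Count { get { return _packets.Count; } }

    // Store a packet as soon as it arrives. The first arrival starts the playback timer.
    public void Push(uint sequence, byte[] payload, double now)
    {
        if (_firstArrival == null)
        {
            _firstArrival = now;
            _nextSequence = sequence;
        }

        // A packet behind the playback position is too late to be useful.
        if (sequence < _nextSequence)
            return;

        _packets[sequence] = payload;
    }

    // Called once per frame period by the playback side.
    // Returns false while the initial delay has not yet elapsed.
    // Once playing, a null payload means the frame is missing and the gap must be concealed.
    public bool TryPop(double now, out byte[] payload)
    {
        payload = null;
        if (_firstArrival == null || now - _firstArrival.Value < _delaySeconds)
            return false;

        if (_packets.TryGetValue(_nextSequence, out payload))
            _packets.Remove(_nextSequence);

        _nextSequence++;
        return true;
    }
}
```

A larger `delaySeconds` gives late packets more time to arrive before their slot is played, at the cost of extra latency. The sketch leaves out the playback speed adjustment entirely.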

What exactly does the jitterDelay variable represent?

On the line you highlighted:

We were seeing values reach about 1.0 and this is when audio quality would get completely butchered. When we hard-limited the jitterDelay value to 0.1 (100ms) the audio stream sounded clear, but it was delayed relative to the other person talking.

For a value of 1 second the jitter measurement must be measuring 400ms standard deviation in packet delay times (which is truly dreadful) - can you tell if this is a real network problem or if the jitter meter is completely broken?
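
The 400ms figure follows from the quoted expression once confidence reaches 1: 1.0 / 2.5 = 0.4. A small C# sketch of that arithmetic is below. The class and method names are hypothetical, all values are assumed to be in seconds, confidence is assumed to lie between 0 and 1, and the initial delay of 0.1 is taken from the "starts at 100ms" remark in the question.

```csharp
using System;

// Hypothetical helper that mirrors the quoted line from SpeechSession.cs.
public static class JitterDelayExample
{
    // Blend of the measured jitter (scaled by 2.5) and the initial delay,
    // weighted by how much the jitter measurement is trusted.
    public static float JitterDelay(float jitter, float confidence, float initialBufferDelay)
    {
        return jitter * 2.5f * confidence + initialBufferDelay * (1 - confidence);
    }

    // Inverse for the fully confident case (confidence == 1):
    // the measured jitter that would produce a given delay.
    public static float JitterForDelay(float delay)
    {
        return delay / 2.5f;
    }
}
```

With these assumptions, `JitterDelay(0.4f, 1f, 0.1f)` is approximately 1.0, and `JitterDelay(anything, 0f, 0.1f)` is just the initial 0.1. In the other direction, `JitterForDelay(0.1f)` is about 0.04, so capping the delay at 100ms only covers roughly 40ms of measured jitter at full confidence.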

Were you getting any other warnings when playback was bad? For example, if the delay gets very large there's a message "Encoded audio heap is getting very large (N items)" printed out (threshold is set at 40+ items, or about 1.6 seconds of delay at the default settings). If you could send me some examples (martin@placeholder-software.co.uk) of bad audio that'd be handy. If you can get a log that'd be fantastic.

martindevans commented 6 years ago

I had a listen to the audio sample you sent me. It sounds exactly like I would expect very bad network conditions to sound - packets are being dropped/lost and packet loss concealment is making up something to fill the gap (that's why it sounds muddy). There are no audio glitches (pops/clicks) which would indicate a non-network related audio problem.

First thing to try is definitely to upgrade to the latest Dissonance version to see if those changes I mentioned mitigate the issue.

sayangel commented 6 years ago

I updated to 6.0.2 from the asset store and I'm still seeing bad quality under stressed network conditions. For reference I'm using clumsy (https://jagt.github.io/clumsy/) to simulate network conditions.

martindevans commented 6 years ago

We use clumsy to test networking too. What settings are you using? I'll see if I can reproduce the problem.

For reference:

martindevans commented 6 years ago

If you're still interested in this issue, check out this other issue: https://github.com/Placeholder-Software/Dissonance/issues/87#issuecomment-378182718

It turns out the server has been relaying unreliable packets (i.e. voice) with a reliable connection. This will make Dissonance worse at handling lost packets because subsequent packets will be delayed while the network re-sends the lost packet, causing a whole bunch of packets to be lost because when they eventually arrive they're all too late. Try changing the parameter on line 427 of BaseClient.cs from true to false:

writer.WriteRelay(_serverNegotiator.SessionId, destinations, packet, false);
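
To illustrate why a reliable channel hurts here, below is a rough C# model of a single lost packet. Everything in it is hypothetical: `RelayModel`, its parameters and the timings are made up for illustration and are not measured from Dissonance. It also assumes a fixed playback schedule and ignores any buffer adaptation.

```csharp
using System;

// Hypothetical model; names and numbers are illustrative only.
public static class RelayModel
{
    // Counts packets that miss their playback deadline when exactly one packet
    // (lostIndex) is dropped on its first transmission.
    public static int CountUnplayable(
        bool reliableOrdered, int packetCount, int lostIndex,
        double frameSeconds, double latencySeconds,
        double resendSeconds, double bufferSeconds)
    {
        int unplayable = 0;
        double previousDelivery = 0;

        for (int i = 0; i < packetCount; i++)
        {
            double sent = i * frameSeconds;
            double deadline = sent + latencySeconds + bufferSeconds;
            double delivered;

            if (reliableOrdered)
            {
                // The lost packet is re-sent, and in-order delivery holds
                // every later packet back until it has arrived.
                double arrival = sent + latencySeconds + (i == lostIndex ? resendSeconds : 0);
                delivered = Math.Max(arrival, previousDelivery);
                previousDelivery = delivered;
            }
            else
            {
                // The lost packet never arrives; the others are unaffected.
                delivered = i == lostIndex ? double.PositiveInfinity : sent + latencySeconds;
            }

            if (delivered > deadline)
                unplayable++;
        }

        return unplayable;
    }
}
```

With 20 packets sent 40ms apart, packet 5 lost, 50ms latency, a 250ms resend delay and 100ms of buffering, the reliable case leaves 4 packets unplayable while the unreliable case loses only the 1 dropped packet.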

sayangel commented 6 years ago

Woah - that makes a lot of sense as to why we'd see that behavior. Thanks for the update!

sayangel commented 6 years ago

Hey Martin - it didn't make too much of a difference. At least nothing night and day.

-Angel

martindevans commented 6 years ago

What kind of clumsy settings are you using?

martindevans commented 6 years ago

Tom just merged a PR of mine which may help with this issue - I've enabled Forward Error Correction (FEC) with the Opus codec. When elevated packet loss rates are detected, this encodes a low quality copy of the previous frame into each packet. If the decoder comes to decode a frame and the current packet isn't here yet, it uses the next packet (if available) to extract the low quality version of the frame instead; otherwise it falls back to Packet Loss Concealment (PLC).
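
The fallback order described above can be sketched in C# as follows. This only models the decision, not the decoding itself: `DecodeSource`, `FecFallback` and `Choose` are hypothetical names, no Opus API is called, and the sketch assumes every packet carries a low quality copy of the previous frame.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names; this models only which source the decoder would use.
public enum DecodeSource
{
    Normal,
    ForwardErrorCorrection,
    PacketLossConcealment
}

public static class FecFallback
{
    // arrived: sequence numbers of the packets currently available to the decoder.
    // frame: sequence number of the frame that must be decoded now.
    public static DecodeSource Choose(ISet<int> arrived, int frame)
    {
        // The packet for this frame is here: decode it as usual.
        if (arrived.Contains(frame))
            return DecodeSource.Normal;

        // The next packet is here: use its embedded copy of this frame.
        if (arrived.Contains(frame + 1))
            return DecodeSource.ForwardErrorCorrection;

        // Nothing usable has arrived: synthesise audio to fill the gap.
        return DecodeSource.PacketLossConcealment;
    }
}
```

Note that FEC in this form can only recover a frame when the packet immediately after it arrives in time, so losing two packets in a row still leaves the first of them to PLC.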

Testing with the built in packet loss simulation (Window > Dissonance > Diagnostics) I could understand speech up to about 30% packet loss rates which is absolutely incredible! This isn't a perfect test, since packets are rarely lost on a purely random basis, but it's a significant improvement on before.

This will be available in the next release of Dissonance :)

martindevans commented 6 years ago

Dissonance 6.2.0 has just been submitted to the asset store. This includes the FEC fix I mentioned above. It should be available in a few days once the asset store team reviews it :)

martindevans commented 6 years ago

Dissonance 6.2.0 is now available on the asset store so I'll close this issue now. Don't hesitate to re-open it if there's still a problem :)