[bug] SendQueue.cs _listPool.Get() sometimes contains a null value

maasaimosh commented 5 years ago

Context

Long running voip session ( multiple hours ) gets NRE in SendQueue.cs

Expected Behavior

Long running voip session ( multiple hours ) does not produce exceptions

Actual Behavior

NullReferenceException in SendQueue

Example Stacktrace:

<i>AndroidPlayer(Amazon_XXXXI@192.168.xxx.xxx)</i> [Dissonance:Recording] (22:40:46.138) BasePreprocessingPipeline: Microphone subscriber 'EncoderPipeline' threw: System.NullReferenceException: Object reference not set to an instance of an object
  at Dissonance.Networking.Client.SendQueue`1[XxxxPeerInfo].EnqueueP2P (UInt16 localId, ICollection`1 destinations, ICollection`1 queue, ArraySegment`1 packet) [0x00057] in PROJECT_PATH:\Assets\Plugins\Dissonance\Core\Networking\Client\SendQueue.cs:169
  at Dissonance.Networking.Client.SendQueue`1[XxxxPeerInfo].EnqueueUnreliableP2P (UInt16 localId, IList`1 destinations, ArraySegment`1 packet) [0x00033] in PROJECT_PATH:\Assets\Plugins\Dissonance\Core\Networking\Client\SendQueue.cs:154
  at Dissonance.Networking.Client.VoiceSender`1[XxxxPeerInfo].Send (ArraySegment`1 encodedAudio) [0x00190] in PROJECT_PATH:\Assets\Plugins\Dissonance\Core\Networking\Client\VoiceSender.cs:325
  at Dissonance.Networking.BaseClient`3[XxxxxServer,XxxxxClient,XxxxPeerInfo].SendVoiceData (ArraySegment`1 encodedAudio) [0x00008] in PROJECT_PATH:\Assets\Plugins\Dissonance\Core\Networking\BaseClient.cs:229
  at Dissonance.Networking.BaseCommsNetwork`5[XxxxxServer,XxxxxClient,XxxxPeerInfo,Dissonance.Unit,Dissonance.Unit].SendVoice (ArraySegment`1 data) [0x00021] in PROJECT_PATH:\Assets\Plugins\Dissonance\Core\Networking\BaseCommsNetwork.cs:486
  at Dissonance.Audio.Capture.EncoderPipeline.EncodeFrames (IVoiceEncoder encoder, Int32 maxCount) [0x00037] in PROJECT_PATH:\Assets\Plugins\Dissonance\Core\Audio\Capture\EncoderPipeline.cs:130
  at Dissonance.Audio.Capture.EncoderPipeline.ReceiveMicrophoneData (ArraySegment`1 inputSamples, NAudio.Wave.WaveFormat format) [0x000d1] in PROJECT_PATH:\Assets\Plugins\Dissonance\Core\Audio\Capture\EncoderPipeline.cs:105
  at Dissonance.Audio.Capture.BasePreprocessingPipeline.SendSamplesToSubscribers (System.Single[] buffer) [0x000a0] in PROJECT_PATH:\Assets\Plugins\Dissonance\Core\Audio\Capture\BasePreprocessingPipeline.cs:330

Source code line:

        //Copy destinations into a new list we're allowed to mutate
        var dests = _listPool.Get();    // line 168... _listPool.Get() returns null
        dests.Clear();             //<---- dests is null :(

In one long testrun, this error occurred 4 times.

Note the timestamps: 22:40:46.138 , 23:05:35.710 , 23:34:50.739 , 00:39:42.481

<i>AndroidPlayer(Amazon_XXXX@192.168.xxx.xxx)</i> [Dissonance:Recording] (22:40:46.138) BasePreprocessingPipeline: Microphone subscriber 'EncoderPipeline' threw: System.NullReferenceException: Object reference not set to an instance of an object
<i>AndroidPlayer(Amazon_XXXX@192.168.xxx.xxx)</i> [Dissonance:Recording] (23:05:35.710) BasePreprocessingPipeline: Microphone subscriber 'EncoderPipeline' threw: System.NullReferenceException: Object reference not set to an instance of an object
<i>AndroidPlayer(Amazon_XXXX@192.168.xxx.xxx)</i> [Dissonance:Recording] (23:34:50.739) BasePreprocessingPipeline: Microphone subscriber 'EncoderPipeline' threw: System.NullReferenceException: Object reference not set to an instance of an object
<i>AndroidPlayer(Amazon_XXXX@192.168.xxx.xxx)</i> [Dissonance:Recording] (00:39:42.481) BasePreprocessingPipeline: Microphone subscriber 'EncoderPipeline' threw: System.NullReferenceException: Object reference not set to an instance of an object

After the 4th error, there developed a follow on problem:

Error ID: 59EE0102-FF75-467A-A50D-00BF670E9B8C

<i>AndroidPlayer(Amazon_XXXX@192.168.xxx.xxx)</i> [Dissonance:Playback] (00:45:29.407) EncodedAudioBuffer: Voice Error: Encoded audio heap is getting very large (40 items)! Error ID: 59EE0102-FF75-467A-A50D-00BF670E9B8C
---
<i>AndroidPlayer(Amazon_XXXX@192.168.xxx.xxx)</i> [Dissonance:Playback] (00:45:29.749) EncodedAudioBuffer: Voice Error: Encoded audio heap is getting very large (50 items)! Error ID: 59EE0102-FF75-467A-A50D-00BF670E9B8C
---
<i>AndroidPlayer(Amazon_XXXX@192.168.xxx.xxx)</i> [Dissonance:Playback] (00:45:29.927) SpeechSession: Beginning playback with very large network jitter: 0.9286258s 1confidence
---
<i>AndroidPlayer(Amazon_XXXX@192.168.xxx.xxx)</i> [Dissonance:Playback] (00:45:31.445) EncodedAudioBuffer: Voice Error: Encoded audio heap is getting very large (60 items)! Error ID: 59EE0102-FF75-467A-A50D-00BF670E9B8C
---
<i>AndroidPlayer(Amazon_XXXX@192.168.xxx.xxx)</i> [Dissonance:Playback] (00:45:31.905) EncodedAudioBuffer: Voice Error: Encoded audio heap is getting very large (70 items)! Error ID: 59EE0102-FF75-467A-A50D-00BF670E9B8C
---
<i>AndroidPlayer(Amazon_XXXX@192.168.xxx.xxx)</i> [Dissonance:Playback] (00:45:32.323) EncodedAudioBuffer: Voice Error: Encoded audio heap is getting very large (80 items)! Error ID: 59EE0102-FF75-467A-A50D-00BF670E9B8C
---
<i>AndroidPlayer(Amazon_XXXX@192.168.xxx.xxx)</i> [Dissonance:Playback] (00:45:35.109) EncodedAudioBuffer: Voice Error: Encoded audio heap is getting very large (80 items)! Error ID: 59EE0102-FF75-467A-A50D-00BF670E9B8C

The warning: EncodedAudioBuffer: Voice Error: Encoded audio heap is getting very large (80 items)! Error ID: 59EE0102-FF75-467A-A50D-00BF670E9B8C

then just kept being repeated and repeated in the logs

Workaround

None found

Steps to Reproduce

Provide a detailed set of steps to reproduce the problem

Have an android voip application connected to a PC host
Automate the sending of audio over the session both directions
Leave the session going for a few hours
Experience the problem every so often

Your Environment

Amazon fire tablet 7 with Dissonance 6.4.2 Home written networking stack that works PC application to run the Dissonance server also with the homewritten stack

windows 10 Unity 2017.4.28

Dissonance version used: e.g. v1.0.0 (check Assets/Dissonance/changelog)

6.4.2

Unity version: e.g. 5.3.5f1 Personal (check Help > About Unity)

2017.4.28

Editor Operating System and version: e.g. Microsoft Windows 10 Pro 10.0.10586 (x64) (Depends on your OS, for windows it's in the "System Information" program)

Windows 10 x64

Build Settings: Tell us the build settings/platforms you have encountered this issue with e.g. Windows, x86_64, Development Build

Development build deployed to Android tablet, then consol logs captured via usb cable into Unity.

martindevans commented 5 years ago

Does your custom networking system use multiple threads? This is one of those cases where the return value can't possibly be null, so I suspect multithreading corruption of the data structure.

If you do use multithreading, make sure you're not calling any Dissonance methods (e.g. delivering packets) off the main thread.

maasaimosh commented 5 years ago

It does use exactly one thread. I stopped using the ForgeRemastered networking stack when I encountered the other multi-threading problems a few weeks ago ( #162 ]

However to diagnose, I will take the thread tracking code I put in to debug #162 to this codepath too.

Also as mentioned, this is running on an Android tablet which I have less experience with.

I will post the results here.

martindevans commented 5 years ago

The FNR multithreading bugs have been fixed, so if that was the only reason you moved off FNR you could migrate back to that now.

The code you put in place to monitor for multiple threads last time was good, adding it back in is a good idea. If that does get hit my multiple threads try capturing a stacktrace from each unique thread and seeing where they're coming from.

The fact that this is so incredibly rare (requiring several hours in a system handling 40-60 packets per second) makes me think there might be something weirder going on here, although I can't think what.

maasaimosh commented 5 years ago

Okay... the first rush is in...

BasePreprocessingPipeline is created many times, creating many threads. This is on the Dissonance side.

I think it's something microphone related. I get some restarting messages to do with the microphone and the code trace leads to the suspect CapturePipelineManager.RestartTransmissionPipeline()

The way you can verify this is change the recording device in the Basic Microphone Capture Script

I put in some testing logging.

   protected BasePreprocessingPipeline([NotNull] WaveFormat inputFormat, int intermediateFrameSize, int intermediateSampleRate, int outputFrameSize, int outputSampleRate)
    {
        Log.Error("BasePreprocessingPipeline.cs constructor called. this should happen once");

and

    private void ThreadEntry()
    {
        try
        {
            Log.Error( "BasePreprocessingPipeline.ThreadEntry() start called. this should happen once" );

martindevans commented 5 years ago

Do you have multiple preprocessing pipelines active at once? That shouldn't happen. Could you send me a log, I'll see what those messages are and what the root cause of multiple preprocessors being created is.

However, I don't think that can cause this issue. The voice sending is designed to handle multithreading - voice preprocessing/encoding runs on a background thread and then the packet is put onto a queue and sent by the network system next frame (on the main thread). If you trace through to Assets/Plugins/Dissonance/Core/Networking/Client/VoiceSender line 300 you'll see that there's a lock protecting the critical datastructures. At the very least this means that two preprocessors that exist at the same time could not access that section at the same time.

maasaimosh commented 5 years ago

I've had a further look this evening after work. New threads are spawned all the time. And your threads are definitely clashing.

1/2 ) Pool.Get() needs protection

The BasePreprocessingPipeline.ThreadEntry() loop only checks _runThread at the start of the while loop. So old threads are still somewhere in the innards of the ThreadEntry while loop when a new Thread is started and catches up.

The SendQueuePool._listPool.Get() call is not covered by a lock. So the code

        /// Get an item from this pool
        public T Get()
        {
            if (_items.Count > 0)     // Thread A and B both read a 1
                return _items.Pop();  // one of Thread A or Thread B get the last entry... Thread Other returns null

The Count and the Pop should be protected in a lock( _items ) group

            public T Get()
            {
                lock( _items )
                {
                    if (_items.Count > 0)
                        return _items.Pop();

                    return _factory();
                }
            }

and also Put() for safety ( less important ).

2/2 ) Why so many threads ? Just an implementation question ?

I guess I've got to ask why so many threads are being spawned as well.

During my testing this is happening all the time. It seems a frame skip will 'force capture pipeline reset'. This makes a new thread.

Here is a transcript where everything blows up after a frame skip and new thread:

<i>AndroidPlayer(Amazon_XXXXX@192.168.xxx.xxx)</i> [Dissonance:Recording] (20:38:19.638) CapturePipelineManager: Detected a frame skip, forcing capture pipeline reset (Delta Time:0.1673068)
<i>AndroidPlayer(Amazon_XXXXX@192.168.xxx.xxx)</i> [Dissonance:Recording] (20:38:19.824) BasicMicrophoneCapture: Began mic capture (SampleRate:44100Hz, FrameSize:882, Buffer Limit:2^13, Latency:20ms, Device:'')
<i>AndroidPlayer(Amazon_XXXXX@192.168.xxx.xxx)</i> [Dissonance:Recording] (20:38:19.848) BasePreprocessingPipeline: BasePreprocessingPipeline.cs constructor called. this should happen once
<i>AndroidPlayer(Amazon_XXXXX@192.168.xxx.xxx)</i> [Dissonance:Recording] (20:38:19.818) BasePreprocessingPipeline: BasePreprocessingPipeline.ThreadEntry() ended: 7
<i>AndroidPlayer(Amazon_XXXXX@192.168.xxx.xxx)</i> [Dissonance:Recording] (20:38:19.871) BasePreprocessingPipeline: BasePreprocessingPipeline.ThreadEntry() started: 8
[Dissonance:Playback] (20:38:20.431) EncodedAudioBuffer: Error: Encoded audio heap is getting very large (50 items)! This is probably a bug in Dissonance, we're sorry! Please report the bug on the issue tracker "https://github.com/Placeholder-Software/Dissonance/issues". You could also seek help on the community at "http://placeholder-software.co.uk/dissonance/community" to get help for a temporary workaround. Error ID: 59EE0102-FF75-467A-A50D-00BF670E9B8C
<i>AndroidPlayer(Amazon_XXXXX@192.168.xxx.xxx)</i> [Dissonance:Network] (20:38:20.202) SendQueue`1: ServerRelay.ProcessPacketRelay(): New thread id encountered: 8
<i>AndroidPlayer(Amazon_XXXXX@192.168.xxx.xxx)</i> [Dissonance:Network] (20:38:20.204) SendQueue`1: SendQueue.EnqueueP2P(): Switch thread id [ 7 ] to [ 8 ]
Stack Trace:
   at Dissonance.Networking.Client.SendQueue`1[[PeerInfo].EnqueueP2P(UInt16 localId, ICollection`1 destinations, ICollection`1 queue, ArraySegment`1 packet)

martindevans commented 5 years ago

1/1

The synchronisation happens at a higher level than you're looking in the CapturePipelineManager (Assets/Plugins/Dissonance/Core/Audio/Capture/CapturePipelineManager.cs) which is responsible for managing pipelines and their life time. When a full reset happens it stops the current capture pipeline, joins on the thread and then starts a new pipeline (and a new thread).

That's what I was saying before about how there shouldn't be two threads running in this code at the same time - have you actually caught two preprocessor instances running concurrently? If so then that's definitely a bug that needs fixing!

Edit: As a quick fix, you could try swapping out the _listPool from a Pool<T> to a ConcurrentPool<T>.

2/2

In a perfect situation Dissonance spawns just one thread to run preprocessing/encoding.

However, the Unity microphone API requires that we capture mic input on the main thread so the mic input is quite sensitive to long frames - too slow and the audio system starves for input. The frame skip detector is a last ditch attempt to recover from overwhelmingly bad frame times (e.g. a single frame took 167ms in that log sample), it tears down the entire pipeline and recreates it in a fresh state (which spawns a new thread).

Of course this should happen quite rarely, do you have a lot of these frame skip messages?

Edit: If this is the cause then this would probably explain why it happens so rarely (resets happen relatively rarely compared to how often a single voice packet is sent).

martindevans commented 5 years ago

As a precaution I've rewritten SendQueue.cs (Assets/Plugins/Dissonance/Core/Networking/Client/SendQueue.cs) to use locked values around all of the non-threadsafe structures, this shouldn't be needed but I thought you may find it useful to try this. If it solves your issue then I'll at least have an idea where to look for the root problem (and you'll be able to get on with building your app!).

https://gist.github.com/martindevans/25783517dba4ab2d31283be96c0d0e6b

martindevans commented 5 years ago

With Dissonance 6.4.3 I hardened the networking system against multithreading bugs, hopefully this may mitigate your issue :)

maasaimosh commented 5 years ago

Okay... I will upgrade and evaluate

maasaimosh commented 5 years ago

I have left the 6.4.3 codebase running now for a long time. This problem has not re-occurred, so I reckon it's closed.

Thanks for the changes !

martindevans commented 5 years ago

Fantastic, thanks for confirming that :)

Placeholder-Software / Dissonance