SSR-BRS crashes after varying amount of time

janhenhan commented 1 year ago

Hi all,

Great to see you guys are still going strong developing SSR after over a decade. Congrats on the 0.6 release!

Recently, I've increased the number of network messages I send to ssr-brs (for example, 20 sources that each receive attribute updates at a 100 Hz rate). Unfortunately, that came with a big decrease in the stability of the SSR.

I am experiencing unexpected crashes after varying amounts of time: sometimes it runs fine for hours, other times only for minutes. At first I thought this might be the fault of the older FUDI interface (there are some open issues here describing similar crashes with the older network interface), so I switched to the more recent websocket interface. Unfortunately, the crashes happen there as well. The messages I send all seem to contain values within a valid range, i.e. as far as I can tell there is no particular message that crashes ssr-brs.

I attached lldb to the process, but the messages mean very little to me. Most of the time it is a bad access in the cleanup:

    Process 45934 stopped
    thread #13, stop reason = EXC_BAD_ACCESS (code=1, address=0xbeadde8ca818)
        frame #0: 0x000000010000d2e4 ssr-brs`apf::CommandQueue::push(apf::CommandQueue::Command) [inlined] apf::CommandQueue::_cleanup(this=0x0000000100110588, cmd=0x000060000021a800) at commandqueue.h:173:12 [opt]
       170    void _cleanup(Command cmd)
       171    {
       172      assert(cmd != nullptr);
    -> 173      cmd->cleanup();
       174      delete cmd;
       175    }
       176
    Target 0: (ssr-brs) stopped.

Any thoughts on what this means or how I could prevent it, to get ssr-brs to a more robust state again? These bad_accesses happen somewhere in APF? Any other logs that would help? I'm on a M1 Mac.

Many thanks!

mgeier commented 1 year ago

Thanks for the report!

This sounds like a nasty bug, I hope we can find the cause and fix it.

It kinda sounds like a use-after-free bug where the cmd pointer is accessed after it has been freed somewhere else. However, it is freed literally in the next line, and not somewhere else ...

Smells a bit like undefined behavior ...

> These bad_accesses happen somewhere in APF?

Well, yes, the CommandQueue is used to send messages from the control thread to the audio thread (and back). It might be a problem in the APF, but not necessarily.
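To illustrate the round trip described above, here is a minimal single-threaded sketch of the command lifecycle. It is not the actual APF code: `SketchCommandQueue`, `LogCommand` and `demo_lifecycle` are made-up names, and `std::deque` only stands in for the lock-free FIFOs a real implementation would need between two threads. The one detail taken from the backtrace is that `push()` also reclaims commands that have come back from the audio thread.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical stand-in for a command object; not the actual APF class.
struct Command
{
  virtual ~Command() = default;
  virtual void execute() = 0;  // meant to run in the audio thread
  virtual void cleanup() = 0;  // meant to run in the control thread
};

// Single-threaded illustration only: std::deque is NOT safe to share
// between threads. A real queue would use lock-free FIFOs here.
class SketchCommandQueue
{
public:
  // Control thread: reclaim finished commands, then enqueue the new one.
  void push(Command* cmd)
  {
    cleanup_returned();
    _to_audio.push_back(cmd);
  }

  // Audio thread: execute pending commands and hand them back.
  void process_commands()
  {
    while (!_to_audio.empty())
    {
      Command* cmd = _to_audio.front();
      _to_audio.pop_front();
      cmd->execute();
      _to_control.push_back(cmd);
    }
  }

  // Control thread: run cleanup() and free every returned command.
  void cleanup_returned()
  {
    while (!_to_control.empty())
    {
      Command* cmd = _to_control.front();
      _to_control.pop_front();
      assert(cmd != nullptr);
      cmd->cleanup();  // the reported crash happens on a call like this
      delete cmd;
    }
  }

private:
  std::deque<Command*> _to_audio;    // control thread -> audio thread
  std::deque<Command*> _to_control;  // audio thread -> control thread
};

// Hypothetical command that records which of its hooks were called.
struct LogCommand : Command
{
  explicit LogCommand(std::string& log) : _log(log) {}
  void execute() override { _log += "execute;"; }
  void cleanup() override { _log += "cleanup;"; }
  std::string& _log;
};

// Usage example: one command travels to the "audio thread" and back.
std::string demo_lifecycle()
{
  std::string log;
  SketchCommandQueue queue;
  queue.push(new LogCommand(log));  // control thread
  queue.process_commands();         // audio thread
  queue.cleanup_returned();         // control thread again
  return log;
}
```

The point of the sketch is that the pointer dereferenced in `cmd->cleanup()` has passed through both queues, so a bad pointer at that line can originate in either transfer and not only in the cleanup code itself.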

> Any other logs that would help?

I don't know. It seems the problem happens when calling the cleanup() function, but before this function is actually executed.

> I'm on a M1 Mac.

That's a good hint. I have the feeling that our ring buffer implementation might not be correct on ARM processors.

Are you running the SSR natively or via Rosetta?

The first thing I would try is to use atomics in our ring buffer and see if that changes anything. Currently, I don't have a lot of time, but maybe I can try a few things next week.
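As a rough idea of what "using atomics" could look like, here is a sketch of a single-producer/single-consumer ring buffer with acquire/release ordering. This is not the APF ring buffer: `SpscRing` and `demo_ring` are hypothetical names, and the fixed-size, one-slot-empty design is just one simple choice. ARM permits more reordering of plain loads and stores than x86 does, so without the release/acquire pairing a consumer could see an updated index before the data written to that slot. Whether that is what causes the crash reported here is still only a guess.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical single-producer/single-consumer ring buffer; not APF's
// actual implementation. One slot is always left empty so that "full"
// can be told apart from "empty".
template<typename T, std::size_t N>
class SpscRing
{
public:
  // Producer thread only.
  bool push(const T& item)
  {
    const std::size_t w = _write.load(std::memory_order_relaxed);
    const std::size_t next = (w + 1) % N;
    if (next == _read.load(std::memory_order_acquire)) return false;  // full
    _data[w] = item;
    // Release: the item written above is visible before the new index is.
    _write.store(next, std::memory_order_release);
    return true;
  }

  // Consumer thread only.
  bool pop(T& item)
  {
    const std::size_t r = _read.load(std::memory_order_relaxed);
    // Acquire: pairs with the release store in push().
    if (r == _write.load(std::memory_order_acquire)) return false;  // empty
    item = _data[r];
    // Release: the slot is handed back only after it has been read.
    _read.store((r + 1) % N, std::memory_order_release);
    return true;
  }

private:
  T _data[N];
  std::atomic<std::size_t> _write{0};
  std::atomic<std::size_t> _read{0};
};

// Usage example: pass integers between two threads and check the order.
bool demo_ring()
{
  SpscRing<int, 64> ring;
  const int count = 10000;
  bool in_order = true;
  std::thread consumer([&] {
    for (int expected = 0; expected < count; )
    {
      int value = 0;
      if (ring.pop(value))
      {
        if (value != expected) in_order = false;
        ++expected;
      }
    }
  });
  for (int i = 0; i < count; )
  {
    if (ring.push(i)) ++i;  // retry while the buffer is full
  }
  consumer.join();  // join makes the consumer's write to in_order visible
  return in_order;
}
```

Building a variant like this with Clang's `-fsanitize=thread` and running it under load might also help to tell a data race apart from a plain logic error.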

janhenhan commented 1 year ago

Thanks Matthias! It would be really great if you can find the time to have a look at some point :)

> Are you running the SSR natively or via Rosetta?

I'm running a native M1 ARM build.

For what it's worth, one perhaps questionable observation: SSR seems to crash much sooner when I start it as a subprocess from Python than when I run it under the debugger and wait for the crash. But that may just be subjective, or within the very wide range of times it runs before crashing.

SoundScapeRenderer / ssr

SSR-BRS crashes after varying amount of time #371