SoundScapeRenderer / ssr

Main source code repository for the SoundScape Renderer
http://spatialaudio.net/ssr/
GNU General Public License v3.0

implementation of OPSI renderer #82

rowalz opened this issue 7 years ago

rowalz commented 7 years ago

For a comparison and evaluation of different rendering algorithms I wanted to add an OPSI renderer (Wittek 2002) to SSR.

The idea is to base it on the WFS renderer (for a start I copied the complete wfsrenderer.h) and to filter the resulting loudspeaker signals with a low-pass biquad (using the apf::Biquad class). The two speakers which are located on the connecting line between virtual source and reference point do not get filtered but are panned/weighted with an algorithm similar to VBAP (I just took some useful parts from vbaprenderer.h).

Because the filtering is applied after the rendering process, I tried to implement it in the OpsiRenderer::RenderFunction class, in sample_type operator()(sample_type in). I am not completely sure if that makes sense. However, it works and I get the low-pass filtered signal on all of the speakers except the ones described above. But there are some very ugly and annoying artefacts like dropouts (up to a few seconds) and crackling. It seems to me that the artefacts appear at offsets of n times the buffer size (1024 samples), but I do not really understand where the buffer is applied.
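Roughly, what I have in mind for one output channel looks like this (just a sketch with made-up names; a one-pole filter stands in for the biquad):

```cpp
// Sketch only -- not the actual SSR/APF classes. The two "panning" speakers
// just get a VBAP-like weight, all other speakers are low-pass filtered.
// The one-pole filter is only a stand-in for the biquad; values are made up.
struct OpsiPerSampleSketch
{
  bool is_panning_speaker = false;  // true for the two speakers on the line
                                    // between virtual source and reference point
  float vbap_gain = 1.0f;           // weight from the VBAP-like panning step
  float lp_coeff = 0.1f;            // placeholder low-pass coefficient
  float lp_state = 0.0f;            // filter state, must survive across buffers

  float operator()(float in)
  {
    if (is_panning_speaker) return vbap_gain * in;
    lp_state += lp_coeff * (in - lp_state);  // simple one-pole low-pass
    return lp_state;
  }
};
```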

I could supply the code. Is there a straightforward method to do this or should I just upload it here?

mgeier commented 7 years ago

I could supply the code. Is there a straightforward method to do this or should I just upload it here?

The canonical method would be: fork the repository on GitHub, push your changes to a branch in your fork and post the link here.

rowalz commented 7 years ago

Thanks for the assistance. Here is the link to my fork with branches dbaprenderer and opsirenderer: https://github.com/rowalz/ssr. opsirenderer is the one I was talking about.

rowalz commented 7 years ago

Here is the crackling sound (with a 1 kHz sine). I suppose it originates from the first few samples in every buffer, which do not have the samples x[n-1] (which is called in_old_1), x[n-2], y[n-1] and y[n-2] (out_old_2) available, but only x[n] (in) and y[n] (out_current). I tried to fix it by moving the definition of these variables, but everything I've done so far either led to a segmentation fault or to compiler errors. (Screenshot: missing_samples)
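For illustration, this is the structure I think I need (a sketch with my variable names, not the actual code): the four state variables have to live in an object that survives from one buffer to the next, otherwise they restart at zero at every buffer boundary and produce exactly this kind of click.

```cpp
// Sketch, not the actual code: a direct form I biquad whose state survives
// across buffers. If in_old_1/in_old_2/out_old_1/out_old_2 were local to the
// per-buffer function, they would restart at zero every 1024 samples and
// cause a discontinuity at each buffer boundary.
#include <vector>

struct BlockLowpass
{
  float b0 = 0, b1 = 0, b2 = 0, a1 = 0, a2 = 0;  // fixed coefficients
  float in_old_1 = 0, in_old_2 = 0;    // x[n-1], x[n-2] from the previous buffer
  float out_old_1 = 0, out_old_2 = 0;  // y[n-1], y[n-2] from the previous buffer

  void process(std::vector<float>& buffer)
  {
    for (auto& x : buffer)
    {
      float y = b0 * x + b1 * in_old_1 + b2 * in_old_2
                       - a1 * out_old_1 - a2 * out_old_2;
      in_old_2 = in_old_1;   in_old_1 = x;
      out_old_2 = out_old_1; out_old_1 = y;
      x = y;
    }
    // the state is deliberately not reset here
  }
};
```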

And here is an example of what I called dropouts. It is not really a dropout (not anymore at least), but a periodic difference in amplitude with the period of the buffer size (every n times 1024 samples). It somehow originates from the panning algorithm (currently commented out). (Screenshot: dropouts)

JensAhrens commented 7 years ago

A few conceptual comments from my side (sorry, I haven't looked into the code...):

It will probably be more efficient to do the band splitting (highs and lows) before the rendering, as there will be fewer channels to filter. But that's not critical, it should work either way. My spontaneous approach would be to split the signal and then have a VBAP and a WFS renderer in parallel, the outputs of which I would connect to the loudspeakers.

The tricky part is that you need to make sure that both the WFS part and the VBAP-like part cause exactly the same delay of the signal. The VBAP renderer causes no algorithmic delay; it only applies gains to the individual loudspeaker signals (this costs one audio frame in terms of time delay).

The WFS part is trickier. Firstly, it applies what we termed predelay (http://ssr.readthedocs.io/en/latest/renderers.html?highlight=predelay#wave-field-synthesis-renderer). This is a delay that is applied to everything by default. It is necessary because the rendering of focused sources (virtual sound sources in front of the loudspeakers) requires "anticipation" of the signal (i.e. a negative delay). The predelay provides the headroom for that. It can be adjusted as explained in the docs.

Secondly, WFS inherently requires delaying. If I remember correctly, the current implementation of the WFS renderer even takes the propagation time of sound in the virtual space into account. I.e., if a virtual source is 10 m behind the loudspeakers, SSR will apply a delay equivalent to the time sound takes to travel 10 m. We have been wanting to provide the option to remove that. In other words, one could subtract the shortest delay that occurs for a specific virtual sound source from all driving signals of that source, so that there will always be a loudspeaker with no delay, just like in VBAP. I once started to implement this, but I hadn't understood the multi-threading architecture well enough to come up with a proper solution, and I gave up.

Finally, the WFS prefilter adds a bit of delay. By default, SSR uses a linear-phase FIR filter, the delay caused by which is half its length. The default filters are 128 taps long, I think. So the delay caused by them will be 64 samples (actually, it will be 64.5 if I interpret it correctly).
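Just to make the bookkeeping explicit, the delay of the WFS path that the VBAP-like part would have to mimic is roughly the following (names are hypothetical, not SSR parameters):

```cpp
// Back-of-the-envelope sketch (hypothetical names, not SSR parameters):
// the delay that the WFS path introduces and that the VBAP-like part would
// have to mimic to stay aligned.
#include <cmath>
#include <cstddef>

std::size_t wfs_path_delay_samples(double predelay_seconds,
                                   double virtual_source_distance_m,
                                   std::size_t prefilter_taps,
                                   double sample_rate,
                                   double speed_of_sound = 343.0)
{
  double predelay    = predelay_seconds * sample_rate;
  double propagation = virtual_source_distance_m / speed_of_sound * sample_rate;
  double prefilter   = prefilter_taps / 2.0;  // linear-phase FIR, about half its length
  return static_cast<std::size_t>(std::lround(predelay + propagation + prefilter));
}
```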

mgeier commented 7 years ago

First of all, sorry for the horrible API. I'm also having a hard time understanding it, even though I came up with it myself.

I had a quick look at your code, but I think it makes more sense if I talk about the existing WFS and VBAP renderers before commenting on your code.

The only thing I want to say up front is: please don't use non-English variable names and don't write non-English comments!

There are of course many ways to tackle implementing an OPSI renderer. As mentioned above, one of them would be to split the input signals, do some filtering and then use the existing WFS and VBAP renderers as "black boxes" side by side (and add the two results in the end). This might work (or not?), but I think it's better to intermingle them, since, as Jens said above, you'll need to know the WFS delays and apply them to the VBAP part.

General architecture

Let me try to shed some light on the architecture and its ugly API: each MimoProcessor works in different "stages". Each "stage" consists of a "list" of "items". The processing within one "stage" typically happens in parallel (the "items" of one "list" are potentially processed in parallel), but all threads wait for each other between "stages".

By default, each MimoProcessor has two "stages" (and corresponding "lists"): Input and Output. Their processing happens automatically, but derived classes can define their own nested Input and Output classes and can implement some processing there.

All SSR renderers have an additional "stage": Source. The corresponding "list" is called _source_list. Each renderer has to take care to trigger the processing of this list at the right time. You'll find a corresponding line in each renderer.
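As a toy illustration of that idea (this is of course not how MimoProcessor is really implemented, it only shows the ordering and the implicit barrier between stages):

```cpp
// Toy illustration (not the real MimoProcessor): the items of one stage can
// be processed in parallel, but all work of one stage finishes before the
// next stage starts.
#include <functional>
#include <thread>
#include <vector>

using Item = std::function<void()>;

void run_stage(const std::vector<Item>& items)
{
  std::vector<std::thread> workers;
  for (const auto& item : items)
    workers.emplace_back(item);      // items of one stage run in parallel
  for (auto& w : workers) w.join();  // implicit barrier between stages
}

void process_block(const std::vector<Item>& inputs,
                   const std::vector<Item>& sources,
                   const std::vector<Item>& outputs)
{
  run_stage(inputs);    // Input stage
  run_stage(sources);   // Source stage (_source_list in the SSR renderers)
  run_stage(outputs);   // Output stage
}
```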

Most SSR renderers have only the three aforementioned "stages"; if you want to see a more complicated one, have a look at the NFC-HOA renderer. Both the WFS and the VBAP renderer have just these three "stages": Input, Source and Output.

You'll notice that the WFS renderer has another peculiar class named SourceChannel and the renderer itself is awkwardly derived from SourceToOutput, which is defined in src/rendererbase.h. I agree that those are horrible names and since there is no documentation it's basically impossible to understand what that's all about. But let me try to explain: Each Source instance contains a list of SourceChannel objects, one for each Output. You can think of the SourceChannel as a "link" between a Source and an Output. This connection is made automatically whenever a new Source is created. You only have to know there is a sourcechannels list in each Source (containing one pointer-to-SourceChannel for each Output) and there is a sourcechannels list in each Output (one pointer for each Source).
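In code, that linkage looks roughly like this (heavily simplified, not the real class definitions):

```cpp
// Simplified sketch of the linkage described above (not the real classes):
// every Source owns one SourceChannel per Output, and every Output keeps
// pointers to "its" SourceChannel of every Source.
#include <list>
#include <vector>

struct SourceChannel
{
  float weight = 0.0f;  // per source/output weighting factor
  float delay = 0.0f;   // per source/output delay (in samples)
};

struct Source
{
  std::vector<SourceChannel> sourcechannels;  // one entry per Output
};

struct Output
{
  std::list<SourceChannel*> sourcechannels;   // one pointer per Source
};
```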

Now let's look at the two relevant renderers and check what we need to know for the OPSI renderer, shall we?

The WFS renderer

I'm just going through src/wfsrenderer.h here ...

The main WfsRenderer class most notably contains the WFS pre-filter coefficients (which are transformed to the frequency domain in the very beginning). In our WFS implementation that filter never changes and the same instance is re-used all the time. There is also some information about the cross-fade and the settings for the maximum delay and the initial delay that Jens already mentioned above.

The main class also takes care of processing the _source_list, as mentioned above. But that's it, the real work is done in the sub-classes.

WfsRenderer::Input

The Input class contains a convolver and a delay line. And here, the first interesting processing happens: The incoming audio data is passed to the convolver, the convolver convolves and the result is written to the delay line. And here we also have the first problem: in the OPSI renderer the pre-filter should only be applied to the WFS part, not to the VBAP part. I think it's best to remove the filtering here and put the unfiltered signal into the delay line. For WFS, it doesn't matter where the filtering is done, you can do it later (but don't forget!).
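As a sketch of what I mean (hypothetical types, not the actual convolver/delay line API):

```cpp
// Sketch with hypothetical types (not the actual APF convolver/delay line):
// the OPSI Input would write the unfiltered block into the delay line and
// leave the pre-filter for later, in the WFS branch only.
#include <deque>
#include <vector>

struct SimpleDelayLine
{
  std::deque<float> samples;
  void write_block(const std::vector<float>& block)
  {
    samples.insert(samples.end(), block.begin(), block.end());
  }
};

struct OpsiInputSketch
{
  SimpleDelayLine delayline;

  void process(const std::vector<float>& block)
  {
    // WFS renderer: block -> pre-filter convolver -> delay line
    // OPSI idea:    block ------------------------> delay line
    delayline.write_block(block);
  }
};
```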

WfsRenderer::SourceChannel

As mentioned above, the SourceChannel is kind of a "link" between sources and outputs. It contains a reference to its Source. And it contains variables for the weighting factor and the delay that's relevant for the corresponding source/output combination. It doesn't do any processing on its own, but it defines an update() function that will be called by somebody else (see below).

WfsRenderer::RenderFunction

This will be called by each Output. It in fact defines four functions that decide four things: whether a given SourceChannel is used at all (select()), how its samples are weighted during the fade-out, how they are weighted during the fade-in, and what is updated in between (update()).

The two in the middle are trivial, just a simple multiplication. And between fade-out and fade-in, the update() function is called for each SourceChannel. This update() function just adjusts the delay that is used to read from the delay line. The select() function, however, does much more than merely selecting whether a given SourceChannel should be used or not. It actually does most of the WFS-specific computations: given a single SourceChannel, it calculates the weight and delay that are supposed to be applied. You can simply re-use this for the OPSI renderer, but it would be nice to encapsulate that in a function of some sort instead of copy-pasting the code. Note that the RenderFunction is only defined here, it hasn't been called yet ...
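Just to give a rough idea of what select() computes for one source/output pair (this is only a sketch; the real driving function also distinguishes point/plane/focused sources and has additional factors):

```cpp
// Rough sketch (not the actual SSR code) of the kind of computation that
// select() performs for one source/output pair in the WFS renderer:
// weight and delay follow from the geometry of source and loudspeaker.
#include <algorithm>
#include <cmath>

struct Position { float x, y; };

void compute_weight_and_delay(const Position& source,
                              const Position& loudspeaker,
                              float sample_rate,
                              float& weight, float& delay_samples,
                              float speed_of_sound = 343.0f)
{
  float dx = loudspeaker.x - source.x;
  float dy = loudspeaker.y - source.y;
  float distance = std::sqrt(dx * dx + dy * dy);

  // placeholder amplitude decay; the real WFS driving function also
  // depends on the source type and contains further geometric terms
  weight = 1.0f / std::max(distance, 0.1f);

  delay_samples = distance / speed_of_sound * sample_rate;
}
```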

WfsRenderer::Output

Let's skip over this for a second ...

WfsRenderer::Source

The Source has a reference to the delay line of its Input. You can ignore get_output_levels() for now, that's not essential. The important thing is the processing, which is done in _process(). The only thing that is done here is to check whether the current source is focused or not. To do that, we have to iterate over all loudspeakers. We can't do this later, when we iterate over the outputs anyway, so we have to iterate twice. Again, you can re-use this for the OPSI renderer.

WfsRenderer::Output

OK, back to the Output. This is where it all comes together. It doesn't look like a lot, but it kicks off all the stuff from the RenderFunction and neatly cross-fades everything as necessary. Under the covers, this iterates over all sourcechannels, checks which of them are relevant (using select()), applies some weighting and crossfading and in-between updates the time where we read from the delay line (using update()).

The VBAP renderer

... coming soon ...

I'm making a break here, but I will continue within the next few days. Feel free to ask new questions in the meantime!

mgeier commented 7 years ago

So let's continue, shall we?

The VBAP renderer

I'm following src/vbaprenderer.h.

Unlike the main WFS class, this class is not derived from the strange SourceToOutput thingy and it doesn't have a SourceChannel class, so it's probably simpler?

The "process" function of the main class is a bit more complicated than in WFS. After processing the inputs (which is done automatically), there seem to be some loudspeaker-angle-related calculations. Let's not worry too much about them now, but they'll have to be done in the OPSI renderer, too, I guess. After that, the _source_list is processed. And after that (again automatically) the output list. There is also a LoudspeakerEntry and a custom comparison function for sorting, but let's ignore that for now.

VbapRenderer::Source

In the process function of the Source, the two closest loudspeakers and their respective weights are determined. You'll see the VBAP equations in _calculate_loudspeaker_weights(). Note that no audio processing has happened yet.
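For reference, the core of the 2D case looks roughly like this (a sketch, not the actual _calculate_loudspeaker_weights()):

```cpp
// Rough sketch of 2D VBAP: the source direction is written as a linear
// combination of the two neighbouring loudspeaker directions, and the
// resulting gains are normalized to constant power.
#include <cmath>

struct Vec2 { float x, y; };

// returns false if the 2x2 system is (nearly) singular
bool vbap_pair_gains(Vec2 source_dir, Vec2 ls1, Vec2 ls2,
                     float& g1, float& g2)
{
  // solve [ls1 ls2] * [g1 g2]^T = source_dir
  float det = ls1.x * ls2.y - ls2.x * ls1.y;
  if (std::abs(det) < 1e-6f) return false;

  g1 = ( source_dir.x * ls2.y - source_dir.y * ls2.x) / det;
  g2 = (-source_dir.x * ls1.y + source_dir.y * ls1.x) / det;

  // constant-power normalization
  float norm = std::sqrt(g1 * g1 + g2 * g2);
  if (norm > 0.0f) { g1 /= norm; g2 /= norm; }
  return true;
}
```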

VbapRenderer::RenderFunction

The RenderFunction only defines three functions, there is no update() this time. It's a bit different compared to the one in the WFS renderer, because this one doesn't use a crossfade but parameter interpolation instead. For the OPSI renderer, you should decide on one or the other. I think it's easiest to use crossfading. The select() function is a bit shorter than for WFS, it just juggles a bit with the two weighting factors.

VbapRenderer::Output

Here again, the process function looks boring, but it kicks off all the audio processing. Under the hood, it iterates over all sources and calls select() for each. In there, it is checked if the current loudspeaker is one of the two relevant ones for the current source. If yes, the input signal is weighted and added to the output. And instead of crossfading when the source position changes, the weighting factors are interpolated linearly and a new interpolated value is used for each sample.
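The interpolation itself amounts to something like this (a minimal sketch, not the actual APF helper):

```cpp
// Minimal sketch of the idea in the VBAP renderer: when the weight changes,
// it is interpolated linearly over one block so that every sample gets its
// own interpolated gain instead of crossfading two signals.
#include <vector>

void apply_interpolated_gain(std::vector<float>& block,
                             float old_gain, float new_gain)
{
  const float step = (new_gain - old_gain) / static_cast<float>(block.size());
  float gain = old_gain;
  for (auto& sample : block)
  {
    sample *= gain;
    gain += step;
  }
}
```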

A possible OPSI renderer

As mentioned above, there are many ways to implement an OPSI renderer, I'm just mentioning a few ideas here.

I would mainly follow the WFS renderer (because it seems to be the more complicated one) and would therefore use SourceToOutput and SourceChannel and crossfading instead of linear interpolation.

Input

Just write into the delay line, move the pre-filter to after the "split".

Source

I guess this could be just a combination of the Source classes of WFS and VBAP renderer.

SourceChannel

Instead of directly "forwarding" the output of the delay line, I think this should do the low-pass filtering for the WFS part, and it should probably also do the pre-filtering. It could probably do the convolution and provide the resulting buffer via its begin() and end() member functions, but using a fancy iterator (I'm thinking of apf::transform_iterator) that handles the IIR filtering of the low-pass.

In addition to the stuff from the WFS renderer, this should probably provide a second delay time that's slightly shifted to compensate for the pre-filter's group delay (as Jens mentioned above). And it can probably also provide this as a fancy iterator that includes the IIR high-pass filter. Since SourceChannel can only have one pair of begin() and end(), you'll probably have to create a member object that has begin() and end() itself. But let's not dive too deeply for now ...

Output

Let's be optimistic and say that the splitting into high-pass and low-pass signals worked. Now we'll have to find a way to combine them again ...

Initially, I thought about using a single "combiner" that takes care of both at the same time. It might still be an option, but I have the feeling that it might be easier to use two "combiners". The "WFS combiner" would go over the sourcechannels and do whatever the WFS renderer does, except that now it would automatically get the low-pass filtered signal. The "VBAP combiner" would not go directly over the sourcechannels, but somehow over those ominous sub-objects that have a slightly shifted delay and incorporate the IIR high-pass filter.

But there is a problem ... The "combiner" needs a "range" to work with (i.e. something with begin() and end()), and each element has to produce something that again has begin() and end() (which will provide the actual audio samples). I'm sure there is some iterator magic that can be done here, but I'm not sure yet what exactly that could look like.

And there is another problem ... The "combiners", as they are currently implemented, clear the output buffer first. But that's not what we want for the second "combiner"! I guess this can be solved by extending the current "combiners" with an initialization parameter that sets the _accumulate flag, which is currently only used internally. There also seem to be _accumulate_fade_in and _accumulate_fade_out flags, whose purpose I don't really remember; I'll have to look into that.

But I guess if those two things are solved, it should work!

Does any of what I said make sense? Although I've written quite a bit, there are still a lot of additional details to consider. Feel free to ask more questions!

If you want to follow my suggestions from above, I think the best way forward would be to make the mentioned changes to the WFS part and completely ignore the VBAP part for the moment. As a first step I would probably just move the pre-filter to a later point and leave everything else exactly as in the WFS renderer. There is already a lot that can go wrong. But if you manage to move the pre-filter and afterwards the renderer still works, that's a nice achievement and you can tackle the next thing, e.g. adding the IIR low-pass. And only when that is working, you should start trying to add the second "combiner" ...

mgeier commented 7 years ago

I thought about it a bit more and I think it is not feasible to implement the IIR filters intermingled with the crossfading. I think it's best to implement the IIR filters after the crossfade is done. Since the filter coefficients don't change, the filtering can be moved to after the crossfade.

I suggest the following change to what I've written above: The IIR filters should not be implemented in SourceChannel but in the Output class. For a given output, the WFS contributions of all sources are added up (with a "combiner") directly into the final output buffer. Thereafter, this buffer is filtered in-place with the low-pass IIR filter. Probably the FIR pre-filter can also be applied here? The VBAP contributions are added up (with another "combiner") into a separate buffer (which I didn't mention above) and this separate buffer is filtered with the high-pass IIR filter and the result is added to the final output.

This way, no changes have to be made to the "combiner", but we need an additional audio buffer in the Output class.
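In (pseudo-)code, the Output processing I have in mind would look roughly like this (the helper functions are placeholders, not the real "combiner" API):

```cpp
// Sketch of the proposed Output processing (hypothetical helper names):
// WFS contributions go straight into the output buffer and are low-pass
// filtered in place, VBAP contributions go into an extra buffer that is
// high-pass filtered and then added.
#include <cstddef>
#include <vector>

struct OpsiOutputSketch
{
  std::vector<float> vbap_buffer;  // the additional buffer mentioned above

  void process(std::vector<float>& output_buffer)
  {
    // 1) "WFS combiner": sum the (delayed, weighted) WFS contributions of
    //    all sources into output_buffer
    combine_wfs_contributions(output_buffer);

    // 2) filter the summed WFS signal in place (IIR low-pass, and possibly
    //    also the FIR pre-filter, since the coefficients never change)
    lowpass_inplace(output_buffer);

    // 3) "VBAP combiner": sum the VBAP contributions into the extra buffer
    vbap_buffer.assign(output_buffer.size(), 0.0f);
    combine_vbap_contributions(vbap_buffer);
    highpass_inplace(vbap_buffer);

    // 4) add the high-passed VBAP part to the low-passed WFS part
    for (std::size_t i = 0; i < output_buffer.size(); ++i)
      output_buffer[i] += vbap_buffer[i];
  }

  // placeholders for the steps discussed in this thread
  void combine_wfs_contributions(std::vector<float>&) {}
  void combine_vbap_contributions(std::vector<float>&) {}
  void lowpass_inplace(std::vector<float>&) {}
  void highpass_inplace(std::vector<float>&) {}
};
```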

BTW, iterating over the sourcechannels list while accessing a data member of the SourceChannel objects can easily be done with an apf::transform_iterator or an apf::transform_proxy. There is an example for that in the unit tests.

With that, I guess the two problems I mentioned above are solved.

rowalz commented 7 years ago

Thank you for your extensive considerations. Unfortunately I haven't been able to go through them in the necessary depth yet, because I have a lot of other things to do at the moment. I have also decided to leave the OPSI renderer out of my bachelor thesis (which was the original reason for the implementation) for various reasons, so the pressure is much reduced now. But I will probably return to the matter and get back to this issue.