Plugin consumes even more CPU when idle

AnClark commented 8 months ago

Hi Wasted Audio Team,

I've encountered a strange issue when using WSTD EQ in REAPER for Linux. While the plugin is processing audio, CPU usage is below 1.0% on average. However, when I click "Stop" in REAPER, CPU usage jumps to 7.0%.

See the following screenshots:

[Screenshot: CPU usage while processing]

[Screenshot: CPU usage on idle]

My system environment:

PC: ThinkPad R400
CPU: Intel(R) Core(TM)2 Duo CPU P9500 @ 2.53GHz
OS: Arch Linux
DAW: REAPER v6.83
WSTD EQ version: v1.0 (official release)

AnClark commented 8 months ago

Here is the Linux perf data I captured while testing with REAPER (VST3 edition). I left it idle for quite a while.

perf.data.tar.gz.

Here are some screenshots:

[Screenshots: perf output]

dromer commented 8 months ago

Hey @AnClark thank you for the detailed report.

This will take some dedicated time to figure out. I haven't seen this issue before and I can't directly think of what could cause it.

It might be related to the framework we use. Have you used any other DPF-based plugins before that show a similar load increase when the transport is stopped?

dromer commented 8 months ago

Hmm, so from your perf inspection it seems that the biquad filters take a lot of time on your machine.

I'm quickly trying this with the v1.0 release in REAPER on my AMD Ryzen 5 (quite a bit more performant than your ancient C2D). And I don't see any such discrepancies:

[Screenshot: CPU usage on idle]

[Screenshot: CPU usage while active]

Occasionally I see a tiny "jump", when stopping, to 0.03% but it quickly goes down to 0.02% again. Not sure how else I could reproduce.

I am considering enabling SSE4.1 for all plugins this year, which should give a near 4x performance increase. This instruction set is supported on C2D at least. Maybe we can do some preliminary tests to see if it improves things a bit for you.

dromer commented 8 months ago

Btw it seems your perf.data file is incompatible with my system, so I cannot read the output myself.

I'm guessing those visual stats are also an extra feature of that version, which I don't seem to have.

AnClark commented 8 months ago

> Occasionally I see a tiny "jump", when stopping, to 0.03% but it quickly goes down to 0.02% again.
>
> Not sure how else I could reproduce.

Here's another way you can reproduce the issue:

1. Add a new track, and load WSTD EQ;
2. Create a new MIDI item, and add JSFX "White Noise Generator" to Take FX;
3. Switch on repeat (activate the "Toggle Repeat" button);
4. Play.

WSTD EQ still consumes more CPU after I stop playback.

It's strange that during processing the biquad filter works perfectly. Only when I stop the transport does the filter begin to consume more CPU.

Is it possible that some unusual sample values are being processed by the DSP, causing it to misbehave?

AnClark commented 8 months ago

> It might be related to the framework we use. Have you used any other DPF based plugins before that show a similar load increase when transport is stopped?

Yes. I'm porting some LV2 plugins to DPF. Both of the following plugins used to have a similar issue:

Both of them have a Moog-style filter. When I stop the transport, the filter increases the CPU load tremendously. I haven't figured out why this happens yet, so I just made a workaround: bypass the filters when the oscillators are not sending samples to them.

dromer commented 8 months ago

> Here's another way you can reproduce the issue:
>
> 1. Add a new track, and load WSTD EQ;
> 2. Create a new MIDI item, and add JSFX "White Noise Generator" to Take FX;
> 3. Switch on repeat (activate the "Toggle Repeat" button);
> 4. Play.

I tried following these instructions. I have a MIDI item, then JS: White Noise Generator, then VST3: WSTD EQ. With this selection on repeat, CPU usage doesn't get beyond 0.03% whether the transport is playing or stopped.

AnClark commented 8 months ago

I’ve checked __hv_biquad_f() in generated code.

It seems that newer CPUs like your Ryzen 5 can use the code paths optimized with AVX or SSE4.1, while my ancient C2D only supports SSE and SSE2, so it falls back to this simple scalar implementation:

  const float y = bIn*bX0 + o->xm1*bX1 + o->xm2*bX2 - o->ym1*bY1 - o->ym2*bY2;
  o->xm2 = o->xm1; o->xm1 = bIn;
  o->ym2 = o->ym1; o->ym1 = y;
  *bOut = y;

However, it's still strange: this implementation performs quite well while the transport is running, but CPU load increases when the transport stops.

Full `__hv_biquad_f()` code in WSTD EQ

```c++
#if _WIN32 && !_WIN64
void __hv_biquad_f_win32(SignalBiquad *o, hv_bInf_t *_bIn, hv_bInf_t *_bX0, hv_bInf_t *_bX1, hv_bInf_t *_bX2, hv_bInf_t *_bY1, hv_bInf_t *_bY2, hv_bOutf_t bOut) {
  hv_bInf_t bIn = *_bIn;
  hv_bInf_t bX0 = *_bX0;
  hv_bInf_t bX1 = *_bX1;
  hv_bInf_t bX2 = *_bX2;
  hv_bInf_t bY1 = *_bY1;
  hv_bInf_t bY2 = *_bY2;
#else
void __hv_biquad_f(SignalBiquad *o, hv_bInf_t bIn, hv_bInf_t bX0, hv_bInf_t bX1, hv_bInf_t bX2, hv_bInf_t bY1, hv_bInf_t bY2, hv_bOutf_t bOut) {
#endif
#if HV_SIMD_AVX
  __m256 x = _mm256_permute_ps(bIn, _MM_SHUFFLE(2,1,0,3)); // [3 0 1 2 7 4 5 6]
  __m256 y = _mm256_permute_ps(o->x, _MM_SHUFFLE(2,1,0,3)); // [d a b c h e f g]
  __m256 n = _mm256_permute2f128_ps(y,x,0x21); // [h e f g 3 0 1 2]
  __m256 xm1 = _mm256_blend_ps(x, n, 0x11); // [h 0 1 2 3 4 5 6]

  x = _mm256_permute_ps(bIn, _MM_SHUFFLE(1,0,3,2)); // [2 3 0 1 6 7 4 5]
  y = _mm256_permute_ps(o->x, _MM_SHUFFLE(1,0,3,2)); // [c d a b g h e f]
  n = _mm256_permute2f128_ps(y,x,0x21); // [g h e f 2 3 0 1]
  __m256 xm2 = _mm256_blend_ps(x, n, 0x33); // [g h 0 1 2 3 4 5]

  __m256 a = _mm256_mul_ps(bIn, bX0);
  __m256 b = _mm256_mul_ps(xm1, bX1);
  __m256 c = _mm256_mul_ps(xm2, bX2);
  __m256 d = _mm256_add_ps(a, b);
  __m256 e = _mm256_add_ps(c, d); // bIn*bX0 + o->x1*bX1 + o->x2*bX2

  float y0 = e[0] - o->ym1*bY1[0] - o->ym2*bY2[0];
  float y1 = e[1] - y0*bY1[1] - o->ym1*bY2[1];
  float y2 = e[2] - y1*bY1[2] - y0*bY2[2];
  float y3 = e[3] - y2*bY1[3] - y1*bY2[3];
  float y4 = e[4] - y3*bY1[4] - y2*bY2[4];
  float y5 = e[5] - y4*bY1[5] - y3*bY2[5];
  float y6 = e[6] - y5*bY1[6] - y4*bY2[6];
  float y7 = e[7] - y6*bY1[7] - y5*bY2[7];

  o->x = bIn;
  o->ym1 = y7;
  o->ym2 = y6;

  *bOut = _mm256_set_ps(y7, y6, y5, y4, y3, y2, y1, y0);
#elif HV_SIMD_SSE
  __m128 n = _mm_blend_ps(o->x, bIn, 0x7); // [a b c d] [e f g h] = [e f g d]
  __m128 xm1 = _mm_shuffle_ps(n, n, _MM_SHUFFLE(2,1,0,3)); // [d e f g]
  __m128 xm2 = _mm_shuffle_ps(o->x, bIn, _MM_SHUFFLE(1,0,3,2)); // [c d e f]

  __m128 a = _mm_mul_ps(bIn, bX0);
  __m128 b = _mm_mul_ps(xm1, bX1);
  __m128 c = _mm_mul_ps(xm2, bX2);
  __m128 d = _mm_add_ps(a, b);
  __m128 e = _mm_add_ps(c, d);

  const float *const bbe = (float *) &e;
  const float *const bbY1 = (float *) &bY1;
  const float *const bbY2 = (float *) &bY2;

  float y0 = bbe[0] - o->ym1*bbY1[0] - o->ym2*bbY2[0];
  float y1 = bbe[1] - y0*bbY1[1] - o->ym1*bbY2[1];
  float y2 = bbe[2] - y1*bbY1[2] - y0*bbY2[2];
  float y3 = bbe[3] - y2*bbY1[3] - y1*bbY2[3];

  o->x = bIn;
  o->ym1 = y3;
  o->ym2 = y2;

  *bOut = _mm_set_ps(y3, y2, y1, y0);
#elif HV_SIMD_NEON
  float32x4_t xm1 = vextq_f32(o->x, bIn, 3);
  float32x4_t xm2 = vextq_f32(o->x, bIn, 2);

  float32x4_t a = vmulq_f32(bIn, bX0);
  float32x4_t b = vmulq_f32(xm1, bX1);
  float32x4_t c = vmulq_f32(xm2, bX2);
  float32x4_t d = vaddq_f32(a, b);
  float32x4_t e = vaddq_f32(c, d);

  float y0 = e[0] - o->ym1*bY1[0] - o->ym2*bY2[0];
  float y1 = e[1] - y0*bY1[1] - o->ym1*bY2[1];
  float y2 = e[2] - y1*bY1[2] - y0*bY2[2];
  float y3 = e[3] - y2*bY1[3] - y1*bY2[3];

  o->x = bIn;
  o->ym1 = y3;
  o->ym2 = y2;

  *bOut = (float32x4_t) {y0, y1, y2, y3};
#else
  const float y = bIn*bX0 + o->xm1*bX1 + o->xm2*bX2 - o->ym1*bY1 - o->ym2*bY2;
  o->xm2 = o->xm1; o->xm1 = bIn;
  o->ym2 = o->ym1; o->ym1 = y;
  *bOut = y;
#endif
}
```
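To make the scalar fallback easier to reason about in isolation, here is a minimal standalone sketch of the same recurrence. `BiquadState`, `BiquadCoeffs`, `biquadTick` and `biquadImpulseResponse` are hypothetical names chosen for illustration and are not part of the HVCC-generated code; only the arithmetic mirrors the `#else` branch quoted above. Note that the feedback terms (`ym1`, `ym2`) keep the filter working on its own past outputs even when the input is all zeros.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the scalar state kept in SignalBiquad.
struct BiquadState {
    float xm1 = 0.0f;  // input delayed by one sample
    float xm2 = 0.0f;  // input delayed by two samples
    float ym1 = 0.0f;  // output delayed by one sample
    float ym2 = 0.0f;  // output delayed by two samples
};

// Hypothetical coefficient bundle; fields follow the bX*/bY* arguments.
struct BiquadCoeffs {
    float x0, x1, x2;  // feed-forward coefficients
    float y1, y2;      // feedback coefficients (subtracted, as in the source)
};

// One sample of the recurrence used by the scalar fallback.
inline float biquadTick(BiquadState& s, const BiquadCoeffs& c, float in) {
    const float y = in * c.x0 + s.xm1 * c.x1 + s.xm2 * c.x2
                  - s.ym1 * c.y1 - s.ym2 * c.y2;
    s.xm2 = s.xm1; s.xm1 = in;
    s.ym2 = s.ym1; s.ym1 = y;
    return y;
}

// Feeds a unit impulse followed by silence and collects the output.
inline std::vector<float> biquadImpulseResponse(const BiquadCoeffs& c,
                                                std::size_t length) {
    assert(length > 0);
    BiquadState s;
    std::vector<float> out(length);
    for (std::size_t n = 0; n < length; ++n)
        out[n] = biquadTick(s, c, n == 0 ? 1.0f : 0.0f);
    return out;
}
```

For example, `biquadImpulseResponse(BiquadCoeffs{1.0f, 0.0f, 0.0f, -0.5f, 0.0f}, 4)` keeps halving the previous output on every sample after the impulse, although the input has already gone silent.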

dromer commented 8 months ago

As I said we do not build with SIMD optimizations yet (only on ARM).

Your CPU should support SSE4.1 which I might enable later this year. C2D is about 15 years old now.

You could try this optimization by adding -msse4.1 to the CXXFLAGS in the plugin/source/Makefile.

AnClark commented 8 months ago

I have a newer ThinkPad X201 Tablet. It has a first-generation Core processor (Core i7 L 640).

I enabled -msse4.1 and tested again. Even though the SIMD instructions reduced CPU usage by 1.0% on idle, the problem still exists.

It sounds like the problem lies in the algorithm itself.

AnClark commented 8 months ago

For reference, here's a Moog-style filter from RaffoSynth, which has the same problem as I described:

// does the same as the asm version
void equalizer(float* buffer, float* prev_vals, uint32_t sample_count, float psuma0, float psuma2, float psuma3, float ssuma0, float ssuma1, float ssuma2, float ssuma3, float factorSuma2){
  float psuma1 = psuma0 * 2;
  for (uint32_t i = 0; i < sample_count; i++) {
    // low-pass filter

    float temp = buffer[i];
    buffer[i] *= psuma0;    // psuma0 == factorsuma1
    buffer[i] += psuma0 * prev_vals[0] + psuma1 * prev_vals[1]
                    + psuma2 * prev_vals[2] + psuma3 * prev_vals[3];
    prev_vals[0] = prev_vals[1];
    prev_vals[1] = temp;

    // peaking EQ (resonance)
    float temp2 = buffer[i];

    buffer[i] *= factorSuma2;
    buffer[i] += ssuma0 * prev_vals[2] + ssuma1 * prev_vals[3]
                    + ssuma2 * prev_vals[4] + ssuma3 * prev_vals[5];
    prev_vals[2] = prev_vals[3];
    prev_vals[3] = temp;
    prev_vals[4] = prev_vals[5];
    prev_vals[5] = buffer[i];
  }
}

dromer commented 8 months ago

I got a hint from FalkTX on what could be going on. Can you perhaps try the following?

To the top of WSTD_EQ/plugin/source/HeavyDPF_WSTD_EQ.cpp add

#include "extra/ScopedDenormalDisable.hpp"

And in the run function set the following:

  const ScopedDenormalDisable sdd;
  const TimePosition& timePos(getTimePosition());

Rebuild and try again.
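For context on what a scoped denormal guard generally does: on x86, the usual approach is to set the flush-to-zero (FTZ) and denormals-are-zero (DAZ) bits of the MXCSR register for the duration of the audio callback and restore the previous value afterwards. The following is a minimal sketch of that idea, not DPF's implementation (see the header included above for the real class). `DenormalGuardSketch`, `kSketchHasMxcsr` and the two helper functions are hypothetical names; the sketch assumes an x86 CPU with SSE and DAZ support, and compiles to a no-op elsewhere.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>  // _mm_getcsr, _mm_setcsr
  #define SKETCH_HAS_MXCSR 1
#else
  #define SKETCH_HAS_MXCSR 0
#endif
#include <cfloat>

constexpr bool kSketchHasMxcsr = (SKETCH_HAS_MXCSR != 0);

// Hypothetical RAII guard: enables FTZ/DAZ on construction and restores
// the previous control word on destruction. No-op without MXCSR.
class DenormalGuardSketch {
public:
    DenormalGuardSketch() noexcept {
#if SKETCH_HAS_MXCSR
        saved_ = _mm_getcsr();
        _mm_setcsr(saved_ | 0x8040u);  // bit 15 = FTZ, bit 6 = DAZ
#endif
    }
    ~DenormalGuardSketch() noexcept {
#if SKETCH_HAS_MXCSR
        _mm_setcsr(saved_);
#endif
    }
    DenormalGuardSketch(const DenormalGuardSketch&) = delete;
    DenormalGuardSketch& operator=(const DenormalGuardSketch&) = delete;

private:
    unsigned int saved_ = 0;
};

// Halves the smallest normal float while the guard is active. The exact
// result is subnormal, so with FTZ enabled it is flushed to zero.
inline float halveSmallestNormalGuarded() {
    const DenormalGuardSketch guard;
    volatile float tiny = FLT_MIN;  // volatile blocks constant folding
    volatile float half = 0.5f;
    volatile float result = tiny * half;
    return result;
}

// Checks that the control word is the same before and after a guarded scope.
inline bool guardRestoresControlWord() {
#if SKETCH_HAS_MXCSR
    const unsigned int before = _mm_getcsr();
    {
        const DenormalGuardSketch guard;
    }
    return _mm_getcsr() == before;
#else
    return true;
#endif
}
```

Declaring the guard as the first statement of the processing function, as in the snippet above, means the mode is active for the whole block and is restored before control returns to the host.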

AnClark commented 8 months ago

@dromer OK. I'll try it tonight (BJT) and report back.

dromer commented 8 months ago

@AnClark you can try this build when it finishes: https://github.com/Wasted-Audio/wstd-eq/actions/runs/7431093334

AnClark commented 8 months ago

> I got a hint from FalkTX on what could be going on. Can you perhaps try the following?
>
> To the top of WSTD_EQ/plugin/source/HeavyDPF_WSTD_EQ.cpp add
>
> #include "extra/ScopedDenormalDisable.hpp"
>
> And in the run function set the following:
>
>   const ScopedDenormalDisable sdd;
>   const TimePosition& timePos(getTimePosition());
>
> Rebuild and try again.

Great! After adding those lines and building with the -O3 flag in CXXFLAGS, the problem is resolved. CPU usage is now about 0.6% on idle.

AnClark commented 8 months ago

> @AnClark you can try this build when it finishes: https://github.com/Wasted-Audio/wstd-eq/actions/runs/7431093334

I've also tested your build.

Your build performs better than mine: CPU usage does not exceed 0.5% on idle. So disabling denormal numbers really works.
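Why would a filter get more expensive once its input goes silent? A likely explanation, consistent with the fix: when the input stops, the feedback state decays toward zero and ends up in the subnormal (denormal) float range, where many CPUs, older ones in particular, handle arithmetic much more slowly. The sketch below only illustrates the decay, not the timing. It feeds one impulse followed by silence into a two-pole feedback recurrence and reports how small the output gets. `DecayReport` and `runSilentTail` are hypothetical names, and the coefficients (poles at radius 0.9) are made up for the example rather than taken from WSTD EQ.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <cstddef>

// Hypothetical summary of one run.
struct DecayReport {
    float headPeak;              // largest |y| over the first 16 samples
    float tailPeak;              // largest |y| over the last 16 samples
    std::size_t subnormalCount;  // outputs classified as FP_SUBNORMAL
};

// Runs y[n] = x[n] - bY1*y[n-1] - bY2*y[n-2] on an impulse plus silence.
inline DecayReport runSilentTail(std::size_t total) {
    assert(total >= 32);
    // Example feedback coefficients only: complex poles at radius 0.9.
    const float bY1 = -1.2f;
    const float bY2 = 0.81f;
    float ym1 = 0.0f;
    float ym2 = 0.0f;
    DecayReport report{0.0f, 0.0f, 0};
    for (std::size_t n = 0; n < total; ++n) {
        const float x = (n == 0) ? 1.0f : 0.0f;  // impulse, then silence
        const float y = x - ym1 * bY1 - ym2 * bY2;
        ym2 = ym1;
        ym1 = y;
        const float mag = std::fabs(y);
        if (n < 16 && mag > report.headPeak) report.headPeak = mag;
        if (n + 16 >= total && mag > report.tailPeak) report.tailPeak = mag;
        if (std::fpclassify(y) == FP_SUBNORMAL) ++report.subnormalCount;
    }
    return report;
}
```

Calling `runSilentTail(4000)` and printing `subnormalCount` shows how many output samples fell into the subnormal range; the exact count depends on the floating-point mode, and with FTZ/DAZ enabled those values would be flushed to zero instead.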

dromer commented 8 months ago

Cool! Thank you for confirming. I guess on older systems such as yours this really makes a difference. On my machines I couldn't spot any significant change.

Now comes the question on how to best apply this, as setting this option can potentially break things as well ..

AnClark commented 8 months ago

My pleasure!

It would be better if there were documentation for ScopedDenormalDisable. This is the first time I've come across this API. I wonder whether it has been proven stable by FalkTX and the other contributors.

You could also do more tests on other platforms, including Apple Silicon. None of my machines is newer than a 5th-gen Core i5.

dromer commented 8 months ago

I do not own any Windows or MacOS machines, so doing "proper" testing on those is not possible. What I generally do is pass builds to friends and ask them to report if it works :shrug:

dromer commented 8 months ago

Btw the only documentation for this class is in the code: https://github.com/DISTRHO/DPF/blob/main/distrho/extra/ScopedDenormalDisable.hpp

AnClark commented 8 months ago

I've found a solution: add a new entry to the HVCC JSON metadata (e.g. dpf.enable_denormal_number_fix, or some better name) to control whether this fix is enabled. That way we can apply the fix only to WSTD EQ and leave other products unaffected.

What's more, we could also provide two builds of WSTD EQ starting with the next release: one that applies this fix, and one that stays as-is.

dromer commented 8 months ago

I don't see any reason to provide two completely separate builds of the same plugin, that doesn't make any sense. Either such a patch will be in place, or it won't.

Having it as a configurable option in the json is a nice idea, so it won't be put there automatically for all DPF builds. I'd like to know more about the implications of the patch and how it could disrupt plugin and host behavior before moving forward with a permanent solution.

AnClark commented 8 months ago

Maybe I can help test on Windows (as well as Wine). I have a Hewlett-Packard Pavilion with Windows 11 and MSYS2 installed (though it uses an i7-5500U).

What's more, if WSTD and HVCC had unit tests (or benchmark tests), that would also help a lot.

dromer commented 8 months ago

HVCC does have some testing in place (although not everything works), but that's a discussion for a different project :)

AnClark commented 8 months ago

So how should we approach testing? Maybe we can make a roadmap for testing plugins (not necessarily limited to WSTD EQ), for example by specifying test cases and target DAWs.

Wasted-Audio / wstd-eq

Plugin consumes even more CPU when idle #3