breakfastquay / rubberband

Official mirror of Rubber Band Library, an audio time-stretching and pitch-shifting library.
http://breakfastquay.com/rubberband/
GNU General Public License v2.0
566 stars 91 forks source link

v1.9.1: WARNING: MovingMedian: NaN encountered #37

Closed uklotzde closed 3 years ago

uklotzde commented 3 years ago

We are experiencing issues when running one of our test cases:

https://bugs.launchpad.net/mixxx/+bug/1921955

WARNING: MovingMedian: NaN encountered
WARNING: MovingMedian: NaN encountered
WARNING: MovingMedian: NaN encountered
WARNING: MovingMedian: NaN encountered
WARNING: MovingMedian: NaN encountered
WARNING: MovingMedian: NaN encountered
WARNING: MovingMedian: NaN encountered
WARNING: MovingMedian: NaN encountered
WARNING: MovingMedian: NaN encountered
WARNING: MovingMedian: NaN encountered
corrupted size vs. prev_size
cannam commented 3 years ago

Thanks for the report! What can you tell me about the context:

Thank you! I'll take another close look at the regression tests here meanwhile.

cannam commented 3 years ago

This warning does not appear in any of the existing regression tests here for kissfft+libsamplerate or fftw+libsamplerate in Rubber Band 1.9.1 on 64-bit Linux.

cannam commented 3 years ago

Nothing from asan or valgrind. Will await more info before investigating further.

uklotzde commented 3 years ago

Unfortunately I am not familiar with this part of our code base. Just wanted to let everyone know that the new version seems to behave differently. The cause could be anything, either wrong usage on our side, undefined behavior, or algorithmic errors.

I first noticed these failures during our RPM Fusion CI builds (armv7, aarch64, ppc64). But now after recent updates that included v1.9.1 on Fedora 33 I can reproduce the test failures reliably on x86_64, each run fails.

cannam commented 3 years ago

Can I run these tests?

uklotzde commented 3 years ago

Build instructions can be found in the Wiki: Compile Mixxx From Source Code

The tests are build together with the application. After the CMake build the tests are run from the build folder using CTest:

ctest --output-on-failure --tests-regex EngineBufferE2ETest.RubberbandReverseTest
uklotzde commented 3 years ago

The CI builds for all platforms on GitHub still use a previous Rubber Band version and don't fail.

cannam commented 3 years ago

Great, thanks! I can reproduce the failure here. I'm about to clock off for the long weekend, will look into it properly next week.

cannam commented 3 years ago

By the way, although I am always grateful for bug reports no matter how sketchy, it would have been very useful to have had the information you just provided (about how to reproduce it) in the initial report.

With this extra info I could possibly have had a fix by now, or (if not a bug) a suggestion about how to avoid the problem. Without it I just spent quite a lot of time yesterday failing to reproduce the issue, and am now looking to the middle of next week to investigate it properly.

Again I'm only saying this because I hope it can be helpful in future - it's good to have bug reports even if they are tricky to follow up, and I'm happy that you filed it.

cannam commented 3 years ago

I believe this is an unintended consequence of my having switched the default Rubber Band build from FFTW to KissFFT.

The trace from a Valgrind warning before the first NaN is detected (with a debug build of Rubber Band using KissFFT) looks like this at the boundary between Rubber Band and KissFFT

...
==9709==    by 0xD51D50: kiss_fft (in /home/cannam/code/thirdparty/mixxx/build/mixxx-test)
==9709==    by 0xD53913: kiss_fftr (in /home/cannam/code/thirdparty/mixxx/build/mixxx-test)
==9709==    by 0x62E8B34: RubberBand::FFTs::D_KISSFFT::forwardPolar(double const*, double*, double*) (FFT.cpp:2571)
==9709==    by 0x62E6EED: RubberBand::FFT::forwardPolar(double const*, double*, double*) (FFT.cpp:3266)
...

As you see, the call from RubberBand::FFT is actually going back into Mixxx code, using the copy of kiss_fftr that is compiled in with Mixxx rather than the one included in the Rubber Band shared object. This is a problem because Rubber Band expects KissFFT to have been compiled with its default options, using single-precision floats, while the version in Mixxx is compiled with -Dkiss_fft_scalar=double so expects buffers to be double-precision.

Especially ironic is the fact that the Mixxx code only uses that flag because it was needed for the QM DSP library, the relevant part of which was also written by me. So it's my fault whichever way you look at it! Doh.

This sort of thing comes up rather a lot with the not-quite-library status of KissFFT, which is typically vendored in, has a binary interface that depends on compile flags, and provides no easy way to mangle the symbol names. I had actually decided a few weeks ago that the use of KissFFT by default in Rubber Band wasn't such a good idea after all, partly because of such issues and partly because I re-tested it and found it was slower than I had remembered. For 1.9.2 I'm going to undo that decision and ensure that problems like this don't arise in future.

In the mean time, what can be done? Some notes:

I can probably provide some appropriate fix for use in the QM DSP code - I no longer work on that library officially, but I know this sort of problem is frustrating and I would like to see it resolved. Appreciate your views.

uklotzde commented 3 years ago

Thank you for your thorough and extensive analysis!

According to your conclusions the only option we have is switching from kiss_fft_scalar=double to kiss_fft_scalar=float in our build. We are still bundling the QM DSP code including KissFFT.

Test build: https://github.com/mixxxdj/mixxx/pull/3774

uklotzde commented 3 years ago

Unfortunately this doesn't work. We need to keep kiss_fft_scalar=double.

Be-ing commented 3 years ago

@cannam could you make a 1.9.2 release now just to revert the default to FFTW? Otherwise we can tell Linux distros to explicitly build with FFTW.

uklotzde commented 3 years ago

I will prepare a PR for Fedora after testing locally. They are also using speex although libsamplerate is available.

cannam commented 3 years ago

@uklotzde

Unfortunately this doesn't work. We need to keep kiss_fft_scalar=double.

Do you say this just because, with this change in the build, the code doesn't compile? Or have you changed the code so it does compile, but found that the change in precision breaks something?

Just changing the option at build time certainly won't work; it does need a code change too. This one should suffice, I think:

--- a/dsp/transforms/FFT.cpp
+++ b/dsp/transforms/FFT.cpp
@@ -107,18 +107,24 @@ public:
         }
         m_planf = kiss_fftr_alloc(m_n, 0, NULL, NULL);
         m_plani = kiss_fftr_alloc(m_n, 1, NULL, NULL);
+        m_tmp = new kiss_fft_scalar[m_n];
         m_c = new kiss_fft_cpx[m_n];
     }

     ~D() {
         kiss_fftr_free(m_planf);
         kiss_fftr_free(m_plani);
+        delete[] m_tmp;
         delete[] m_c;
     }

     void forward(const double *ri, double *ro, double *io) {

-        kiss_fftr(m_planf, ri, m_c);
+        for (int i = 0; i < m_n; ++i) {
+            m_tmp[i] = ri[i];
+        }
+        
+        kiss_fftr(m_planf, m_tmp, m_c);

         for (int i = 0; i <= m_n/2; ++i) {
             ro[i] = m_c[i].r;
@@ -154,12 +160,12 @@ public:
             m_c[i].i = ii[i];
         }

-        kiss_fftri(m_plani, m_c, ro);
+        kiss_fftri(m_plani, m_c, m_tmp);

         double scale = 1.0 / m_n;

         for (int i = 0; i < m_n; ++i) {
-            ro[i] *= scale;
+            ro[i] = m_tmp[i] * scale;
         }
     }

@@ -167,6 +173,7 @@ private:
     int m_n;
     kiss_fftr_cfg m_planf;
     kiss_fftr_cfg m_plani;
+    kiss_fft_scalar *m_tmp;
     kiss_fft_cpx *m_c;
 };

It will produce slightly different results, which probably won't make a material difference to anything, but might, and it could break regression tests that depend on close values at double precision.

@Be-ing

could you make a 1.9.2 release now just to revert the default to FFTW?

No, I don't think this is a good idea either - there were good reasons to move from FFTW as well for the default build (although it is the best choice for distro packagers who have FFTW available as a known quantity and know they want to link dynamically).

Instead I actually have yet another FFT implementation which I'm going to bring in as the default in the next Rubber Band release - it is already in the bqfft library. This is if anything even more boring than KissFFT - it combines aspects of the ancient Don Cross public-domain FFT and the real-complex shortcut as used in KissFFT - it's simple, not all that fast but quicker for me than KissFFT, reasonably well tested, and easier to use in C++ code without these linkage problems. I think it's a fair compromise.

But on its own this wouldn't perfectly solve your problem either - as long as Mixxx is compiled with a non-standard configuration of kissfft and also linking against other code that might be configured to use the standard, incompatible configuration, then there is still danger (and I am now thinking of all sorts of places where this could be problematic in other code I've worked on too). The above patch is a "safer" approach in this sense, because it puts the solution into your hands rather than relying on external factors.

Perhaps what I should do is update to a self-contained implementation in (a fork of) qm-dsp as well, which you could update your vendored copy of if you wish - then we have both safety and double-precision.

uklotzde commented 3 years ago

I don't think that moving back and forth by patching QM DSP in each project independently is a good idea.

uklotzde commented 3 years ago

https://src.fedoraproject.org/rpms/rubberband/pull-request/1

cannam commented 3 years ago

https://src.fedoraproject.org/rpms/rubberband/pull-request/1

Thank you - I will add a note to the Rubber Band README as well, aimed at distro packagers.

cannam commented 3 years ago

v1.9.2 has the built-in FFT which is now used by default everywhere except on Apple platforms (which have vDSP). There is also a note for packagers in the README. I think this can be closed.