Chowdhury-DSP / chowdsp_fft

5 stars 0 forks source link

Using chowdsp::fft::FFT_REAL may cause crashes #1

Open lewloiwc opened 1 month ago

lewloiwc commented 1 month ago

Hello.

I happened to find your article when I searched for "PFFFT avx", and since I was using single-precision PFFFT, I was trying to switch to chowdsp_fft. However, as a result, it started crashing about once every two times, and after a bit of investigation, it seems to be happening probably when it's chowdsp::fft::FFT_REAL.

I was testing with code like this that displays the frequency response on Windows 10:

#include <cstdint>
#include <cmath>
#include <windows.h>
#include <Windowsx.h>

#ifdef DEFINE_PFFFT_MODE
    #include "../../../lib/pffft-master/pffft.h"
#else
    #include "../../../lib/chowdsp_fft-main/chowdsp_fft.h"
#endif

#define DEFINE_W 640
#define DEFINE_H 480

class chowdsp_fft_test {
    public:
        int32_t N;
        #ifdef DEFINE_PFFFT_MODE
            PFFFT_Setup *setup;
        #else
            void *setup;
        #endif
        float *input;
        float *output;

        float display_min;
        float display_max;
        uint32_t time;
        uint32_t cnt;
        float buf[32768];

        void create(int32_t FFT_SIZE,double DISPLAY_MIN,double DISPLAY_MAX) {
            N = FFT_SIZE;
            int32_t cplx = 0;
            int32_t Nfloat = (cplx ? N*2 : N);
            int32_t Nbytes = Nfloat*sizeof(float);
            #ifdef DEFINE_PFFFT_MODE
                setup = pffft_new_setup(N,cplx ? PFFFT_COMPLEX : PFFFT_REAL);
                input = (float *)pffft_aligned_malloc(Nbytes);
                output = (float *)pffft_aligned_malloc(Nbytes);
            #else
                setup = chowdsp::fft::fft_new_setup(N,cplx ? chowdsp::fft::FFT_COMPLEX : chowdsp::fft::FFT_REAL);
                input = (float *)chowdsp::fft::aligned_malloc(Nbytes);
                output = (float *)chowdsp::fft::aligned_malloc(Nbytes);
            #endif
            memset(input,0,Nbytes);
            memset(output,0,Nbytes);

            display_min = DISPLAY_MIN;
            display_max = DISPLAY_MAX;
            time = 32768 - 1;
        }

        void destroy(void) {
            #ifdef DEFINE_PFFFT_MODE
                pffft_destroy_setup(setup);
                pffft_aligned_free(input);
                pffft_aligned_free(output);
            #else
                chowdsp::fft::fft_destroy_setup(setup);
                chowdsp::fft::aligned_free(input);
                chowdsp::fft::aligned_free(output);
            #endif
        }

        void input_audio(float x) {
            buf[cnt] = x;
            if (cnt < time) cnt++; else cnt = 0;
        }

        int32_t get_x(int32_t index) {
            double hz = std::max(48000.0/N*index,0.1);
            double x = (log(std::abs(hz)) - log(20.0))/(log(20000.0/20.0)/(DEFINE_W - 1));
            return std::min(std::max((int32_t)floor(x + 0.5),-10),DEFINE_W + 10);
        }

        float get_y(int32_t index) {
            float y;
            if (index == 0)
                y = std::abs(output[0]);
            else if (index == N >> 1)
                y = std::abs(output[1]);
            else
                y = std::hypot(output[index*2],output[index*2 + 1]);
            float dB = -20/log(10)*(DEFINE_H - 1.0)/(-display_min + display_max);
            float dB_offset = (DEFINE_H - 1)*(display_max/(-display_min + display_max));
            int32_t t = floor(log(y)*dB + dB_offset + 0.5);
            return std::min(std::max(t,-1),DEFINE_H + 1);
        }

        void proc(HDC DC) {
            for (int32_t i = 0;i < N;i++) {
                auto index = (cnt + (32768 - N) + i)%32768;
                float window = 0.5 - 0.5*cos(2.0*3.141592*i/N);
                input[i] = buf[index]*window/N*4.0;
            }

            #ifdef DEFINE_PFFFT_MODE
                pffft_transform_ordered(setup,input,output,NULL,PFFFT_FORWARD);
            #else
                chowdsp::fft::fft_transform(setup,input,output,NULL,chowdsp::fft::FFT_FORWARD);
            #endif

            POINT point[N/2 + 1];
            for (int32_t i = 0;i < N/2 + 1;i++) {
                point[i].x = get_x(i);
                point[i].y = get_y(i);
            }
            Polyline(DC,point,N/2 + 1);
        }
};

chowdsp_fft_test FS;
uint64_t cnt;
double phase;

LRESULT CALLBACK WindowProc(HWND hwnd, UINT uMsg, WPARAM wParam, LPARAM lParam) {
    switch (uMsg) {
        case WM_CREATE: {
            FS.create(8192,-80,20);
            return 0;
        }
        case WM_DESTROY: {
            FS.destroy();
            PostQuitMessage(0);
            return 0;
        }
        case WM_PAINT: {
            PAINTSTRUCT ps;
            HDC dc_main = BeginPaint(hwnd, &ps);
            HDC dc_mem = CreateCompatibleDC(dc_main);
            RECT rcClientRect;
            GetClientRect(hwnd,&rcClientRect);
            int32_t w = rcClientRect.right - rcClientRect.left;
            int32_t h = rcClientRect.bottom - rcClientRect.top;
            HBITMAP dc_mem_bmp = CreateCompatibleBitmap(dc_main,w,h);
            HBITMAP dc_mem_bmp_old = (HBITMAP)SelectObject(dc_mem,dc_mem_bmp);
            //--------------------------------
            for (uint32_t i = 0;i < 48000.0/60;i++) {
                double y = 0;

                y += sin(2.0*3.141592*(20.0)/48000.0*cnt);
                y += sin(2.0*3.141592*(632.45)/48000.0*cnt);
                y += sin(2.0*3.141592*(20000.0)/48000.0*cnt);

                double lfo = sin(2.0*3.141592*(0.25)/48000.0*cnt)*0.5 + 0.5;
                phase += 3.141592*(10000*lfo)/48000.0;
                y += sin(phase);

                double random_double = 2.0 * (std::rand() / static_cast<double>(RAND_MAX)) - 1.0;
                y += random_double*0.1;

                FS.input_audio(y);

                cnt++;
            }

            SetDCPenColor(dc_mem,RGB(30,30,30));
            SetDCBrushColor(dc_mem,RGB(30,30,30));
            SelectObject(dc_mem,GetStockObject(DC_PEN));
            SelectObject(dc_mem,GetStockObject(DC_BRUSH));
            Rectangle(dc_mem,0,0,DEFINE_W,DEFINE_H);

            SetDCPenColor(dc_mem,RGB(200,200,200));
            SetDCBrushColor(dc_mem,RGB(200,200,200));
            SelectObject(dc_mem,GetStockObject(DC_PEN));
            SelectObject(dc_mem,GetStockObject(DC_BRUSH));
            FS.proc(dc_mem);
            //--------------------------------
            BitBlt(dc_main,0,0,w,h,dc_mem,0,0,SRCCOPY);
            SelectObject(dc_mem,dc_mem_bmp_old);
            DeleteObject(dc_mem_bmp);
            DeleteDC(dc_mem);

            EndPaint(hwnd, &ps);
        }
        case WM_MOUSEMOVE: {
            return 0;
        }
        case WM_TIMER: {
            RedrawWindow(hwnd,0,0,RDW_INVALIDATE);
            return 0;
        }
        default: {
            return DefWindowProc(hwnd, uMsg, wParam, lParam);
        }
    }
}

int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nCmdShow) {
    const char CLASS_NAME[]  = "Sample Window Class";

    WNDCLASS wc = {};
    wc.lpfnWndProc   = WindowProc;
    wc.hInstance     = hInstance;
    wc.lpszClassName = CLASS_NAME;
    wc.hCursor       = LoadCursor(NULL,IDC_ARROW);
    RegisterClass(&wc);

    DWORD dwStyle = WS_OVERLAPPEDWINDOW;
    RECT rect = {0, 0, DEFINE_W, DEFINE_H};
    AdjustWindowRect(&rect, dwStyle, FALSE);
    int windowWidth = rect.right - rect.left;
    int windowHeight = rect.bottom - rect.top;

    HWND hwnd = CreateWindowEx(0,CLASS_NAME,"chowdsp_fft_test",dwStyle,50,50,windowWidth,windowHeight,NULL,NULL,hInstance,NULL);
    if (hwnd == NULL) return 0;
    SetTimer(hwnd,1,15,NULL);
    ShowWindow(hwnd, nCmdShow);

    MSG msg = {};
    while (GetMessage(&msg, NULL, 0, 0)) {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }

    return 0;
}

I build chowdsp_fft by entering the following in PowerShell:

mkdir build;cmake -B build -G "MinGW Makefiles" -DCMAKE_CXX_FLAGS="-s -DNDEBUG -O3 -mavx2 -mfma";cmake --build build

I compile the PFFFT version by entering the following in PowerShell:

g++ -mavx2 -mfma a.cpp -o a.exe -mwindows -L../../../lib/pffft-master/build -lpffft -DDEFINE_PFFFT_MODE

I compile the chowdsp_fft version by entering the following in PowerShell:

g++ -mavx2 -mfma a.cpp -o a.exe -mwindows -L../../../lib/chowdsp_fft-main/build -lchowdsp_fft -lchowdsp_fft_avx

If it starts up normally, a screen like this will be displayed: image

The PFFT version works fine no matter how many times I launch the application, but the chowdsp_fft version crashes about half the time immediately after startup. Could you reproduce this bug on your end? I'm not really knowledgeable about programming, so I might be building the library incorrectly. Windows 10 | gcc (x86_64-win32-seh-rev0, Built by MinGW-W64 project) 12.2.0

Also, this is a feature request, but I'd like functions that don't perform internal zreorder for faster processing, and functions that perform convolution. Currently, I'm adding an argument like int ordered = 1 to fft_transform myself, or using simple convolution without SIMD, but I think it would be convenient if these were included in the library from the start.

//Additional arguments.
void fft_transform (void* setup, const float* input, float* output, float* work, fft_direction_t direction, int ordered = 1);

//Create as a separate function (which is faster, adding arguments or creating a new function?)
void fft_transform_unordered (void* setup, const float* input, float* output, float* work, fft_direction_t direction);

//Simple unordered convolution
for (uint32_t i = 0;i < Nfloat;i += 16) {
    for (uint32_t j = 0;j < 8;j++) {
        float ar = a[i + j];
        float ai = a[i + j + 8];
        float br = b[i + j];
        float bi = b[i + j + 8];
        ab[i + j] = (ar*br - ai*bi)*scaling;
        ab[i + j + 8] = (ar*bi + ai*br)*scaling;
    }
}

Lastly, thank you for releasing such a wonderful library!!


P.S.

I’ve created a simpler version of the code for testing:

#include <cstdint>
#include <cstring>
#include <cmath>
#include <iostream>

#ifdef DEFINE_PFFFT_MODE
    #include "../../../lib/pffft-master/pffft.h"
#else
    #include "../../../lib/chowdsp_fft-main/chowdsp_fft.h"
#endif

int main() {
    std::cout << "---- start ----" << "\n";

    #ifdef DEFINE_PFFFT_MODE
        PFFFT_Setup *setup;
    #else
        void *setup;
    #endif
    float *input;
    float *output;
    uint32_t N = 256;
    uint32_t cplx = 0;
    uint32_t Nfloat = (cplx ? N*2 : N);
    uint32_t Nbytes = Nfloat*sizeof(float);
    #ifdef DEFINE_PFFFT_MODE
        setup = pffft_new_setup(N,cplx ? PFFFT_COMPLEX : PFFFT_REAL);
        input = (float *)pffft_aligned_malloc(Nbytes);
        output = (float *)pffft_aligned_malloc(Nbytes);
    #else
        setup = chowdsp::fft::fft_new_setup(N,cplx ? chowdsp::fft::FFT_COMPLEX : chowdsp::fft::FFT_REAL);
        input = (float*)chowdsp::fft::aligned_malloc(Nbytes);
        output = (float*)chowdsp::fft::aligned_malloc(Nbytes);
    #endif
    memset(input,0,Nbytes);
    memset(output,0,Nbytes);

    for (uint32_t i = 0;i < Nfloat;i++) {
        input[i] = sin(i*0.2);
    }

    #ifdef DEFINE_PFFFT_MODE
        pffft_transform_ordered(setup,input,output,NULL,PFFFT_FORWARD);
    #else
        chowdsp::fft::fft_transform(setup,input,output,NULL,chowdsp::fft::FFT_FORWARD);
    #endif

    std::cout << output[0] << "\n";

    #ifdef DEFINE_PFFFT_MODE
        pffft_destroy_setup(setup);
        pffft_aligned_free(input);
        pffft_aligned_free(output);
    #else
        chowdsp::fft::fft_destroy_setup(setup);
        chowdsp::fft::aligned_free(input);
        chowdsp::fft::aligned_free(output);
    #endif

    std::cout << "---- end ----" << "\n";

    return 0;
}

I compile the PFFFT version by entering the following in PowerShell:

g++ -mavx2 -mfma a.cpp -o a.exe -L../../../lib/pffft-master/build -lpffft -DDEFINE_PFFFT_MODE

I compile the chowdsp_fft version by entering the following in PowerShell:

g++ -mavx2 -mfma a.cpp -o a.exe -L../../../lib/chowdsp_fft-main/build -lchowdsp_fft -lchowdsp_fft_avx

When executed successfully, it outputs like this:

---- start ----
1.62004
---- end ----

The PFFFT version always runs successfully, but the chowdsp_fft version crashes about half of the time, outputting only ---- start ----. This problem doesn't occur when using chowdsp::fft::FFT_COMPLEX.

I tried to fix it myself and found a strange solution, but I really don't understand why this prevents the crash. I removed all pffft_real_finalize_8x8 and wrote it directly like this:

@@ chowdsp_fft_impl_avx.cpp @@
-   pffft_real_finalize_8x8 (zero, zero, in + 1, e, out);
+   {
+       __m256 in0_ = zero;
+       __m256 in1_ = zero;
+       const __m256* in_ = in + 1;
+       const __m256* e_ = e;
+       __m256* out_ = out;
+
+       __m256 r0, i0, r1, i1, r2, i2, r3, i3, r4, i4, r5, i5, r6, i6, r7, i7;
+       __m256 sr00, dr00, sr01, dr01, sr10, dr10, sr11, dr11, si00, di00, si01, di01, si10, di10, si11, di11;
+       __m256 r00, i00, r01, i01, r02, i02, r03, i03, r10, i10, r11, i11, r12, i12, r13, i13, r11_m, i11_m, r13_m, i13_m;
+       r0 = in0_;
+       i0 = in1_;
+       r1 = *in_++;
+       i1 = *in_++;
+       r2 = *in_++;
+       i2 = *in_++;
+       r3 = *in_++;
+       i3 = *in_++;
+       r4 = *in_++;
+       i4 = *in_++;
+       r5 = *in_++;
+       i5 = *in_++;
+       r6 = *in_++;
+       i6 = *in_++;
+       r7 = *in_++;
+       i7 = *in_++;
+       transpose8 (r0, r1, r2, r3, r4, r5, r6, r7);
+       transpose8 (i0, i1, i2, i3, i4, i5, i6, i7);
+
+       cplx_mul_v (r1, i1, e_[0], e_[1]);
+       cplx_mul_v (r2, i2, e_[2], e_[3]);
+       cplx_mul_v (r3, i3, e_[4], e_[5]);
+       cplx_mul_v (r4, i4, e_[6], e_[7]);
+       cplx_mul_v (r5, i5, e_[8], e_[9]);
+       cplx_mul_v (r6, i6, e_[10], e_[11]);
+       cplx_mul_v (r7, i7, e_[12], e_[13]);
+
+       sr00 = _mm256_add_ps (r0, r4);
+       dr00 = _mm256_sub_ps (r0, r4);
+       sr01 = _mm256_add_ps (r2, r6);
+       dr01 = _mm256_sub_ps (r6, r2);
+       sr10 = _mm256_add_ps (r1, r5);
+       dr10 = _mm256_sub_ps (r1, r5);
+       sr11 = _mm256_add_ps (r3, r7);
+       dr11 = _mm256_sub_ps (r7, r3);
+
+       si00 = _mm256_add_ps (i0, i4);
+       di00 = _mm256_sub_ps (i0, i4);
+       si01 = _mm256_add_ps (i2, i6);
+       di01 = _mm256_sub_ps (i6, i2);
+       si10 = _mm256_add_ps (i1, i5);
+       di10 = _mm256_sub_ps (i1, i5);
+       si11 = _mm256_add_ps (i3, i7);
+       di11 = _mm256_sub_ps (i7, i3);
+
+       r00 = _mm256_add_ps (sr00, sr01);
+       i00 = _mm256_add_ps (si00, si01);
+       r01 = _mm256_add_ps (dr00, di01);
+       i01 = _mm256_sub_ps (dr01, di00);
+       r02 = _mm256_sub_ps (sr00, sr01);
+       i02 = _mm256_sub_ps (si00, si01);
+       r03 = _mm256_sub_ps (dr00, di01);
+       i03 = _mm256_add_ps (dr01, di00);
+
+       r10 = _mm256_add_ps (sr10, sr11);
+       i10 = _mm256_add_ps (si10, si11);
+       r11 = _mm256_add_ps (dr10, di11);
+       i11 = _mm256_sub_ps (dr11, di10);
+       r12 = _mm256_sub_ps (sr11, sr10);
+       i12 = _mm256_sub_ps (si11, si10);
+       r13 = _mm256_sub_ps (dr10, di11);
+       i13 = _mm256_add_ps (dr11, di10);
+
+       r11_m = mul_scalar (_mm256_add_ps (r11, i11), M_SQRT1_2);
+       i11_m = mul_scalar (_mm256_sub_ps (i11, r11), M_SQRT1_2);
+       r13_m = mul_scalar (_mm256_sub_ps (i13, r13), M_SQRT1_2);
+       i13_m = mul_scalar (_mm256_add_ps (r13, i13), -M_SQRT1_2);
+
+       r0 = _mm256_add_ps (r00, r10);
+       i0 = _mm256_add_ps (i00, i10);
+       r4 = _mm256_sub_ps (r02, i12);
+       i4 = _mm256_add_ps (i02, r12);
+       r2 = _mm256_sub_ps (r03, i13_m);
+       i2 = _mm256_add_ps (r13_m, i03);
+       r6 = _mm256_sub_ps (r01, r11_m);
+       i6 = _mm256_sub_ps (i11_m, i01);
+       r1 = _mm256_add_ps (r01, r11_m);
+       i1 = _mm256_add_ps (i01, i11_m);
+       r5 = _mm256_add_ps (r03, i13_m);
+       i5 = _mm256_sub_ps (r13_m, i03);
+       r3 = _mm256_add_ps (r02, i12);
+       i3 = _mm256_sub_ps (r12, i02);
+       r7 = _mm256_sub_ps (r00, r10);
+       i7 = _mm256_sub_ps (i10, i00);
+
+       *out_++ = r0;
+       *out_++ = i0;
+       *out_++ = r1;
+       *out_++ = i1;
+       *out_++ = r2;
+       *out_++ = i2;
+       *out_++ = r3;
+       *out_++ = i3;
+       *out_++ = r4;
+       *out_++ = i4;
+       *out_++ = r5;
+       *out_++ = i5;
+       *out_++ = r6;
+       *out_++ = i6;
+       *out_++ = r7;
+       *out_++ = i7;
+   }

I think SIMD complex and unordered convolution might look something like this (I might be wrong):

void zconvolve(uint32_t Nfloat,const float* a,const float* b,float* ab,float scaling) {
    const __m256 *va = reinterpret_cast<const __m256 *>(a);
    const __m256 *vb = reinterpret_cast<const __m256 *>(b);
    __m256 *vab = reinterpret_cast<__m256 *>(ab);
    __m256 vscaling = _mm256_set1_ps(scaling);

    for (uint32_t i = 0;i < Nfloat;i += 16) {
        __m256 var = *va++;
        __m256 vai = *va++;
        __m256 vbr = *vb++;
        __m256 vbi = *vb++;

        *vab++ =
            _mm256_mul_ps(
                _mm256_fmsub_ps(
                    var,
                    vbr,
                    _mm256_mul_ps(vai,vbi)
                ),
                vscaling
            );

        *vab++ =
            _mm256_mul_ps(
            _mm256_fmadd_ps(
                    var,
                    vbi,
                    _mm256_mul_ps(vai,vbr)
                ),
                vscaling
            );
    }
}
jatinchowdhury18 commented 1 month ago

Thanks for the comprehensive error report!

I tried building the original code that you shared, and it worked fine on my end. I was able to run the app 10 times without crashing. I was building with the MSVC compiler, so my guess is that the crash is happening because of some discrepancy between MSVC and the MINGW gcc compiler that you're using. The fact that writing the finalize_8x8 method inline seems to fix the crash is interesting... I don't have a theory for why that would be at the moment.

Out of curiosity, could you try calling fft_new_setup with use_avx_if_available set to false? Just trying to see if the same crash happens with the SSE implementation.

Adding methods for the unordered FFT and convolution is a good idea as well. I'm planning to spend some time this week cleaning up and publishing my test code, so I probably won't get around to adding those methods until next week, but if you feel comfortable with your implementations, feel free to make a pull request! If you'd like to discuss those methods further, please open a separate issue, so we can keep this one focused on the crash.

lewloiwc commented 1 month ago

I tried various things today as well, but so far these are the findings:

jatinchowdhury18 commented 1 month ago

Thanks for the additional information!

After a bit more research, it seems that there may be a fundamental compatibility issue with MINGW. Apparently MINGW doesn't support 32-byte stack alignment, which is necessary for the AVX FFT implementation to work reliably. I was reading about it in a series of Stack Overflow issues, starting here, but I haven't been able to find very much recent information on this issue.

I guess I might need to add a compiler flag to block the AVX code from compiling on MINGW... Of course, I imagine you were hoping to use the AVX implementation for a performance boost, so that's probably not the answer you were hoping for :(.

If anyone has additional information on getting AVX intrinsics reliably working on MINGW, I'd be very interested to hear it!

lewloiwc commented 1 month ago

I understand. So there was a compatibility issue with MINGW.

While this is likely a very uncertain and unstable workaround, as mentioned in the Stack Overflow answer, I've managed to get it working by adding __attribute__((always_inline)) to force inlining of the pffft_real_finalize_8x8 method, which is currently causing the problem. I think I'll personally use this method until a proper solution is found!