multithreaded code and fftwf_destroy_plan() SIGSEGV

kiplingw commented 5 years ago

I have a number of hardware threads executing simultaneously on different unrelated data. That is, they are not operating concurrently on different parts of the same data buffer.

Each thread allocates its own buffer via fftwf_malloc().
Wisdom is loaded from disk via fftwf_import_wisdom_from_filename().
A plan is initialized for calculating via the halfcomplex format via fftwf_plan_r2r_1d(..., ..., ..., FFTW_R2HC, FFTW_EXHAUSTIVE).
If wisdom was not loaded from disk, it is saved via fftwf_export_wisdom_to_filename().
After the above construction steps are completed, the transform is repeatedly updated in a loop as new signal data is fed into the thread's input window and fftwf_execute() is called after each update.
After the work loop is completed, the signal data is freed via fftwf_free(), the plan is cleaned up with fftwf_destroy_plan(), and lastly fftwf_cleanup().

This works fine for a single thread, but as soon as I have multiple threads performing the above, the cleanup fails on the fftwf_destroy_plan() raising a SIGSEGV.

I'm aware FFTW has some functions to assist with thread safety, but the documentation I saw wasn't clear to me on what to do in the scenario where each thread is operating on totally independent data and plans.

I am using my distro's fftw 3.3.8-2 on amd64 with single precision floating point.

matteo-frigo commented 5 years ago

(It would be better to see code instead of a description in English.)

It sounds like you are computing plans in parallel. The FFTW planner is not threads safe, you must call it only from one thread at the time.

kiplingw commented 5 years ago

Here is an example as the header:

// Multiple include protection...
#ifndef _FAST_FOURIER_TRANSFORM_H_
#define _FAST_FOURIER_TRANSFORM_H_

// Includes...

    // Fastest Fourier Transform in the West...
    #include <fftw3.h>

    // Standard C++ and POSIX headers...
    #include <memory>
    #include <vector>

// Perform a real to real single dimension discrete Fourier transform...
class FastFourierTransform
{
    // Public methods...
    public:

        // Disable copy constructor for now...
        FastFourierTransform(const FastFourierTransform &) = delete;

        // Construct using given size for both input and transformation...
        FastFourierTransform(const size_t WindowSize);

        // Get the calculated transform...
        const std::vector<float> &GetTransform() const { return m_Transform; }

        // Disable assignment for now...
        FastFourierTransform &operator==(const FastFourierTransform &) = delete;

        // Update the transform based on new signal data in the input window...
        void UpdateWindow(const std::vector<float> &Window);

        // Deconstructor...
       ~FastFourierTransform();

    // Protected constants...
    protected:

        // Size of the input signal window...
        const size_t                            m_WindowSize;

    // Protected data...
    protected:

        // Hanning window...
        std::vector<float>                      m_HanningWindow;

        // Fast fourier transform plan...
        fftwf_plan                              m_Plan;

        // Memory aligned signal window...
        float                                  *m_Signal;

        // Output window...
        std::vector<float>                      m_Transform;

        // Path to FFTW optimization wisdom file...
        std::string                             m_WisdomPath;
};

// Multiple include protection...
#endif

And here is the example implementation:

// Construct using given size for both input and transformation...
FastFourierTransform::FastFourierTransform(const size_t WindowSize)
    : m_WindowSize(WindowSize),
      m_HanningWindow(WindowSize),
      m_Plan(nullptr),
      m_Signal(nullptr),
      m_Transform(WindowSize)
{
    // Don't be wasteful...
    m_HanningWindow.shrink_to_fit();
    m_Transform.shrink_to_fit();

    // Be verbose...
    P_VERBOSE_MSG(_("initializing discrete Fourier transform"));

    // Allocate a memory aligned internal buffer for the input signal window...
    m_Signal = static_cast<float *>(::fftwf_malloc(sizeof(float) * WindowSize));

    // Check to see if platform optimization results are cached...

        // Flag on whether cached wisdom was loaded from disk...
        bool WisdomLoaded = false;

        // Get path...
        m_WisdomPath = ::getenv("HOME");
        m_WisdomPath += "/.my_fftw_wisdom";

        // Try to use fixed cached wisdom from disk if this is a unit test to
        //  not break unit tests...
        if(::IsUnitTest())
        {
            // Be verbose...
            P_INFO_MSG(_("using fixed cached wisdom for unit tests..."));

            // Fixed cached wisdom from my test bed...
            const char *Wisdom = \
            "(fftw-3.3.8 fftwf_wisdom #x9e7d4dee #xdb14fed1 #x34bf76a4 #xeb6e8fdf"
            "  (fftwf_codelet_r2cf_128 0 #x1040 #x1040 #x0 #xed7300fd #x3557566c #x686470f8 #xe50104ed)"
            "  (fftwf_codelet_r2cfII_8 0 #x1040 #x1040 #x0 #xa6f06664 #x40ccbe2f #x8d0066f2 #xeafb3f91)"
            "  (fftwf_codelet_r2cf_8 0 #x1040 #x1040 #x0 #xdf5b1846 #xf83b7496 #xf3fed22e #x323f2c44)"
            "  (fftwf_codelet_hf2_8 0 #x1040 #x1040 #x0 #x6a1f82d9 #x9eeab51c #x151d4af4 #xe5c40556)"
            ")";

            // Import...
            if(!::fftwf_import_wisdom_from_string(Wisdom))
                P_ERROR_MSG(_("unable to import fixed wisdom cache..."));
        }

        // Try to load from disk if this is not a unit test...
        else
        {
            // Import...
            WisdomLoaded = ::fftwf_import_wisdom_from_filename(m_WisdomPath.c_str());

                // Failed...
                if(!WisdomLoaded)
                    P_WARNING_MSG(_("no optimization cache available, generating..."));
        }

    // Initialize a plan to calculate the Fourier transform and probe for the
    //  fastest method on this platform. We are calculating via the halfcomplex
    //  format: http://www.fftw.org/doc/The-Halfcomplex_002dformat-DFT.html
    m_Plan = ::fftwf_plan_r2r_1d(
        WindowSize, m_Signal, m_Transform.data(), FFTW_R2HC, FFTW_EXHAUSTIVE);

        // Failed...
        if(!m_Plan)
            throw runtime_error(_("unable to initialize discrete Fourier transform"));

    // Save the optimization hints if they weren't already and this isn't a unit
    //  test...
    if(!WisdomLoaded && !::IsUnitTest())
    {
        // Try to save. If you can't, show a warning. This can and does happen
        //  within sbuilder...
        if(::fftwf_export_wisdom_to_filename(m_WisdomPath.c_str()) == 0)
            P_WARNING_MSG(_("unable to save Fourier transform optimization hints"));
    }

    // Initialize the Hanning window...
    const float HanningAngle = (1.0 / m_WindowSize) * 2.0 * M_PI;
    for(size_t Index = 0; Index < m_HanningWindow.size(); ++Index)
        m_HanningWindow[Index] = 0.5 - 0.5 * ::cos(HanningAngle * Index);
}

// Update the transform based on new signal data in the input window...
void FastFourierTransform::UpdateWindow(const std::vector<float> &Window)
{
    // Input window should be size we constructed with...
    assert(Window.size() == m_WindowSize);

    // Load the new signal window into the memory aligned internal buffer...
    copy(Window.begin(), Window.end(), m_Signal);

    // Scale by the Hanning curve...
    for(size_t Index = 0; Index < m_WindowSize; ++Index)
        m_Signal[Index] *= m_HanningWindow.at(Index);

    // Perform the transform...
  ::fftwf_execute(m_Plan);

    // The returned array contains the sum of all bins at the first index, the
    //  remaining being a mirror image split in the middle. I am not sure what
    //  Bashi is intending to do in squaring the elements...
    m_Transform.at(0) *= m_Transform.at(0);
    for(size_t Index = 1; Index < (m_WindowSize / 2); ++Index)
        m_Transform.at(Index) = 
            ::pow(m_Transform.at(Index), 2) +
            ::pow(m_Transform.at(m_WindowSize - Index), 2);
}

// Deconstructor...
FastFourierTransform::~FastFourierTransform()
{
    // Clean up the input buffer...
  ::fftwf_free(m_Signal);
    m_Signal = nullptr;

    // Clean up the plan... (SIGSEGV here)
  ::fftwf_destroy_plan(m_Plan);
    m_Plan = nullptr;

    // ...and all accumulated optimization wisdom...
  ::fftwf_cleanup();
}

The only solution that I've been able to come up with is a global mutex across all threads such that only one object can exist at a time with the rest blocking while they wait to obtain the global FFTW lock. This works, but it is obviously not efficient.

When you say the planner is not thread safe, I recall reading that in the documentation. I took that to mean reusing the same planner in different threads is not safe, but I'm guessing that all planners apparently share some global state?

matteo-frigo commented 5 years ago

There is only one planner, which is not thread-safe. I.e., you are responsible for locking it.

stevengj commented 5 years ago

As it says in the manual:

the only thread-safe routine in FFTW is fftw_execute

kiplingw commented 5 years ago

There is only one planner, which is not thread-safe. I.e., you are responsible for locking it.

Thanks @matteo-frigo, but I found the manual to not be clear:

So, for example, you can wrap a semaphore lock around any calls to the planner; even more simply, you can just create all of your plans from one thread.

The wording suggests that it is possible to have multiple plans.

All I am trying to do is analyse unrelated data in separate threads. What would you suggest?

stevengj commented 5 years ago

All I am trying to do is analyse unrelated data in separate threads. What would you suggest?

The simplest thing is to call fftw_make_planner_thread_safe() which puts a lock around the non-threadsafe plan creation and destruction calls. Or you can use your own mutex lock.

This serializes the plan creation, of course, but plan execution is still in parallel, and hopefully that is the time-critical part for you.

kiplingw commented 5 years ago

Ok, but I'm still not clear. Do you mean I can have multiple plans at once, with plan creation / destruction synchronized across threads, or do you mean I can have only one plan at a given time and if I want more, they must be created / destroyed in serial?

Also, I seem to have figured something out. If I don't call fftwf_cleanup() in the destructor of each thread after destroying it's plan, my crash disappears. From the documentation I see:

After calling fftw_cleanup, all existing plans become undefined, and you should not attempt to execute them nor to destroy them.

stevengj commented 5 years ago

You can create as many plans as you want across as many threads as you want, as long as plan creation/destruction is protected by a lock so that it occurs in serial.

(Most programs shouldn't call fftw_cleanup at all, since the memory will be deallocated much more efficiently by the operating system once your program exits.)

kiplingw commented 5 years ago

Thank you so much for clarifying @stevengj

As for fftw_cleanup, when you say the OS will cleanup for you at exit, do you mean to say that the library will not cleanup allocated resources on its own, but it's expected the OS will recover the memory? If that's the case, I think it would be better for the program to call fftw_cleanup on exit so that they don't run into issues with valgrind while trying to find memory leaks.

stevengj commented 5 years ago

One reason to call fftw_cleanup is for things like valgrind, but that is for debugging — you wouldn't normally leave it in production code, because letting the OS clean up at exit is more efficient. This is especially true for multithreaded code where you have to be careful to cleanup only after all threads are done with FFTW.

kiplingw commented 5 years ago

Thanks @stevengj. When you say "clean up", do you mean the OS freeing memory the library allocated on the heap without calling free(3)?

stevengj commented 5 years ago

Yes.

FFTW / fftw3

multithreaded code and fftwf_destroy_plan() SIGSEGV #176