Optional limit peak search

ErichZimmer commented 2 years ago

Enable an option for limiting peak search area for interrogation window lengths/widths greater than 12 pixels. The limited peak search should follow the one-quarter rule, similar to what Fluere does. A side effect of performing said operation is that less peaks have to be sorted and dealt with, making the overall peak finding algorithm noticably faster.

timdewhirst commented 2 years ago

Implemented in example/process by passing a contracted image_view to the peak search routine (image_views implement the same interface as image however are a view onto the raw data held by image).

Test: process 1k x 1k images with 32x32 interrogation area, 2x2px shift giving 247009 vectors; tests run four times on macbook pro M1PRO with 8 cores:

search over full interrogation area: 10.8us/interrogation area
search over reduced interrogation area: 9us/interrogation area

giving a 10% improvement.

ErichZimmer commented 2 years ago

Huh, that was less than what I expected. On my version of a peak searching method that uses three nested loops (one for number of peaks, the other two for y and x axis), it provided anywhere between 15% to 72% improvement (depends on size of search area).

timdewhirst commented 2 years ago

To be clear: the timings above are very crude and are the result of running time on the example process application and so includes everything: image loading, grid building, correlation, peak finding, sub-pixel estimation and writing to a CSV output file. Incidentally, it looks like the M1PRO is about 13x faster than my previous i7-1065. The improvement for the peak detection I would expect to be ~50% improvement as I suspect the peak find is going to be roughly O(n) pixels inspected and 1/4 number of sub image extractions.

ErichZimmer commented 2 years ago

If you are interested, here is the code for my peak searching method, which is faster than core::find_peaks for 6 or less peaks.

using namespace openpiv;

struct is_peak_struct{
    uint32_t h = 0;
    uint32_t w = 0;
    core::g_f val = 0.0;
    bool ispeak = false;
};

core::peaks_t<core::g_f> find_peaks_brute( const core::gf_image& im, uint16_t num_peaks, uint32_t peak_radius )
{
    core::peaks_t<core::g_f> result;
    const uint32_t result_w = 2*peak_radius + 1;
    const uint32_t result_h = result_w;

    // create vector of peak indexes    
    auto bl = im.rect().bottomLeft();

    core::g_f previous_max = 9999999999.99;
    is_peak_struct temp_peak;
    temp_peak.ispeak = false;

    for (auto i=num_peaks; i>0; --i)
    {
        for ( uint32_t h=peak_radius; h<im.height()-2*peak_radius; ++h )
        {
            const core::g_f* above = im.line( h-1 );
            const core::g_f* line = im.line( h );
            const core::g_f* below = im.line( h+1 );

            for ( uint32_t w=peak_radius; w<im.width()-peak_radius; ++w )
            {
                if ( line[w-1] < line[w] && line[w+1] < line[w] && above[w] < line[w]  && below[w] < line[w] && // is local peak?
                     line[w] > temp_peak.val && line[w] < previous_max) // is correct local peak?
                {
                    temp_peak.h = h;
                    temp_peak.w = w;
                    temp_peak.val = line[w];
                    temp_peak.ispeak = true;
                }
            }
        }

        if (!temp_peak.ispeak)
            break;
        else
        {
            result.emplace_back(
                core::extract(
                    im,
                    core::rect( {bl[0] + temp_peak.w - peak_radius, bl[1] + temp_peak.h - peak_radius}, {result_w, result_h} )
                )
            );

            previous_max = temp_peak.val;
            temp_peak.val = 0.0;
            temp_peak.ispeak = false; 
        }
    }
    // result.resize(num_peaks);
    return result;
}

timdewhirst commented 2 years ago

Interesting... and I think this demonstrates one of the classic issues with optimizing code: from inspection I wouldn't have automatically guessed this would be faster as you're traversing the image data num_peaks times i.e. for an nxn interrogation area you're doing ~num_peaksnn reads... so what's going on here? I guess because the image data is small (1024bytes for 32x32 8bit image) is easily cacheable so there's effectively minimal extra cost for multiple data traversals.

I wonder how it performs with a larger interrogation area?

Anyway, it looks like a good alternative, so please fork, add you change plus any tests then raise a PR.

Thanks!

ErichZimmer commented 2 years ago

I did a quick test for an image of 4000x6000 px.

Interrogation window size = 256x256 Overlap = 50%

benchmark for num_peaks = 3, thread_count = 1: find_peaks: 4.96 s ± 132 ms find_peaks_brute 2.99 s ± 56.6 ms

benchmark for num_peaks = 3, thread_count = 4: find_peaks: 2.46 s ± 71 ms find_peaks_brute 1.78 s ± 120 ms

benchmark for num_peaks = 3, thread_count = 4, limited peak search area: find_peaks: 538 ms ± 24.9 ms find_peaks_brute 461 ms ± 57.8 ms

Interrogation window size = 512x512 Overlap = 50%

benchmark for num_peaks = 3, thread_count = 4, limited peak search area: find_peaks: 132 ms ± 13.6 ms find_peaks_brute 446 ms ± 90 ms

It appears that you are correct about the interrogation windows being easily cached until a certain size and its impact on performance.

ErichZimmer commented 2 years ago

Surprisingly, both are faster than the demo versions of commercial software and Fluere (find_peaks being ~ 6x faster and find_peaks_brute being ~ 2x faster when interrogation windows are 512x512 px)

ErichZimmer commented 1 year ago

Do we still want this "brute force" algorithm for locating peaks (if so, then I'll allocate some time soon)? I find that when using a limited peak search area (which I believe should always be used) there is at most a 15% increase in performance with the new algorithm.

timdewhirst commented 1 year ago

Any and all improvements are more than welcome - please fork & raise a PR! Thanks!

OpenPIV / openpiv-c--qt

Optional limit peak search #27