Closed ErichZimmer closed 2 years ago
Implemented in example/process by passing a contracted image_view to the peak search routine (image_views implement the same interface as image however are a view onto the raw data held by image).
Test: process 1k x 1k images with 32x32 interrogation area, 2x2px shift giving 247009 vectors; tests run four times on macbook pro M1PRO with 8 cores:
giving a 10% improvement.
Huh, that was less than what I expected. On my version of a peak searching method that uses three nested loops (one for number of peaks, the other two for y and x axis), it provided anywhere between 15% to 72% improvement (depends on size of search area).
To be clear: the timings above are very crude and are the result of running time
on the example process application and so includes everything: image loading, grid building, correlation, peak finding, sub-pixel estimation and writing to a CSV output file. Incidentally, it looks like the M1PRO is about 13x faster than my previous i7-1065. The improvement for the peak detection I would expect to be ~50% improvement as I suspect the peak find is going to be roughly O(n) pixels inspected and 1/4 number of sub image extractions.
If you are interested, here is the code for my peak searching method, which is faster than core::find_peaks
for 6 or less peaks.
using namespace openpiv;
struct is_peak_struct{
uint32_t h = 0;
uint32_t w = 0;
core::g_f val = 0.0;
bool ispeak = false;
};
core::peaks_t<core::g_f> find_peaks_brute( const core::gf_image& im, uint16_t num_peaks, uint32_t peak_radius )
{
core::peaks_t<core::g_f> result;
const uint32_t result_w = 2*peak_radius + 1;
const uint32_t result_h = result_w;
// create vector of peak indexes
auto bl = im.rect().bottomLeft();
core::g_f previous_max = 9999999999.99;
is_peak_struct temp_peak;
temp_peak.ispeak = false;
for (auto i=num_peaks; i>0; --i)
{
for ( uint32_t h=peak_radius; h<im.height()-2*peak_radius; ++h )
{
const core::g_f* above = im.line( h-1 );
const core::g_f* line = im.line( h );
const core::g_f* below = im.line( h+1 );
for ( uint32_t w=peak_radius; w<im.width()-peak_radius; ++w )
{
if ( line[w-1] < line[w] && line[w+1] < line[w] && above[w] < line[w] && below[w] < line[w] && // is local peak?
line[w] > temp_peak.val && line[w] < previous_max) // is correct local peak?
{
temp_peak.h = h;
temp_peak.w = w;
temp_peak.val = line[w];
temp_peak.ispeak = true;
}
}
}
if (!temp_peak.ispeak)
break;
else
{
result.emplace_back(
core::extract(
im,
core::rect( {bl[0] + temp_peak.w - peak_radius, bl[1] + temp_peak.h - peak_radius}, {result_w, result_h} )
)
);
previous_max = temp_peak.val;
temp_peak.val = 0.0;
temp_peak.ispeak = false;
}
}
// result.resize(num_peaks);
return result;
}
Interesting... and I think this demonstrates one of the classic issues with optimizing code: from inspection I wouldn't have automatically guessed this would be faster as you're traversing the image data num_peaks times i.e. for an nxn interrogation area you're doing ~num_peaksnn reads... so what's going on here? I guess because the image data is small (1024bytes for 32x32 8bit image) is easily cacheable so there's effectively minimal extra cost for multiple data traversals.
I wonder how it performs with a larger interrogation area?
Anyway, it looks like a good alternative, so please fork, add you change plus any tests then raise a PR.
Thanks!
I did a quick test for an image of 4000x6000 px.
Interrogation window size = 256x256 Overlap = 50%
benchmark for num_peaks = 3, thread_count = 1:
find_peaks
: 4.96 s ± 132 ms
find_peaks_brute
2.99 s ± 56.6 ms
benchmark for num_peaks = 3, thread_count = 4:
find_peaks
: 2.46 s ± 71 ms
find_peaks_brute
1.78 s ± 120 ms
benchmark for num_peaks = 3, thread_count = 4, limited peak search area:
find_peaks
: 538 ms ± 24.9 ms
find_peaks_brute
461 ms ± 57.8 ms
Interrogation window size = 512x512 Overlap = 50%
benchmark for num_peaks = 3, thread_count = 4, limited peak search area:
find_peaks
: 132 ms ± 13.6 ms
find_peaks_brute
446 ms ± 90 ms
It appears that you are correct about the interrogation windows being easily cached until a certain size and its impact on performance.
Surprisingly, both are faster than the demo versions of commercial software and Fluere (find_peaks
being ~ 6x faster and find_peaks_brute
being ~ 2x faster when interrogation windows are 512x512 px)
Do we still want this "brute force" algorithm for locating peaks (if so, then I'll allocate some time soon)? I find that when using a limited peak search area (which I believe should always be used) there is at most a 15% increase in performance with the new algorithm.
Any and all improvements are more than welcome - please fork & raise a PR! Thanks!
Enable an option for limiting peak search area for interrogation window lengths/widths greater than 12 pixels. The limited peak search should follow the one-quarter rule, similar to what Fluere does. A side effect of performing said operation is that less peaks have to be sorted and dealt with, making the overall peak finding algorithm noticably faster.