michelerenzullo opened this issue 1 year ago
Hi @michelerenzullo , I forgot about this one and plan on doing a bit of updating in the repo soon, so it's a good time to discuss.
Thanks for the suggestions. I agree that OpenMP portability is not there yet, though I still feel it is the most readable option. I can try to make a generic parallel_for( ... ) version later based on yours.
However, the choice of block size shouldn't start with the total L2 cache size, because the block itself is not contiguous in memory; it is way too large to avoid cache misses. Ideally, all the image lines that are part of a block should fit in the L2 cache, to have zero cache misses per block. So I think we could refine your approach to computing the block size a bit and make an even better version. My original choice was somewhat arbitrary, but small enough not to cause too many cache misses. I'm curious to compare against your version of the block size, but I expect some perf degradation. Does that make sense?
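For illustration, here is a minimal sketch of that constraint, sizing a block by how many full image lines fit in L2 (the cache-size constant and the helper name are assumptions for the example, not code from the repo):

```cpp
#include <algorithm>
#include <cstddef>

// Assumed L2 size; real code would query it (sysconf/CPUID) or take a parameter.
constexpr std::size_t L2_CACHE_BYTES = 256 * 1024;

// Pick the number of rows per block so that every image line touched by one
// block fits in L2, per the reasoning above. Half the cache is budgeted,
// leaving room for the destination buffer streamed at the same time.
std::size_t rows_per_block(std::size_t width, std::size_t channels,
                           std::size_t bytes_per_sample)
{
    const std::size_t line_bytes = width * channels * bytes_per_sample;
    const std::size_t budget = L2_CACHE_BYTES / 2;
    return std::max<std::size_t>(1, budget / line_bytes);
}
```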
Hi, it's been a long time since I contributed, back when we discussed reflection padding and cache improvements.
I'm working on some WebAssembly projects, pthreads, etc., and I want to contribute again. Linking with OpenMP is sometimes messy or impossible on some platforms, so we could implement a POSIX-standard fallback. The performance is totally equivalent, since the code is quite simple and "the power of OpenMP" doesn't justify the complexity, the missing implementations on WebAssembly and Android, or the difficulties on Mac (for example, here OpenCV defaults to TBB or pthreads, same as on iOS).
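To make the idea concrete, here is a minimal sketch of what such a std::thread-based parallel_for could look like (the name and the chunking scheme are illustrative, not the exact snippet from this issue):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Generic parallel_for: split [begin, end) into roughly equal contiguous
// chunks, one per hardware thread, mimicking "#pragma omp parallel for".
template <typename F>
void parallel_for(std::size_t begin, std::size_t end, F&& body)
{
    const std::size_t n_threads =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    const std::size_t chunk = (end - begin + n_threads - 1) / n_threads;

    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < n_threads; ++t) {
        const std::size_t lo = begin + t * chunk;
        const std::size_t hi = std::min(end, lo + chunk);
        if (lo >= hi) break;
        pool.emplace_back([=] {
            for (std::size_t i = lo; i < hi; ++i) body(i);
        });
    }
    for (auto& th : pool) th.join();
}
```

A call such as `parallel_for(0, height, [&](std::size_t y) { /* process row y */ });` then behaves the same whether it is backed by plain threads or swapped out for OpenMP.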
I wrote and properly tested a snippet where we can switch mode; please feel free to test it and implement it eventually. Furthermore, I improved the performance of `flip_block` so that it is exactly the same when using threads. This transpose function was the reason for my request: as it is now, it can cause performance issues when not using OpenMP, which is smart enough to collapse the loops and spread the work equally among the threads despite our increments of `+= block`.

My flip_block:
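(A minimal sketch of a blocked transpose in this spirit, with the tile size derived from the cache size via a square root; the L2 constant and layout are my assumptions, not the exact snippet:)

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Cache-blocked transpose: copy src (w x h, row-major) into dst (h x w).
// The tile size solves 2 * block^2 * sizeof(T) <= L2, so one source tile
// plus its destination tile stay resident in cache.
template <typename T>
void flip_block(const T* src, T* dst, std::size_t w, std::size_t h)
{
    const std::size_t l2_bytes = 256 * 1024; // assumed L2 size
    const std::size_t block = std::max<std::size_t>(
        1, static_cast<std::size_t>(std::sqrt(l2_bytes / (2.0 * sizeof(T)))));

    for (std::size_t by = 0; by < h; by += block)
        for (std::size_t bx = 0; bx < w; bx += block)
            for (std::size_t y = by; y < std::min(h, by + block); ++y)
                for (std::size_t x = bx; x < std::min(w, bx + block); ++x)
                    dst[x * h + y] = src[y * w + x];
}
```

The outer two loops walk whole tiles, so the iteration space can also be handed to the parallel_for above one tile row at a time, which spreads the work evenly without relying on OpenMP's collapse.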
I think that your current flip_block might be wrong in the cache calculation; compare with the `sqrt` in my code. Feel free to disagree or add thoughts, but if you can test with my snippets (hybrid loop with `-D__STD_THREADS__` vs `-D__OMP__`) and compare with your current code, that might be clearer than reading my verbose notes. Not strictly relevant, but have a look also at my deinterleave and interleave to get a better picture of these cache-friendly operations, and of my opinion when dealing with only one dimension at a time, which is kind of easier...
Example of deinterleave_BGR:
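(Again an illustrative sketch rather than the original snippet: a single linear pass over packed BGR pixels into three planar buffers, which is already cache friendly because only one dimension is traversed; parameter names are assumptions.)

```cpp
#include <cstddef>

// Split packed BGRBGR... samples into three planar channel buffers.
void deinterleave_BGR(const unsigned char* interleaved,
                      unsigned char* B, unsigned char* G, unsigned char* R,
                      std::size_t n_pixels)
{
    for (std::size_t i = 0; i < n_pixels; ++i) {
        B[i] = interleaved[3 * i + 0];
        G[i] = interleaved[3 * i + 1];
        R[i] = interleaved[3 * i + 2];
    }
}
```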