bfraboni / FastGaussianBlur

Fast Gaussian Blur algorithm

STD::THREADS + some improvements in flip_block #9

Open michelerenzullo opened 1 year ago

michelerenzullo commented 1 year ago

Hi, it's been a while since I last contributed, back when we discussed reflection padding and cache improvements.

I'm working on some WebAssembly projects, pthreads, etc., and I want to contribute again. Linking with OpenMP is messy or impossible on some platforms, so we can fall back on standard threads instead. The performance is totally equivalent, since the code is quite simple and "the power of OpenMP" doesn't justify the complexity, the missing implementations on WebAssembly and Android, or the difficulties on Mac (for example here, OpenCV defaults to TBB or pthreads, same as on iOS).

I wrote and properly tested a snippet that lets us switch modes; please feel free to test it and eventually merge it. I also improved the performance of flip_block so it behaves exactly the same when using std::threads. This transpose function was the reason for my request: as it stands, it can cause performance issues when not using OpenMP, which is smart enough to collapse the loops and spread the work evenly across threads despite our increments of +=block.

template <typename T, typename op>
void hybrid_loop(T end, op operation)
{
    auto operation_wrapper = [&](T i, int tid = 0) {
        if constexpr (std::is_invocable_v<op, T>) operation(i);
        else operation(i, tid);
    };
#ifdef __SINGLE__
    for (T i = 0; i < end; ++i) operation_wrapper(i);
#elif __OMP__
#pragma omp parallel for
    for (T i = 0; i < end; ++i) operation_wrapper(i, omp_get_thread_num());
#elif __STD_THREADS__
    const int num_threads = std::thread::hardware_concurrency();
    // Split in block equally for each thread. ex: 3 threads, start = 0, end = 8
        // Thread 1: 0,1,2
        // Thread 2: 3,4,5
        // Thread 3: 6,7
    const T block_size = ((end + num_threads - 1) / num_threads);
    std::vector<std::thread> threads;

    for (int tid = 0; tid < num_threads; ++tid) {
        threads.emplace_back([=]() {
            T block_start = tid * block_size;
            T block_end = std::min<T>(block_start + block_size, end);

            for (T i = block_start; i < block_end; ++i) operation_wrapper(i, tid);
        });
    }

    for (auto &thread : threads) thread.join();
#endif
}

My flip_block:

template <typename T, int C>
void flip_block(const T *in, T *out, const int w, const int h)
{
    // Suppose a square block of L2 cache size = 256KB
    // to be divided for the num of channels and bytes
    const int block = sqrt(262144.0 / (C * sizeof(T)));   // <-- note sqrt and also sizeof(T)
    const int w_blocks = std::ceil(static_cast<float>(w) / block);
    const int h_blocks = std::ceil(static_cast<float>(h) / block);

    hybrid_loop(w_blocks * h_blocks, [&](int n) {
            int x = (n / h_blocks) * block;
            int y = (n % h_blocks) * block;
            const T *p = in + y * w * C + x * C;
            T *q = out + y * C + x * h * C;

            const int blockx = std::min(w, (x + block)) - x;
            const int blocky = std::min(h, (y + block)) - y;
            for (int xx = 0; xx < blockx; xx++)
            {
                for (int yy = 0; yy < blocky; yy++)
                {
                    for (int k = 0; k < C; k++)
                        q[k] = p[k];
                    p += w * C;
                    q += C;
                }
                p += -blocky * w * C + C;
                q += -blocky * C + h * C;
            }
        });
}

I think that your current flip_block might be wrong in the cache calculation, because it doesn't take the square root of the cache size and doesn't account for sizeof(T) (see the note in my snippet above).

Feel free to disagree or add thoughts, but if you can test my snippets (hybrid loop with -D__STD_THREADS__ vs -D__OMP__) and compare against your current code, that might be clearer than reading my verbose notes.

Not directly relevant, but have a look also at my deinterleave and interleave to get a better picture of these cache-friendly operations; there we deal with only one dimension at a time, so it's somewhat easier.

template<typename T, typename U>
void deinterleave_BGR(const T* const interleaved_BGR, U** const deinterleaved_BGR, const uint32_t total_size) {

    // Cache-friendly deinterleave BGR, splitting for blocks of 256 KB, inspired by flip-block
    constexpr float round = std::is_integral_v<U> ? std::is_integral_v<T> ? 0 : 0.5f : 0;
    constexpr uint32_t block = 262144 / (3 * std::max(sizeof(T), sizeof(U)));
    const uint32_t num_blocks = std::ceil(total_size / (float)block);

    hybrid_loop(num_blocks, [&](auto n) {
        const uint32_t x = n * block;
        U* const B = deinterleaved_BGR[0] + x;
        U* const G = deinterleaved_BGR[1] + x;
        U* const R = deinterleaved_BGR[2] + x;
        const T* const interleaved_ptr = interleaved_BGR + x * 3;

        const int blockx = std::min(total_size, x + block) - x;
        for (int xx = 0; xx < blockx; ++xx)
        {
            B[xx] = interleaved_ptr[xx * 3 + 0] + round;
            G[xx] = interleaved_ptr[xx * 3 + 1] + round;
            R[xx] = interleaved_ptr[xx * 3 + 2] + round;
        }
    });

}

template<typename T, typename U>
void interleave_BGR(const U** const deinterleaved_BGR, T* const interleaved_BGR, const uint32_t total_size) {

    constexpr float round = std::is_integral_v<T> ? std::is_integral_v<U> ? 0 : 0.5f : 0;
    constexpr uint32_t block = 262144 / (3 * std::max(sizeof(T), sizeof(U)));
    const uint32_t num_blocks = std::ceil(total_size / (float)block);

    hybrid_loop(num_blocks, [&](auto n) {
        const uint32_t x = n * block;
        const U* const B = deinterleaved_BGR[0] + x;
        const U* const G = deinterleaved_BGR[1] + x;
        const U* const R = deinterleaved_BGR[2] + x;
        T* const interleaved_ptr = interleaved_BGR + x * 3;

        const int blockx = std::min(total_size, x + block) - x;
        for (int xx = 0; xx < blockx; ++xx)
        {
            interleaved_ptr[xx * 3 + 0] = B[xx] + round;
            interleaved_ptr[xx * 3 + 1] = G[xx] + round;
            interleaved_ptr[xx * 3 + 2] = R[xx] + round;
        }
    });

}

Example of deinterleave_BGR:

    std::vector<std::vector<float>> temp(3, std::vector<float>(sizes[0] * sizes[1]));
    float* BGR[3] = { temp[0].data(), temp[1].data(), temp[2].data() };
    deinterleave_BGR((const uint8_t*)padded.get(), BGR, sizes[0] * sizes[1]);
bfraboni commented 3 months ago

Hi @michelerenzullo , I forgot about this one and plan on doing a bit of update in the repo soon, so good time to discuss.

Thanks for the suggestions. I agree that OpenMP portability is not there yet, though I still feel it is the most readable option. I can try to make a generic parallel_for( ... ) version later based on yours.

However, the choice of block size shouldn't start from the total L2 cache size, because the block itself is not contiguous in memory; it is way too large to avoid cache misses. Ideally, all the image lines that are part of a block should fit in the L2 cache to have zero cache misses per block. So I think we could refine your approach to compute the block size and make an even better version. My original choice was a bit arbitrary, but small enough to not witness too many cache misses. I'm curious to compare with your version of the block size, but I expect some perf degradation. Does that make sense?