bfraboni / FastGaussianBlur

Fast Gaussian Blur algorithm

STD::THREADS + some improvements in flip_block #9

Open michelerenzullo opened 1 year ago

michelerenzullo commented 1 year ago

Hi, it's been a while since I last contributed, back when we discussed reflection padding and cache improvements.

I'm working on some WebAssembly projects, pthreads, etc., and I want to contribute again. Linking with OpenMP is messy or impossible on some platforms, so we can fall back on standard threads instead. The performance is totally equivalent, since the code is quite simple and "the power of OpenMP" doesn't justify the complexity, the missing implementations on WebAssembly and Android, or the difficulties on Mac (for example here, OpenCV defaults to TBB or pthreads, same as on iOS).

I wrote and properly tested a snippet that lets us switch modes; please feel free to test it and eventually merge it. I also improved the performance of flip_block so it behaves exactly the same when using std::threads. This transpose function was the reason for my request: as it stands, it can cause performance issues when not using OpenMP, which is smart enough to collapse the loops and spread the work evenly across threads despite our increments of +=block.

template <typename T, typename op>
void hybrid_loop(T end, op operation)
{
    auto operation_wrapper = [&](T i, int tid = 0) {
        if constexpr (std::is_invocable_v<op, T>) operation(i);
        else operation(i, tid);
    };
#ifdef __SINGLE__
    for (T i = 0; i < end; ++i) operation_wrapper(i);
#elif __OMP__
#pragma omp parallel for
    for (T i = 0; i < end; ++i) operation_wrapper(i, omp_get_thread_num());
#elif __STD_THREADS__
    const int num_threads = std::thread::hardware_concurrency();
    // Split in block equally for each thread. ex: 3 threads, start = 0, end = 8
        // Thread 1: 0,1,2
        // Thread 2: 3,4,5
        // Thread 3: 6,7
    const T block_size = ((end + num_threads - 1) / num_threads);
    std::vector<std::thread> threads;

    for (int tid = 0; tid < num_threads; ++tid) {
        threads.emplace_back([=]() {
            T block_start = tid * block_size;
            T block_end = std::min<T>(block_start + block_size, end);

            for (T i = block_start; i < block_end; ++i) operation_wrapper(i, tid);
        });
    }

    for (auto &thread : threads) thread.join();
#endif
}

My flip_block:

template <typename T, int C>
void flip_block(const T *in, T *out, const int w, const int h)
{
    // Suppose a square block of L2 cache size = 256KB
    // to be divided for the num of channels and bytes
    const int block = sqrt(262144.0 / (C * sizeof(T)));   // <-- note sqrt and also sizeof(T)
    const int w_blocks = std::ceil(static_cast<float>(w) / block);
    const int h_blocks = std::ceil(static_cast<float>(h) / block);

    hybrid_loop(w_blocks * h_blocks, [&](int n) {
            int x = (n / h_blocks) * block;
            int y = (n % h_blocks) * block;
            const T *p = in + y * w * C + x * C;
            T *q = out + y * C + x * h * C;

            const int blockx = std::min(w, (x + block)) - x;
            const int blocky = std::min(h, (y + block)) - y;
            for (int xx = 0; xx < blockx; xx++)
            {
                for (int yy = 0; yy < blocky; yy++)
                {
                    for (int k = 0; k < C; k++)
                        q[k] = p[k];
                    p += w * C;
                    q += C;
                }
                p += -blocky * w * C + C;
                q += -blocky * C + h * C;
            }
        });
}

I think that your current flip_block might be wrong in the cache calculation, because it doesn't take the square root of the cache size and doesn't account for sizeof(T) (see the note in my snippet above).

Feel free to disagree or add thoughts, but if you can test my snippets (hybrid loop with -D__STD_THREADS__ vs -D__OMP__) and compare against your current code, that might be clearer than reading my verbose notes.

Not directly relevant, but have a look also at my deinterleave and interleave to get a better picture of these cache-friendly operations; there we deal with only one dimension at a time, so it's somewhat easier.

template<typename T, typename U>
void deinterleave_BGR(const T* const interleaved_BGR, U** const deinterleaved_BGR, const uint32_t total_size) {

    // Cache-friendly deinterleave BGR, splitting for blocks of 256 KB, inspired by flip-block
    constexpr float round = std::is_integral_v<U> ? std::is_integral_v<T> ? 0 : 0.5f : 0;
    constexpr uint32_t block = 262144 / (3 * std::max(sizeof(T), sizeof(U)));
    const uint32_t num_blocks = std::ceil(total_size / (float)block);

    hybrid_loop(num_blocks, [&](auto n) {
        const uint32_t x = n * block;
        U* const B = deinterleaved_BGR[0] + x;
        U* const G = deinterleaved_BGR[1] + x;
        U* const R = deinterleaved_BGR[2] + x;
        const T* const interleaved_ptr = interleaved_BGR + x * 3;

        const int blockx = std::min(total_size, x + block) - x;
        for (int xx = 0; xx < blockx; ++xx)
        {
            B[xx] = interleaved_ptr[xx * 3 + 0] + round;
            G[xx] = interleaved_ptr[xx * 3 + 1] + round;
            R[xx] = interleaved_ptr[xx * 3 + 2] + round;
        }
    });

}

template<typename T, typename U>
void interleave_BGR(const U** const deinterleaved_BGR, T* const interleaved_BGR, const uint32_t total_size) {

    constexpr float round = std::is_integral_v<T> ? std::is_integral_v<U> ? 0 : 0.5f : 0;
    constexpr uint32_t block = 262144 / (3 * std::max(sizeof(T), sizeof(U)));
    const uint32_t num_blocks = std::ceil(total_size / (float)block);

    hybrid_loop(num_blocks, [&](auto n) {
        const uint32_t x = n * block;
        const U* const B = deinterleaved_BGR[0] + x;
        const U* const G = deinterleaved_BGR[1] + x;
        const U* const R = deinterleaved_BGR[2] + x;
        T* const interleaved_ptr = interleaved_BGR + x * 3;

        const int blockx = std::min(total_size, x + block) - x;
        for (int xx = 0; xx < blockx; ++xx)
        {
            interleaved_ptr[xx * 3 + 0] = B[xx] + round;
            interleaved_ptr[xx * 3 + 1] = G[xx] + round;
            interleaved_ptr[xx * 3 + 2] = R[xx] + round;
        }
    });

}

Example of deinterleave_BGR:

    std::vector<std::vector<float>> temp(3, std::vector<float>(sizes[0] * sizes[1]));
    float* BGR[3] = { temp[0].data(), temp[1].data(), temp[2].data() };
    deinterleave_BGR((const uint8_t*)padded.get(), BGR, sizes[0] * sizes[1]);
bfraboni commented 3 months ago

Hi @michelerenzullo , I forgot about this one and plan on doing a bit of update in the repo soon, so good time to discuss.

Thanks for the suggestions. I agree that OpenMP portability is not there yet, though I still feel it is the most readable option. I can try to make a generic parallel_for( ... ) version later based on yours.

However, the choice of block size shouldn't start from the total L2 cache size, because the block itself is not contiguous in memory; it is way too large to avoid cache misses. Ideally, all the image lines that are part of a block should fit in the L2 cache to have zero cache misses per block. So I think we could refine your approach to compute the block size and make an even better version. My original choice was a bit arbitrary, but small enough to not witness too many cache misses. I'm curious to compare with your version of the block size, but I expect some perf degradation. Does that make sense?