ermig1979 / Simd

C++ image processing and machine learning library using SIMD: SSE, AVX, AVX-512, AMX for x86/x64, VMX (Altivec) and VSX (Power7) for PowerPC, NEON for ARM.
http://ermig1979.github.io/Simd
MIT License

UyvyToGray8 conversion #188

Closed trlsmax closed 2 years ago

trlsmax commented 2 years ago

I want to convert a UYVY buffer from a V4L2 device to a Gray8 View. This function takes about 40 ms on my Raspberry Pi 4:

void UyvyToGray8(const uint8_t* src, View& view)
{
     for (size_t row = 0; row < view.height; row++) {
        for (size_t col = 0; col < view.width; col++) {
            view.At<uint8_t>(col, row) = src[(row * view.width + col) * 2 + 1];
        }
    }
}

So I tried a SIMD version:

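void UyvyToGray8(const uint8_t* src, View& view)
{
    // Treats source and destination as contiguous buffers: row padding
    // (view.stride) is not taken into account here, and pixels beyond the
    // last full group of 16 are not written.
    size_t num8x16 = view.width * view.height / 16;
    uint8x16x2_t tmp;
    for (size_t i = 0; i < num8x16; i++) {
        // De-interleave 32 UYVY bytes: val[0] = U/V, val[1] = Y.
        tmp = vld2q_u8(src + 2 * 16 * i);
        vst1q_u8(view.data + 16 * i, tmp.val[1]);
        // also tried this, fail
        //uint8_t * _p = (uint8_t *)__builtin_assume_aligned(view.data + 16 * i, 16);
        //vst1q_u8(_p, tmp.val[1]);
    }
}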

This version crashes the program. Any ideas?

ermig1979 commented 2 years ago

In the general case, YUV to gray conversion looks like this:

g = (Y - 16) * 255 / 220

Or am I mistaken?
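
For reference, the formula above can be written as a small scalar helper. This is only a sketch: the name VideoRangeToFullRange and the span parameter are made up for illustration and are not part of Simd. The default span of 220 is the divisor quoted above; the nominal BT.601 video-range luma interval is 16..235, a span of 219, so the value is left adjustable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a Simd function): stretches video-range luma
// to full-range gray using g = (Y - 16) * 255 / span, then clamps to 0..255.
inline uint8_t VideoRangeToFullRange(uint8_t y, int span = 220)
{
    assert(span > 0);
    // Integer arithmetic; the division truncates toward zero.
    int g = (static_cast<int>(y) - 16) * 255 / span;
    // Inputs below 16 or above 16 + span fall outside 0..255, so clamp.
    return static_cast<uint8_t>(std::min(255, std::max(0, g)));
}
```

Copying Y unchanged, as in the code in this thread, skips this stretch, so black stays at 16 instead of 0.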

P.S. Your optimization looks valid. For example, see the function Simd::Neon::DeinterleaveUv.

trlsmax commented 2 years ago

You are right, but using only Y is fine for my application. I finally came up with something like this:

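// A is the NEON register width in bytes and DA is 2 * A (Simd constants).
template <bool align> void UyvyToGray8(const uint8_t * data, size_t uvStride, size_t width, size_t height, uint8_t * gray, size_t gStride)
{
    assert(width >= A);
    if (align)
    {
        // Store<align> writes to gray, so the output must be aligned too.
        assert(Aligned(data) && Aligned(uvStride) && Aligned(gray) && Aligned(gStride));
    }

    size_t bodyWidth = AlignLo(width, A);
    size_t tail = width - bodyWidth;
    for (size_t row = 0; row < height; ++row)
    {
        // Each step reads DA bytes of UYVY and writes A bytes of Y.
        for (size_t col = 0, offset = 0; col < bodyWidth; col += A, offset += DA)
        {
            uint8x16x2_t _uv = Load2<align>(data + offset);
            Store<align>(gray + col, _uv.val[1]);
        }
        if (tail)
        {
            // Redo the last A pixels of the row with an unaligned,
            // overlapping load/store instead of a scalar remainder loop.
            size_t col = width - A;
            size_t offset = 2 * col;
            uint8x16x2_t _uv = Load2<false>(data + offset);
            Store<false>(gray + col, _uv.val[1]);
        }
        data += uvStride;
        gray += gStride;
    }
}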

Now it takes only 5 ms to process. Thank you.
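
For anyone who wants to try the same idea outside the library, here is a minimal self-contained sketch of the approach above. It makes a few assumptions: the name UyvyToGray8Sketch is made up, both strides are row pitches in bytes, and the Simd helpers (Load2, Store, AlignLo) are replaced with raw NEON intrinsics. A scalar loop handles the row remainder, which removes the width >= 16 requirement, and it also serves as the fallback when NEON is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical standalone function (not part of Simd): copies the Y bytes of
// a UYVY image (byte order U0 Y0 V0 Y1) into an 8-bit gray plane, row by row.
inline void UyvyToGray8Sketch(const uint8_t* src, size_t srcStride,
                              size_t width, size_t height,
                              uint8_t* dst, size_t dstStride)
{
    // A UYVY row takes 2 bytes per pixel, a gray row 1 byte per pixel.
    assert(srcStride >= 2 * width && dstStride >= width);
    for (size_t row = 0; row < height; ++row)
    {
        size_t col = 0;
#if defined(__ARM_NEON)
        // vld2q_u8 de-interleaves 32 bytes into two 16-byte registers:
        // val[0] holds the chroma bytes, val[1] holds 16 luma samples.
        for (; col + 16 <= width; col += 16)
        {
            uint8x16x2_t uyvy = vld2q_u8(src + 2 * col);
            vst1q_u8(dst + col, uyvy.val[1]);
        }
#endif
        // Scalar path: the row remainder on NEON, the whole row elsewhere.
        for (; col < width; ++col)
            dst[col] = src[2 * col + 1];
        // Advance by the row pitch, which may exceed the pixel data size.
        src += srcStride;
        dst += dstStride;
    }
}
```

With a Simd View as the destination, the call would presumably look like UyvyToGray8Sketch(src, 2 * view.width, view.width, view.height, view.data, view.stride), assuming the V4L2 buffer has no row padding.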