ermig1979 / Simd

C++ image processing and machine learning library with using of SIMD: SSE, AVX, AVX-512, AMX for x86/x64, VMX(Altivec) and VSX(Power7) for PowerPC, NEON for ARM.
http://ermig1979.github.io/Simd
MIT License

SimdYuv420pToBgr with different YUV & RGB image sizes #162

Closed (mikeversteeg closed 3 years ago)

mikeversteeg commented 3 years ago

Hi!

Still loving this library, it is really impressive coding.

I need to display YUV420P images as thumbnails on screen, which means they need to be converted to RGB and resized. Because this is HD video, speed is essential. Currently I first resize each of the YUV planes and then convert to RGB. However, this means an additional memory write of the (smaller) YUV image. This can be avoided by dropping the requirement that both images passed to SimdYuv420pToBgr have the same size. Can this be added? As speed is important, a simple pixel drop can be used (although if it can be added efficiently, a basic (bi)linear interpolation would be great).
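To make the request concrete, a fused conversion with pixel drop might look like the scalar sketch below. `Yuv420pToBgrNearest` is a hypothetical helper, not a Simd function, and the BT.601 limited-range coefficients are an assumption; Simd's own conversion may use different constants.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp an intermediate value to the 8-bit range.
static inline uint8_t ClampToByte(int value)
{
    return static_cast<uint8_t>(std::min(255, std::max(0, value)));
}

// Hypothetical helper (not part of Simd): converts a YUV420P frame to a smaller
// BGR image in a single pass by sampling the nearest source pixel (pixel drop).
// Assumption: BT.601 limited-range coefficients in 8.8 fixed point.
void Yuv420pToBgrNearest(const uint8_t* y, size_t yStride,
                         const uint8_t* u, size_t uStride,
                         const uint8_t* v, size_t vStride,
                         size_t srcW, size_t srcH,
                         uint8_t* bgr, size_t bgrStride,
                         size_t dstW, size_t dstH)
{
    for (size_t dy = 0; dy < dstH; ++dy)
    {
        const size_t sy = dy * srcH / dstH; // nearest source row
        uint8_t* out = bgr + dy * bgrStride;
        for (size_t dx = 0; dx < dstW; ++dx)
        {
            const size_t sx = dx * srcW / dstW; // nearest source column
            const int c = y[sy * yStride + sx] - 16;
            // Chroma planes are subsampled 2x2 in YUV420P.
            const int d = u[(sy / 2) * uStride + sx / 2] - 128;
            const int e = v[(sy / 2) * vStride + sx / 2] - 128;
            out[3 * dx + 0] = ClampToByte((298 * c + 516 * d + 128) >> 8);           // B
            out[3 * dx + 1] = ClampToByte((298 * c - 100 * d - 208 * e + 128) >> 8); // G
            out[3 * dx + 2] = ClampToByte((298 * c + 409 * e + 128) >> 8);           // R
        }
    }
}
```

Each source pixel is read at most once and nothing is written except the final BGR image, which is the property asked for here; the price is aliasing, since dropped pixels never contribute to the result.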

Thanks for considering.

ermig1979 commented 3 years ago

Hi!

In this case I would recommend consecutive calls of the function ReduceGray2x2 to quickly reduce the Y, U and V planes. At the final stage use the function ResizeArea, and only then use Yuv444pToBgr for the conversion. Although this method uses an additional buffer, it achieves maximal performance.
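For readers unfamiliar with the reduction step, the scalar sketch below shows what halving a single plane by 2x2 averaging amounts to. `ReducePlane2x2` is a hypothetical stand-in, not the library's SIMD code, and the rounding (add 2, shift by 2) is an assumption about how the average is formed.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar illustration only: halves a plane in both dimensions by averaging
// each 2x2 block with rounding. Assumes srcW and srcH are even.
void ReducePlane2x2(const uint8_t* src, size_t srcW, size_t srcH, size_t srcStride,
                    uint8_t* dst, size_t dstStride)
{
    for (size_t row = 0; row < srcH / 2; ++row)
    {
        const uint8_t* s0 = src + (2 * row) * srcStride; // upper source row
        const uint8_t* s1 = s0 + srcStride;              // lower source row
        uint8_t* d = dst + row * dstStride;
        for (size_t col = 0; col < srcW / 2; ++col)
        {
            const int sum = s0[2 * col] + s0[2 * col + 1] + s1[2 * col] + s1[2 * col + 1];
            d[col] = static_cast<uint8_t>((sum + 2) >> 2);
        }
    }
}
```

For a 1920x1080 YUV420P frame and a 480x270 target, the Y plane would go through this twice and the 960x540 U and V planes once, which leaves three 480x270 planes ready for a 4:4:4 conversion.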

mikeversteeg commented 3 years ago

But the additional memory access is a limiting factor in this case. Resizing while converting would be much faster as there would only be a single memory read and write.

ermig1979 commented 3 years ago

Yes, of course memory access is a limiting factor. But a call of ReduceGray2x2 reduces itone by a factor of 4. We use this method in our video analytics pipeline to resize Full HD Yuv420p to a smaller RGB image (480x270, for example). Mixing resizing and conversion in one algorithm is too complicated and does not give any significant performance gain.

ermig1979 commented 3 years ago

It may be useful to create a class which implements this behaviour (with a name something like YuvToThumbnail).

mikeversteeg commented 3 years ago

Avoiding the additional memory access will improve performance, I have already established this. Note display can still be full size, although typically it will be around 25%. For a dozen streams that's around 0.5 GB/s for each single memory write. It is easy to exceed memory bandwidth..

ermig1979 commented 3 years ago

If the temporary buffer size is less than L3 cache size / thread count, then this is not important. See the matrix multiplication algorithm (it widely uses data reordering into a temporary buffer to achieve maximum performance).

As I see it, the main bottleneck is loading the source YUV420P image (80% of memory access). Subsequent loading gives about 20%, and less if the buffer fits into L3 cache.

A 1920x1080 YUV420P frame has a size of about 3 MB; the temporary buffer is less than 1 MB (it is less than the L3 cache per thread).
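The sizes quoted here can be checked with a few lines of arithmetic. The sketch assumes 8-bit samples and assumes the temporary buffer holds three full-resolution planes at the reduced size; both helper names are hypothetical.

```cpp
#include <cstddef>

// Bytes in a YUV420P frame: a full-size Y plane plus two chroma planes
// at half width and half height (rounded up for odd dimensions).
constexpr size_t Yuv420pBytes(size_t w, size_t h)
{
    return w * h + 2 * ((w + 1) / 2) * ((h + 1) / 2);
}

// Bytes in three full-resolution planes (YUV444P), assumed here to be the
// layout of the intermediate buffer before the BGR conversion.
constexpr size_t Yuv444pBytes(size_t w, size_t h)
{
    return 3 * w * h;
}

static_assert(Yuv420pBytes(1920, 1080) == 3110400, "source frame is about 3 MB");
static_assert(Yuv444pBytes(480, 270) == 388800, "480x270 intermediate is under 0.4 MB");
static_assert(Yuv444pBytes(960, 540) == 1555200, "960x540 intermediate is about 1.5 MB");
```

Under these assumptions the 480x270 case stays well below 1 MB, while a half-size (960x540) intermediate would be roughly 1.5 MB, so whether it fits in the per-thread share of L3 depends on the target size and the CPU.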

mikeversteeg commented 3 years ago

Barely, at 50% size it already exceeds the L3 cache size on e.g. an 8-core Xeon 2286m. But you're also assuming your thread isn't interrupted, which you cannot guarantee AFAIK? My app runs hundreds of threads so there is a lot of context switching going on.

Anyway, I can only ask and appreciate the work you do.

mikeversteeg commented 3 years ago

I do not fully understand the documentation, but I understand you say that

  1. call n * 3 ReduceGray2x2
  2. call ResizeArea
  3. call Yuv444pToBgr

is faster (and presumably better) than

  1. call 3 * SimdResizeBilinear
  2. call SimdYuv420pToBgr

correct? I would not expect that.

Would have loved to try it out but unfortunately ResizeArea is not in SimdLib.h which I have been using. I updated not too long ago.

mikeversteeg commented 3 years ago

How does SimdResizeGray2x2 compare? I don't know how it works, the documentation is minimal.

ermig1979 commented 3 years ago

The method described above is used for resizing from a big YUV420P image to a small BGR one for the purpose of object detection or recognition. It gives the most accurate reduced image. If the quality of the reduced image is not so important, you may of course use your method and it will be faster. The first note: I would recommend resizing the Y, U and V planes to the same size and then using the function SimdYuv444pToBgr. The second note: if the reduction coefficient is too large, then the function ResizeBilinear gives a result close to ResizeNearest (it is not implemented). Itone can have poor quality.
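The second note can be illustrated in one dimension. In the sketch below, both functions are hypothetical scalar stand-ins rather than Simd's implementations: the bilinear version blends only the two source samples around each output centre, while the area version averages every source sample the output pixel covers.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// 1D bilinear resize: each output blends the two source samples that
// surround its centre, so at large reductions most samples are ignored.
std::vector<uint8_t> ResizeRowBilinear(const std::vector<uint8_t>& src, size_t dstW)
{
    std::vector<uint8_t> dst(dstW);
    const double scale = static_cast<double>(src.size()) / dstW;
    for (size_t i = 0; i < dstW; ++i)
    {
        const double pos = std::max(0.0, (i + 0.5) * scale - 0.5);
        const size_t i0 = static_cast<size_t>(pos);
        const size_t i1 = std::min(i0 + 1, src.size() - 1);
        const double f = pos - i0;
        dst[i] = static_cast<uint8_t>(src[i0] * (1.0 - f) + src[i1] * f + 0.5);
    }
    return dst;
}

// 1D area resize for an integer factor: each output is the rounded mean
// of all source samples it covers. Assumes src.size() is a multiple of dstW.
std::vector<uint8_t> ResizeRowArea(const std::vector<uint8_t>& src, size_t dstW)
{
    std::vector<uint8_t> dst(dstW);
    const size_t factor = src.size() / dstW;
    for (size_t i = 0; i < dstW; ++i)
    {
        size_t sum = 0;
        for (size_t k = 0; k < factor; ++k)
            sum += src[i * factor + k];
        dst[i] = static_cast<uint8_t>((sum + factor / 2) / factor);
    }
    return dst;
}
```

Reducing the row `{240, 0, 0, 0, 0, 0, 0, 0}` to a single pixel shows the difference: the bilinear version only looks at the two middle samples and misses the bright pixel entirely, while the area version spreads it into the average.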

mikeversteeg commented 3 years ago

Thank you. Indeed ResizeBilinear gives distorted pixels so I cannot use it, I need something else (that is available in SimdLib and fast). Any ideas? SimdStretchGray2x2?

I am not familiar with the word "ltone", what does it mean?

You should add a Donate button, thanks for the quick replies and assistance!

ermig1979 commented 3 years ago

Simd::Resize with parameter SimdResizeMethodArea gives the best result.

"itone" is my mistake, I meant "this one".

I had a donate "button" in my past project (AntiDupl) and it gave about 37 dollars in 13 years. I am scared to add this button - I can't carry the weight of such a big amount of cash :)

mikeversteeg commented 3 years ago

Can you give me an email address for a PayPal payment? I always donate for free software, helps me sleep.

ermig1979 commented 3 years ago

Thanks. I very much appreciate that you want to donate to my project. It does not mean that I don't appreciate money, but I think that the best donation for an open source project is a public mention of it. Unless, of course, this makes it difficult for you.

mikeversteeg commented 3 years ago

Your support goes above and beyond what is required for an open source project. Nonetheless I thankfully added it to my Credits section (http://help.vidblasterx6.com/CreditsDisclaimer.html).

ermig1979 commented 3 years ago

Thanks!

mikeversteeg commented 3 years ago

Regarding SimdResizerRun, I am not sure what these channels are, are these 3 last parameters correct?

pResizeContext = SimdResizerInit(widthin, heightin, widthout, heightout, 1, SimdResizeChannelByte, SimdResizeMethodArea);

ermig1979 commented 3 years ago

Yes, these are the correct parameters for resizing the Y, U and V planes.