floe / backscrub

Virtual Video Device for Background Replacement with Deep Semantic Segmentation
Apache License 2.0

Optimization potential #138

Open BenBE opened 2 years ago

BenBE commented 2 years ago

Doing some rudimentary profiling, I noticed we are wasting a huge amount of time on the main thread in essentially three functions:

  1. alpha_blend (~40%)
  2. convert_rgb_to_yuyv (~25%)
  3. cap.retrieve (~20%)

Values are relative to the time spent in main, which takes up roughly the same share of total runtime as the actual image processing (main ~32.75%, bs_maskgen_process ~33.95% of total runtime).

On the positive side: bs_maskgen_process spends ~95% of its time waiting for TFLite to finish processing.

FWIW: the timings are heavily affected by running under callgrind, but looking at the code of alpha_blend and convert_rgb_to_yuyv I'm not really surprised by these results. A rewrite of these functions using some vectorization should yield quite a bit of improvement.

floe commented 2 years ago

Ah, that does sound like SSE and friends could do a good job speeding things up. I guess the most portable way to do that is using gcc intrinsics?

BenBE commented 2 years ago

Either GCC intrinsics (as portable as they are), or we look into whether OpenCV/XNN already offers optimized versions of those two methods that we could use.

phlash commented 2 years ago

I'm aware that those were written because OpenCV doesn't do them.. but happy to be proven wrong if we up the minimum OCV version :smile:

Found this Q&A: https://answers.opencv.org/question/211941/how-to-use-simd-feature-of-opencv-as-third-party-library/

I'm also aware that OCV will try to use OpenCL and GPU if they are available (it collided with TFLite when used in multiple threads!).. more research needed!

[edited to add] Found this, which looks close to what we need for alpha_blend: https://stackoverflow.com/questions/13128696/which-is-the-most-efficient-way-to-do-alpha-mask-in-opencv

There is also libvolk for portable SIMD kernels (as used by Gnuradio and others), which is a standard package on many Linuxen.
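
As an illustration of the libvolk route, the blend over float buffers could be expressed with its generic kernels roughly like this (a sketch only; fg, bg, mask, inv_mask, tmp_fg, tmp_bg, out are placeholder float buffers of n elements each, and inv_mask = 1 - mask is assumed to be computed elsewhere):

#include <volk/volk.h>

// Sketch: out = fg * mask + bg * (1 - mask), element-wise over n floats.
volk_32f_x2_multiply_32f(tmp_fg, fg, mask, n);      // tmp_fg = fg * mask
volk_32f_x2_multiply_32f(tmp_bg, bg, inv_mask, n);  // tmp_bg = bg * (1 - mask)
volk_32f_x2_add_32f(out, tmp_fg, tmp_bg, n);        // out = tmp_fg + tmp_bg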

BenBE commented 2 years ago

I don't have any objections to using a library for those particular vectorized kernels, as doing those optimizations ourselves would be more work than sensible.

We should also check whether other places in the current code base (e.g. the conv_bias stuff) could benefit from using such a library.

martok commented 2 years ago

Just talked with @BenBE, time for my regularly scheduled one-off maybe-useful comment :)

alpha_blend (~40%)

OpenCV has a well-hidden function for that, blendLinear. The only difficulty here is that it requires both the mask and the anti-mask, as float. So the call would look like cv::blendLinear(bg, raw, mask, Scalar::all(1.0)-mask, raw).

I don't actually know if this will be much faster, but they do have vectorised and OCL implementations.
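
For reference, a minimal sketch of what that call could look like if the mask starts out as a single-channel uint8_t image (variable names bg, raw, mask follow the call above; the conversion step is an assumption about how the mask is stored):

// Sketch (untested): convert the uint8_t mask to float once, build the
// anti-mask, then let OpenCV do the weighted blend in place.
cv::Mat mask_f, inv_f;
mask.convertTo(mask_f, CV_32FC1, 1.0 / 255.0);        // 0..255 -> 0.0..1.0
cv::subtract(cv::Scalar::all(1.0), mask_f, inv_f);    // anti-mask = 1 - mask
cv::blendLinear(bg, raw, mask_f, inv_f, raw);         // raw = bg*mask + raw*(1-mask)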

convert_rgb_to_yuyv (~25%)

Really surprised that there is no convert_rgb_to_yuyv equivalent in cv, but it seems that unrolling the loop a few times in a way that explicitly looks like interleaved moves could be enough to get the compiler to generate the vectorised version without manual trickery.
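
A rough sketch of the kind of unrolled scalar loop meant here (fixed-point BT.601 coefficients, function and variable names are illustrative assumptions, not backscrub's actual code; chroma is taken from the first pixel of each pair):

static inline uint8_t clamp_u8(int v) { return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v); }

void rgb_to_yuyv_row(const uint8_t *rgb, uint8_t *yuyv, int width) {
    for (int x = 0; x < width; x += 2) {   // two RGB pixels -> one 4-byte YUYV macro-pixel
        int r0 = rgb[0], g0 = rgb[1], b0 = rgb[2];
        int r1 = rgb[3], g1 = rgb[4], b1 = rgb[5];
        int y0 = ( 77 * r0 + 150 * g0 +  29 * b0) >> 8;
        int y1 = ( 77 * r1 + 150 * g1 +  29 * b1) >> 8;
        int u  = ((-43 * r0 -  85 * g0 + 128 * b0) >> 8) + 128;
        int v  = ((128 * r0 - 107 * g0 -  21 * b0) >> 8) + 128;
        yuyv[0] = clamp_u8(y0); yuyv[1] = clamp_u8(u);
        yuyv[2] = clamp_u8(y1); yuyv[3] = clamp_u8(v);
        rgb += 6; yuyv += 4;
    }
}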

phlash commented 2 years ago

OK! Thanks for the pointer towards cv::blendLinear, I've just tried it and the performance gain seems marginal (~0.8× the previous run time). Digging in, I find that ~1/3 of the time is converting/inverting the mask, the rest blending, so it seems blendLinear is not implicitly using SIMD/OCL? Going back to the local loop in a separate source file and playing with GCC intrinsics...

BenBE commented 2 years ago

If the initial conversion from uint8_t to float takes that much time, there is a good chance we can avoid it when generating the mask in the first place: the output from TFLite is already float, so we currently spend time creating the uint8_t version and could simply skip that step …
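
A sketch of what skipping that step could look like, assuming the segmentation output is a single-channel float map (interpreter, h, w are placeholder names, not the actual variables in the code base):

// Wrap the TFLite float output directly as a CV_32FC1 Mat: no copy, no uint8_t round trip.
float *out = interpreter->typed_output_tensor<float>(0);
cv::Mat mask_f(h, w, CV_32FC1, out);   // valid only while the tensor buffer stays alive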

phlash commented 2 years ago

Yep - once we get down to shaving off a few percent that will be worth it; I'm looking for a 10x improvement through SIMD first :smile:

phlash commented 2 years ago

So: I tried MMX, loading one pixel at a time (3 channels) into an __m64, executing the multiplies, adds and a shift to divide down, then extracting the pixel back to memory - at least 5x slower than the plain loop :disappointed:

I went back to the original loop code, separated it into its own source file and applied -march=native -O9: 3-4x faster :smile: This will do for me - it's easier than SIMD, portable and guaranteed not to be worse than what we have?

BenBE commented 2 years ago

Doing some proper calculations, it's quite obvious from the hardware architecture alone that loading 3 bytes at a time into one MMX register for processing is not going to cut it: you get tons of unaligned memory accesses and half the pipeline usually sits idle. Given the size restrictions introduced by MJPG from the capture device, we know we'll always have a multiple of 16 pixels*, so we could unroll the per-pixel operation 16x and process 48 bytes at a time.

If we then use some SSE magic, we get an inner kernel that could perform this blending operation needing just 7 registers**.

// Setup:
uint8_t *a, *b; // Source images (foreground and background)
uint8_t *c;     // Mask image (one byte per pixel)
uint8_t *d;     // Destination image

// Kernel (advance by a+=48; b+=48; c+=16; d+=48; per iteration; prefetch linear)
xmm0 = *(a + 0);
xmm1 = *(a + 16);
xmm2 = *(a + 32);
separateRGB(xmm0, xmm1, xmm2 --(xmm6)--> xmm0, xmm1, xmm2);

xmm3 = *(b + 0);
xmm4 = *(b + 16);
xmm5 = *(b + 32);
separateRGB(xmm3, xmm4, xmm5 --(xmm6)--> xmm3, xmm4, xmm5);

xmm6 = *(c + 0); // ***

xmm0 = xmm0 * xmm6 >> 8;
xmm1 = xmm1 * xmm6 >> 8;
xmm2 = xmm2 * xmm6 >> 8;

xmm6 = ~xmm6;

xmm3 = xmm3 * xmm6 >> 8;
xmm4 = xmm4 * xmm6 >> 8;
xmm5 = xmm5 * xmm6 >> 8;

xmm0 += xmm3;
xmm1 += xmm4;
xmm2 += xmm5;

mergeRGB(xmm0, xmm1, xmm2 --(xmm6)--> xmm0, xmm1, xmm2);
*(d + 0) = xmm0;
*(d + 16) = xmm1;
*(d + 32) = xmm2;

Proper pipelining of the instructions inside the kernel loop should allow for the actual memory fetches and register assignments to run with minimal congestion.

** Technically we have 8 available for SSE, but with clever ordering of things you might be able to get down to only requiring 7 to hold all (intermediate) data.

*** If you happen to get this move scheduled before loading the second image you could start linear-blending the source image while still loading the data for the target image.

phlash commented 2 years ago

Sounds like you know waay more than I ever will @BenBE :smile: I've never looked into SIMD stuff before now, so my feeble attempt was a cut/paste/fixup of an example MMX alpha blend function from Stack Overflow.

My current PR in #139 reduces the blend time to just under 1ms for the majority of executions (I'm simply using the execution timing built into the app). I would be delighted to see SIMD code based on the above analysis; then we can decide whether the additional complexity is worth the time saving.

BenBE commented 2 years ago

Just noticed you could even skip the channel reordering, because the pixel scaling is uniform across all channels, which makes the kernel even simpler …

Will have to look into the actual GCC intrinsic stuff though.
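
For what it's worth, a minimal sketch of what the blend core could look like with plain SSE2 intrinsics, assuming the mask bytes have already been replicated per channel so they line up with the interleaved RGB bytes (function and variable names are illustrative, not a proposal for the final code):

#include <emmintrin.h>  // SSE2
#include <cstdint>

// Blend 16 bytes of foreground/background with a per-byte mask (255 = foreground).
static inline __m128i blend16(__m128i fg, __m128i bg, __m128i mask) {
    const __m128i zero = _mm_setzero_si128();
    const __m128i inv  = _mm_sub_epi8(_mm_set1_epi8((char)0xFF), mask);
    // Widen to 16 bit so the per-byte multiply cannot overflow.
    __m128i fg_lo = _mm_unpacklo_epi8(fg, zero),   fg_hi = _mm_unpackhi_epi8(fg, zero);
    __m128i bg_lo = _mm_unpacklo_epi8(bg, zero),   bg_hi = _mm_unpackhi_epi8(bg, zero);
    __m128i m_lo  = _mm_unpacklo_epi8(mask, zero), m_hi  = _mm_unpackhi_epi8(mask, zero);
    __m128i i_lo  = _mm_unpacklo_epi8(inv, zero),  i_hi  = _mm_unpackhi_epi8(inv, zero);
    // out = (fg*mask + bg*(255-mask)) >> 8, then narrow back to bytes with saturation.
    __m128i lo = _mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(fg_lo, m_lo),
                                              _mm_mullo_epi16(bg_lo, i_lo)), 8);
    __m128i hi = _mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(fg_hi, m_hi),
                                              _mm_mullo_epi16(bg_hi, i_hi)), 8);
    return _mm_packus_epi16(lo, hi);
}

Used with _mm_loadu_si128/_mm_storeu_si128 over 16-byte chunks, this sidesteps both the unaligned-access problem and the separateRGB/mergeRGB shuffles discussed above.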