WangYongHai / libyuv

Automatically exported from code.google.com/p/libyuv
BSD 3-Clause "New" or "Revised" License

ScalePlaneDown2 should be efficient for odd widths #314

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Especially on ARM with NEON, it would be significantly cheaper to run 
ScaleRowDown2 out to a properly aligned dst_stride (operating over junk pixels) 
than to stop exactly at an unaligned dst_width. A caller that knows this can 
lie about dst_width and pass dst_stride in its place when calling the function, 
but that shouldn't be necessary.

Original issue reported on code.google.com by noah...@google.com on 14 Feb 2014 at 7:54

GoogleCodeExporter commented 9 years ago
To know whether a libyuv function will take the fast path, a rule of thumb is 
that the width needs to be a multiple of 16, and the image pointer/stride 
should also be aligned to 16.
The pointer/stride alignment is less of a concern on NEON and AVX2, but most 
functions are optimized for aligned widths.

Your suggested solution is overread/overwrite.  libyuv will allow you to do 
that, and it's a good solution.
Allocate extra pad bytes for rows and/or images.
Conversion functions will check if width == stride and treat the image as one 
large row.  You can do that yourself by padding the total out to 
(width * height + 15) & ~15.

Scaling can't be row coalesced, but you can allocate aligned rows.
Allocate buffers with stride = (width + 15) & ~15; and image_size = stride * 
height;
In the case of scaling to 1/2, the destination needs to be a multiple of 16, so 
the source would be a multiple of 32.
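
For example, here is a sketch of allocating an aligned destination for 1/2 
scaling and calling ScalePlane. The ScalePlane signature and FilterMode 
constant are from my reading of libyuv/scale.h and may differ by version, so 
treat this as an assumption rather than a verbatim API reference.

#include <cstdint>
#include <vector>
#include "libyuv/scale.h"

int main() {
  const int src_width = 1280, src_height = 720;
  const int dst_width = src_width / 2, dst_height = src_height / 2;
  // Round each stride up to a multiple of 16 so every row starts aligned
  // and the SIMD row functions can write into the pad bytes safely.
  const int src_stride = (src_width + 15) & ~15;
  const int dst_stride = (dst_width + 15) & ~15;
  std::vector<uint8_t> src(src_stride * src_height);
  std::vector<uint8_t> dst(dst_stride * dst_height);
  // Assumed argument order: src, src_stride, src_width, src_height,
  //                         dst, dst_stride, dst_width, dst_height, filter.
  libyuv::ScalePlane(src.data(), src_stride, src_width, src_height,
                     dst.data(), dst_stride, dst_width, dst_height,
                     libyuv::kFilterBox);
  return 0;
}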

I did experiment with overreads/overwrites, but it was deemed unsafe.  So the 
two solutions I've come up with are 'any' functions and row coalescing.
'Any' functions on Intel still prefer an aligned pointer, but handle any width 
by doing the multiple-of-16 portion and then handling the remainder.  Most 
handle the remainder using C code, but some functions redo work on the last 16 
pixels, which is an overread/overwrite of data already processed, but stays 
within the row.
This is supported for conversions, but not scaling.
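
A sketch of that "redo the last 16 pixels" trick, with hypothetical function 
names standing in for real row functions (the stand-in just copies bytes):

#include <cstdint>
#include <cstring>

// Stand-in for a SIMD row function: processes exactly 16 pixels per
// iteration, so width must be a multiple of 16 (here it is a plain copy).
static void ConvertRow_SIMD16(const uint8_t* src, uint8_t* dst, int width) {
  std::memcpy(dst, src, width);
}

static void ConvertRow_Any(const uint8_t* src, uint8_t* dst, int width) {
  const int n = width & ~15;  // largest multiple of 16 <= width
  if (n > 0) {
    ConvertRow_SIMD16(src, dst, n);
  }
  if (width > n && width >= 16) {
    // Redo the final 16 pixels of the row: a few already-processed pixels
    // are rewritten with the same values, but nothing outside the row is
    // read or written, so no extra padding is needed.
    ConvertRow_SIMD16(src + width - 16, dst + width - 16, 16);
  }
  // A width under 16 would fall back to plain C code (omitted here).
}

int main() {
  uint8_t src[23], dst[23];
  for (int i = 0; i < 23; ++i) src[i] = static_cast<uint8_t>(i);
  ConvertRow_Any(src, dst, 23);  // 16 aligned pixels + overlapped last 16
  return 0;
}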

The unittests check for overread/write by allocating images at the end of a 
page, and are run through valgrind.
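
A sketch of the end-of-page allocation idea, using POSIX mmap/mprotect; this is 
only an illustration of the technique, not the actual unittest code:

#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

// Place a buffer so that it ends exactly at a PROT_NONE guard page; any
// read or write past the end of the buffer then faults immediately.
static uint8_t* AllocAtEndOfPage(size_t size) {
  const size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  const size_t data_pages = (size + page - 1) / page;
  const size_t total = (data_pages + 1) * page;  // +1 guard page
  void* raw = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (raw == MAP_FAILED) return nullptr;
  uint8_t* base = static_cast<uint8_t*>(raw);
  // Make the last page inaccessible.
  mprotect(base + data_pages * page, page, PROT_NONE);
  // Return a pointer such that buffer + size lands on the guard page.
  return base + data_pages * page - size;
}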

So the action item here is to implement scale_any.cc, which has a wrapper for 
each scale row function that handles odd sizes.
It's not hard, and it may even exist already for 1/2 size, since that comes up 
in conversions/effects.
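
A sketch of what such a wrapper could look like for ScaleRowDown2. The 
row-function signature follows my reading of the C version in scale_common.cc 
and the SIMD function is a stand-in, so treat the details as assumptions:

#include <cstddef>
#include <cstdint>

// C reference row function: point-samples every other source pixel
// (the 'None' filter behaves this way; Box/Bilinear variants average).
static void ScaleRowDown2_C(const uint8_t* src_ptr, ptrdiff_t src_stride,
                            uint8_t* dst, int dst_width) {
  (void)src_stride;
  for (int x = 0; x < dst_width; ++x) {
    dst[x] = src_ptr[x * 2 + 1];
  }
}

// Stand-in for the SIMD version; requires dst_width to be a multiple of 16.
static void ScaleRowDown2_SIMD(const uint8_t* src_ptr, ptrdiff_t src_stride,
                               uint8_t* dst, int dst_width) {
  ScaleRowDown2_C(src_ptr, src_stride, dst, dst_width);
}

// 'Any' wrapper: SIMD on the multiple-of-16 part, C on the odd remainder.
static void ScaleRowDown2_Any(const uint8_t* src_ptr, ptrdiff_t src_stride,
                              uint8_t* dst, int dst_width) {
  const int n = dst_width & ~15;
  if (n > 0) {
    ScaleRowDown2_SIMD(src_ptr, src_stride, dst, n);
  }
  if (dst_width > n) {
    // Each destination pixel consumes 2 source pixels at 1/2 scale.
    ScaleRowDown2_C(src_ptr + n * 2, src_stride, dst + n, dst_width - n);
  }
}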

Original comment by fbarch...@chromium.org on 21 Feb 2014 at 11:27

GoogleCodeExporter commented 9 years ago
The best long-term solution will be to allow pointers to be unaligned (albeit 
slower) and to allow width and/or stride to be 'any'.

Another user suggested an 'overread' mode, which was tried in the past.  It's 
efficient but dangerous, so the 'any' approach is preferred.
Row coalescing was also added to allow contiguous images to be handled 
efficiently.

Changing the nature of this bug to efficient odd-width scaling support.

Original comment by fbarch...@google.com on 28 Jul 2014 at 10:01

GoogleCodeExporter commented 9 years ago
ScalePlaneDown2 is also the hottest function in profiles of the scaler, and 
should be AVX2 optimized.

Original comment by fbarch...@google.com on 27 Nov 2014 at 1:49

GoogleCodeExporter commented 9 years ago
r1345 supports odd width ScalePlaneDown2

set LIBYUV_WIDTH=1276 
set LIBYUV_HEIGHT=720 
set LIBYUV_REPEAT=3999 
set LIBYUV_FLAGS=-1 

out\release\libyuv_unittest_old --gtest_filter=*.ScaleDownBy2*   | findstr ms 
[       OK ] libyuvTest.ScaleDownBy2_None (1157 ms)
[       OK ] libyuvTest.ScaleDownBy2_Linear (2422 ms)
[       OK ] libyuvTest.ScaleDownBy2_Bilinear (2891 ms)
[       OK ] libyuvTest.ScaleDownBy2_Box (2891 ms)

out\release\libyuv_unittest --gtest_filter=*.ScaleDownBy2*   | findstr ms 
[       OK ] libyuvTest.ScaleDownBy2_None (422 ms)
[       OK ] libyuvTest.ScaleDownBy2_Linear (484 ms)
[       OK ] libyuvTest.ScaleDownBy2_Bilinear (625 ms)
[       OK ] libyuvTest.ScaleDownBy2_Box (625 ms)

set LIBYUV_WIDTH=1280 
set LIBYUV_HEIGHT=720 
set LIBYUV_REPEAT=3999 
set LIBYUV_FLAGS=-1 

out\release\libyuv_unittest --gtest_filter=*.ScaleDownBy2*   | findstr ms 
[       OK ] libyuvTest.ScaleDownBy2_None (343 ms)
[       OK ] libyuvTest.ScaleDownBy2_Linear (407 ms)
[       OK ] libyuvTest.ScaleDownBy2_Bilinear (500 ms)
[       OK ] libyuvTest.ScaleDownBy2_Box (500 ms)

Original comment by fbarch...@chromium.org on 26 Mar 2015 at 6:11

GoogleCodeExporter commented 9 years ago
ScalePlaneDown2 ported to AVX2
Was ScaleDownBy2_Box (500 ms)
Now ScaleDownBy2_Box (437 ms)

and for odd widths
Was ScaleDownBy2_Box (2890 ms)
Now ScaleDownBy2_Box (625 ms)

The known case where half size is slow is when it's not exactly half.

set LIBYUV_WIDTH=1276
ScaleDownBy2_Box (752 ms)
ScaleDownBy2_Bilinear (741 ms)
ScaleDownBy2_None (666 ms)
ScaleDownBy2_Linear (628 ms)

set LIBYUV_WIDTH=1278
ScaleDownBy2_Bilinear (1712 ms)
ScaleDownBy2_None_16 (1510 ms)
ScaleDownBy2_Linear (1395 ms)
ScaleDownBy2_None (1086 ms)

This is because the chroma plane has an odd width (1278 / 2 = 639), and scaling 
it to half size (639 down to 320) gives a scale factor of 639 / 320 = 1.996875 
rather than exactly 2.

Original comment by fbarch...@chromium.org on 26 Mar 2015 at 10:55