almondyoung / libyuv

Automatically exported from code.google.com/p/libyuv
BSD 3-Clause "New" or "Revised" License

BoxFilter performance #425

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Scale functions with the box filter should:
1. be optimized for AVX2
2. support odd widths
3. support heights of 1 without falling back to C

Also consider processing a row at a time instead of columns.
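
For reference, the behaviour asked for above can be sketched in plain C. This is a minimal illustration, not libyuv's code: the function name `BoxDownscalePlaneSketch` and its parameters are hypothetical. It derives each source box from integer arithmetic, so odd widths and a source height of 1 need no special case.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference box filter (not libyuv's implementation).
 * Each destination pixel is the rounded average of its source box.
 * Only downscaling (dst <= src in both dimensions) is supported. */
static void BoxDownscalePlaneSketch(const uint8_t* src, int src_stride,
                                    int src_width, int src_height,
                                    uint8_t* dst, int dst_stride,
                                    int dst_width, int dst_height) {
  assert(dst_width > 0 && dst_width <= src_width);
  assert(dst_height > 0 && dst_height <= src_height);
  for (int dy = 0; dy < dst_height; ++dy) {
    /* Source rows [y0, y1) feed destination row dy. */
    int y0 = (int)((int64_t)dy * src_height / dst_height);
    int y1 = (int)((int64_t)(dy + 1) * src_height / dst_height);
    for (int dx = 0; dx < dst_width; ++dx) {
      /* Source columns [x0, x1) feed destination column dx. */
      int x0 = (int)((int64_t)dx * src_width / dst_width);
      int x1 = (int)((int64_t)(dx + 1) * src_width / dst_width);
      uint64_t sum = 0;
      for (int y = y0; y < y1; ++y) {
        for (int x = x0; x < x1; ++x) {
          sum += src[(int64_t)y * src_stride + x];
        }
      }
      uint64_t count = (uint64_t)(y1 - y0) * (uint64_t)(x1 - x0);
      dst[(int64_t)dy * dst_stride + dx] = (uint8_t)((sum + count / 2) / count);
    }
  }
}
```

Because the destination is never larger than the source, every box covers at least one source pixel, which is the condition the box filter needs.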

Original issue reported on code.google.com by fbarch...@chromium.org on 13 Apr 2015 at 6:25

GoogleCodeExporter commented 8 years ago
r1366 changes sse2 to allow height = 1.
set LIBYUV_WIDTH=1920
set LIBYUV_HEIGHT=1080
set LIBYUV_REPEAT=1000
out\release\libyuv_unittest.exe --gtest_filter=*.ScaleTo640* | findstr ms
Was
ScaleTo640x360_None (245 ms)
ScaleTo640x360_Linear (225 ms)
ScaleTo640x360_Bilinear (201 ms)
ScaleTo640x360_Box (1476 ms)

Now
ScaleTo640x360_None (255 ms)
ScaleTo640x360_Linear (244 ms)
ScaleTo640x360_Bilinear (202 ms)
ScaleTo640x360_Box (1460 ms)

Original comment by fbarch...@chromium.org on 13 Apr 2015 at 6:57

GoogleCodeExporter commented 8 years ago
r1367 adds AVX2 box filter
For 640x3600 to 640x360:

Was SSE2
[ RUN      ] libyuvTest.ScaleTo640x360_Box
filter 3 -     5101 us C -     1003 us OPT
[       OK ] libyuvTest.ScaleTo640x360_Box (1063 ms)

Now AVX2
[ RUN      ] libyuvTest.ScaleTo640x360_Box
filter 3 -     4224 us C -      823 us OPT
[       OK ] libyuvTest.ScaleTo640x360_Box (875 ms)

Original comment by fbarch...@chromium.org on 14 Apr 2015 at 12:49

GoogleCodeExporter commented 8 years ago
set LIBYUV_WIDTH=1900

out\release\libyuv_unittest.exe

[  PASSED  ] 785 tests.
[  FAILED  ] 14 tests, listed below:
[  FAILED  ] libyuvTest.ARGBScaleClipTo320x240_Box
[  FAILED  ] libyuvTest.ARGBScaleClipFrom320x240_Box
[  FAILED  ] libyuvTest.ARGBScaleTo352x288_Box
[  FAILED  ] libyuvTest.ARGBScaleClipFrom352x288_Box
[  FAILED  ] libyuvTest.ARGBScaleClipTo569x480_Box
[  FAILED  ] libyuvTest.ARGBScaleClipFrom569x480_Box
[  FAILED  ] libyuvTest.ARGBScaleClipTo640x360_Box
[  FAILED  ] libyuvTest.ARGBScaleClipFrom640x360_Box
[  FAILED  ] libyuvTest.ARGBScaleClipFrom1280x720_Box
[  FAILED  ] libyuvTest.ScaleFrom320x240_Box
[  FAILED  ] libyuvTest.ScaleFrom352x288_Box
[  FAILED  ] libyuvTest.ScaleFrom569x480_Box
[  FAILED  ] libyuvTest.ScaleFrom640x360_Box
[  FAILED  ] libyuvTest.ScaleFrom1280x720_Box

14 FAILED TESTS

Original comment by fbarch...@google.com on 14 Apr 2015 at 10:41

GoogleCodeExporter commented 8 years ago
The box filter code does not support a source box width/height of less than 1.
Previously the box filter was avoided for upsampling.
That check was recently removed because downsampling height, while keeping width the same, was switching to bilinear.
Consider reintroducing the switch to bilinear, but only if the width and/or height goes up, not if it stays the same.

It's unknown why the clip tests fail, but I would guess the destination is small and the source for upsampling is less than 1 pixel.
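
The fallback being considered could look roughly like the sketch below. The enum and the function `ChooseFilterSketch` are hypothetical names for illustration and are not libyuv's API. The assumption is that box is kept whenever both dimensions shrink or stay the same, and bilinear is used as soon as either dimension grows.

```c
#include <assert.h>

/* Hypothetical filter modes for illustration; not libyuv's enum. */
typedef enum { kSketchBilinear, kSketchBox } SketchFilter;

/* Return the filter to actually use. Box is replaced by bilinear only when
 * the width or the height increases; an unchanged dimension keeps box. */
static SketchFilter ChooseFilterSketch(SketchFilter requested,
                                       int src_width, int src_height,
                                       int dst_width, int dst_height) {
  assert(src_width > 0 && src_height > 0);
  assert(dst_width > 0 && dst_height > 0);
  if (requested == kSketchBox &&
      (dst_width > src_width || dst_height > src_height)) {
    return kSketchBilinear; /* source box would be smaller than 1 pixel */
  }
  return requested;
}
```

Under this rule, scaling 640x3600 to 640x360 keeps the box filter, while scaling 320x240 up to a larger frame switches to bilinear.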

Original comment by fbarch...@chromium.org on 16 Apr 2015 at 7:51

GoogleCodeExporter commented 8 years ago
Box filter is slow for odd widths.  This is due to reading memory in columns.

set LIBYUV_WIDTH=1918
set LIBYUV_HEIGHT=1080
set LIBYUV_REPEAT=999
set LIBYUV_FLAGS=-1

out\debug\libyuv_unittest.exe --gtest_filter=*ScaleTo1x1_Box   | findstr /r "^[^_]*_[^_]*ms"
ScaleTo1x1_Box (805 ms)

set LIBYUV_WIDTH=1920
set LIBYUV_HEIGHT=1080
set LIBYUV_REPEAT=999
set LIBYUV_FLAGS=-1

out\debug\libyuv_unittest.exe --gtest_filter=*ScaleTo1x1_Box   | findstr /r "^[^_]*_[^_]*ms"
ScaleTo1x1_Box (356 ms)

Suggest a row-oriented function.
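
A row-oriented version might be structured as two passes, sketched below under stated assumptions: the names `AddRowSketch` and `BoxRowSketch` are hypothetical, the box size is a fixed integer, and the caller supplies a 16-bit scratch buffer of `src_width` entries. The first pass walks memory sequentially along each row; the second pass sums neighbouring accumulators.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: add one source row into 16-bit column sums. */
static void AddRowSketch(const uint8_t* src, uint16_t* sums, int width) {
  for (int x = 0; x < width; ++x) {
    sums[x] = (uint16_t)(sums[x] + src[x]);
  }
}

/* Produce one destination row from box_h source rows.
 * Pass 1 accumulates whole rows; pass 2 sums box_w adjacent accumulators. */
static void BoxRowSketch(const uint8_t* src, int src_stride, int src_width,
                         int box_w, int box_h, uint16_t* sums, uint8_t* dst) {
  assert(box_w > 0 && box_w <= src_width);
  assert(box_h > 0 && box_h <= 257); /* 257 * 255 still fits in uint16_t */
  memset(sums, 0, (size_t)src_width * sizeof(uint16_t));
  for (int y = 0; y < box_h; ++y) {
    AddRowSketch(src + (size_t)y * (size_t)src_stride, sums, src_width);
  }
  int dst_width = src_width / box_w;
  uint32_t count = (uint32_t)box_w * (uint32_t)box_h;
  for (int dx = 0; dx < dst_width; ++dx) {
    uint32_t total = 0;
    for (int i = 0; i < box_w; ++i) {
      total += sums[dx * box_w + i];
    }
    dst[dx] = (uint8_t)((total + count / 2) / count);
  }
}
```

The row pass is a simple element-wise add, which is the part that lends itself to SIMD.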

Original comment by fbarch...@chromium.org on 2 Jun 2015 at 1:31

GoogleCodeExporter commented 8 years ago
LIBYUV_WIDTH=1920 LIBYUV_HEIGHT=1080 LIBYUV_REPEAT=999 perf record out/Release/libyuv_unittest --gtest_filter=*ScaleTo640x360_Box*

64.98%  libyuv_unittest  libyuv_unittest      [.] ScaleAddRow_C
31.81%  libyuv_unittest  libyuv_unittest      [.] ScaleAddCols1_C
 2.19%  libyuv_unittest  libc-2.19.so         [.] memset
 0.64%  libyuv_unittest  libyuv_unittest      [.] ScalePlane
 0.19%  libyuv_unittest  [kernel.kallsyms]    [k] 0xffffffff8104f45a
 0.09%  libyuv_unittest  libyuv_unittest      [.] libyuv::TestFilter(int, int, int, int, libyuv::FilterMode, int, int)

Note that memset is called once per row to clear the accumulation buffer of ScaleAddRow_C.

Original comment by fbarch...@google.com on 22 Sep 2015 at 10:58

GoogleCodeExporter commented 8 years ago
Intel profile:
Samples: 2K of event 'cycles', Event count (approx.): 2669930815
 72.96%  libyuv_unittest  libyuv_unittest      [.] ScaleAddCols1_C
 20.23%  libyuv_unittest  libyuv_unittest      [.] ScaleAddRow_AVX2
  4.95%  libyuv_unittest  libc-2.19.so         [.] memset
  0.76%  libyuv_unittest  libyuv_unittest      [.] ScalePlane
  0.50%  libyuv_unittest  [kernel.kallsyms]    [k] 0xffffffff8104f45a
  0.21%  libyuv_unittest  libyuv_unittest      [.] libyuv::TestFilter(int, int, int, int, libyuv::FilterMode, int, int, int)
  0.13%  libyuv_unittest  libyuv_unittest      [.] ScaleAddRow_C
  0.07%  libyuv_unittest  libc-2.19.so         [.] _int_malloc
  0.04%  libyuv_unittest  libyuv_unittest      [.] memset@plt
  0.03%  libyuv_unittest  libyuv_unittest      [.] I420Scale
  0.02%  libyuv_unittest  libc-2.19.so         [.] __memcpy_sse2_unaligned
The profile shows memset is a bit high.  It could be avoided: memset is called
in 2 places, once per row for the accumulation buffer and, for odd widths,
once per row to clear the SIMD buffers.

Benchmark on Arm, where AddRow is C:
util/android/test_runner.py gtest -s libyuv_unittest -t 7200 --verbose --release --gtest_filter=*ScaleDownBy?_* -a "--libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=999 --libyuv_flags=-1" | grep ms | sed 's/\(.*(\)\([0-9]*\)\( ms)\)/\2 - \1\2\3/g' | sort -rn | sed 's/.*- \(.*\)/\1/g'
I  521.236s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy8_Box (219165 ms)
I  521.237s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy3_Box (49810 ms)
I  521.232s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy4_Box (30018 ms)
I  521.233s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy8_Bilinear (18233 ms)
I  521.233s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy8_Box (18164 ms)
I  521.232s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy8_Linear (15275 ms)
I  521.232s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy4_Bilinear (11854 ms)
I  521.236s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy8_Bilinear (11296 ms)
I  521.232s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy4_Linear (9126 ms)
I  521.235s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy4_Bilinear (8679 ms)
I  521.232s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy8_None (7800 ms)
I  521.235s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy8_Linear (7134 ms)
I  521.235s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy4_Linear (6575 ms)
I  521.231s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy2_Box (6391 ms)
I  521.231s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy2_Bilinear (6250 ms)
I  521.235s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy4_Box (5943 ms)
I  521.235s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy8_None (5066 ms)
I  521.231s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy4_None (4888 ms)
I  521.236s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy3_None (4863 ms)
I  521.236s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy3_Linear (4677 ms)
I  521.231s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy2_Linear (4228 ms)
I  521.236s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy3_Bilinear (4017 ms)
I  521.233s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy3_None (3690 ms)
I  521.233s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy3_Bilinear (3674 ms)
I  521.233s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy3_Linear (3669 ms)
I  521.233s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy3_Box (3654 ms)
I  521.231s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ARGBScaleDownBy2_None (2618 ms)
I  521.234s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy2_Bilinear (2576 ms)
I  521.234s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy2_Box (2562 ms)
I  521.234s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy4_None (2123 ms)
I  521.234s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy2_Linear (1843 ms)
I  521.234s run_tests_on_device(HT4A2JT03762)  [       OK ] LibYUVScaleTest.ScaleDownBy2_None (1111 ms)
I  521.237s run_tests_on_device(HT4A2JT03762)  [----------] 32 tests from LibYUVScaleTest (486988 ms total)
I  521.237s run_tests_on_device(HT4A2JT03762)  [==========] 32 tests from 1 test case ran. (486989 ms total)

Original comment by fbarch...@chromium.org on 16 Nov 2015 at 11:55

GoogleCodeExporter commented 8 years ago
Intel version working as expected.  C for columns, AVX2 for rows.
Consider future AVX2 for columns and Neon versions for rows/columns.

Original comment by fbarch...@google.com on 26 Jan 2016 at 1:27