dubhater / vapoursynth-mvtools

Motion compensation and stuff

Enable LTO and use __restrict on Degrain_C: 9-13% faster Degrain6 on 16-bit sources #62

adworacz closed 1 year ago

adworacz commented 1 year ago

As it says on the tin - this turns on Link Time Optimization officially (which has been the default on Arch Linux for a while now).

Additionally, I added the __restrict keyword to the Degrain_C function, which lets compilers better optimize memory access and register usage. This only impacts 16-bit sources, as 8-bit uses the SSE2 implementation and isn't affected.
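To illustrate the idea, here is a minimal sketch of a degrain-style weighted blend with __restrict on its pointer parameters. The function name `blend_rows_u16`, its parameters, and the 8-bit fixed-point weights are assumptions made for this example; it is not the actual Degrain_C code from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a degrain-style inner loop (not mvtools' Degrain_C).
// __restrict is a compiler extension (Clang, GCC, MSVC) promising that dst,
// src, and ref never overlap for the duration of the call.
void blend_rows_u16(uint16_t *__restrict dst,
                    const uint16_t *__restrict src,
                    const uint16_t *__restrict ref,
                    int w_src, int w_ref, std::size_t n) {
    assert(w_src + w_ref == 256); // assumed: weights are 8-bit fixed point
    for (std::size_t i = 0; i < n; ++i) {
        // Without __restrict the compiler has to allow for a store to dst[i]
        // modifying src[] or ref[], which can force reloads or runtime
        // overlap checks before it vectorizes the loop.
        dst[i] = static_cast<uint16_t>((src[i] * w_src + ref[i] * w_ref + 128) >> 8);
    }
}
```

The promise is only valid if callers really do pass non-overlapping buffers; passing aliased pointers to a __restrict-qualified function is undefined behavior.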

All outputs are identical between the master-clang, lto-clang, and lto-restrict-clang libraries. Used Clang as the compiler.

Tests were run with blocksize 8 and 16, on 4k, 1080p, and 540p content, with Degrain6 and Analyse (no Recalculate).

-o 1 == blocksize 8
-o 2 == blocksize 16

Results:

1080p and 540p:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `vspipe -p -e 3000 -o 1 --arg mvversion=master-clang --arg src=test-1080p.dgi tester.vpy /dev/null` | 148.705 ± 0.281 | 148.385 | 148.913 | 7.30 ± 0.02 |
| `vspipe -p -e 3000 -o 1 --arg mvversion=lto-clang --arg src=test-1080p.dgi tester.vpy /dev/null` | 132.927 ± 0.056 | 132.867 | 132.978 | 6.53 ± 0.01 |
| `vspipe -p -e 3000 -o 1 --arg mvversion=lto-restrict-clang --arg src=test-1080p.dgi tester.vpy /dev/null` | 131.146 ± 0.173 | 131.022 | 131.343 | 6.44 ± 0.01 |
| `vspipe -p -e 3000 -o 2 --arg mvversion=master-clang --arg src=test-1080p.dgi tester.vpy /dev/null` | 76.785 ± 0.046 | 76.755 | 76.838 | 3.77 ± 0.00 |
| `vspipe -p -e 3000 -o 2 --arg mvversion=lto-clang --arg src=test-1080p.dgi tester.vpy /dev/null` | 72.265 ± 0.012 | 72.257 | 72.279 | 3.55 ± 0.00 |
| `vspipe -p -e 3000 -o 2 --arg mvversion=lto-restrict-clang --arg src=test-1080p.dgi tester.vpy /dev/null` | 70.463 ± 0.014 | 70.446 | 70.471 | 3.46 ± 0.00 |
| `vspipe -p -e 3000 -o 1 --arg mvversion=master-clang --arg src=test-540p.dgi tester.vpy /dev/null` | 41.942 ± 0.026 | 41.917 | 41.969 | 2.06 ± 0.00 |
| `vspipe -p -e 3000 -o 1 --arg mvversion=lto-clang --arg src=test-540p.dgi tester.vpy /dev/null` | 38.550 ± 0.028 | 38.532 | 38.583 | 1.89 ± 0.00 |
| `vspipe -p -e 3000 -o 1 --arg mvversion=lto-restrict-clang --arg src=test-540p.dgi tester.vpy /dev/null` | 37.955 ± 0.024 | 37.941 | 37.982 | 1.86 ± 0.00 |
| `vspipe -p -e 3000 -o 2 --arg mvversion=master-clang --arg src=test-540p.dgi tester.vpy /dev/null` | 21.950 ± 0.012 | 21.937 | 21.958 | 1.08 ± 0.00 |
| `vspipe -p -e 3000 -o 2 --arg mvversion=lto-clang --arg src=test-540p.dgi tester.vpy /dev/null` | 21.101 ± 0.022 | 21.083 | 21.126 | 1.04 ± 0.00 |
| `vspipe -p -e 3000 -o 2 --arg mvversion=lto-restrict-clang --arg src=test-540p.dgi tester.vpy /dev/null` | 20.362 ± 0.019 | 20.346 | 20.383 | 1.00 |

4k:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `vspipe -p -e 500 -o 1 --arg mvversion=master-clang --arg src=test-4k.dgi tester.vpy /dev/null` | 110.780 ± 0.016 | 110.770 | 110.799 | 2.05 ± 0.00 |
| `vspipe -p -e 500 -o 1 --arg mvversion=lto-clang --arg src=test-4k.dgi tester.vpy /dev/null` | 106.270 ± 12.193 | 99.224 | 120.349 | 1.96 ± 0.23 |
| `vspipe -p -e 500 -o 1 --arg mvversion=lto-restrict-clang --arg src=test-4k.dgi tester.vpy /dev/null` | 97.966 ± 0.076 | 97.887 | 98.039 | 1.81 ± 0.00 |
| `vspipe -p -e 500 -o 2 --arg mvversion=master-clang --arg src=test-4k.dgi tester.vpy /dev/null` | 58.903 ± 0.004 | 58.900 | 58.907 | 1.09 ± 0.00 |
| `vspipe -p -e 500 -o 2 --arg mvversion=lto-clang --arg src=test-4k.dgi tester.vpy /dev/null` | 55.515 ± 0.048 | 55.463 | 55.559 | 1.03 ± 0.00 |
| `vspipe -p -e 500 -o 2 --arg mvversion=lto-restrict-clang --arg src=test-4k.dgi tester.vpy /dev/null` | 54.124 ± 0.041 | 54.089 | 54.169 | 1.00 |

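Note how lto-restrict-clang is always the fastest of the bunch, sometimes by 3-4% just from using __restrict, while enabling LTO alone gives roughly a 4-11% improvement depending on the test. The one noisy result is lto-clang at 4k with blocksize 8 (± 12.193 s), so that row's mean is less reliable than the others.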
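For anyone wanting to check the percentages against the tables, one way to derive them is as a speedup ratio of mean wall-clock times. The helper name `percent_faster` is made up for this sketch, and whether the headline figures were computed this way or as a reduction in run time is an assumption.

```cpp
#include <cassert>

// Hypothetical helper: how much faster `candidate_s` is than `baseline_s`,
// expressed as a percentage speedup (baseline / candidate - 1) * 100.
// Both arguments are mean run times in seconds, as reported in the tables.
double percent_faster(double baseline_s, double candidate_s) {
    assert(baseline_s > 0.0 && candidate_s > 0.0);
    return (baseline_s / candidate_s - 1.0) * 100.0;
}
```

With the 1080p means, `percent_faster(148.705, 131.146)` (master-clang vs lto-restrict-clang, blocksize 8) comes out at about 13%, and `percent_faster(76.785, 70.463)` (blocksize 16) at about 9%.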

dubhater commented 1 year ago

Thanks!