dubhater / vapoursynth-mvtools

Motion compensation and stuff

Loops everywhere - Cleanup init in GetFrame and Degrain SSE2 code #58

Closed adworacz closed 1 year ago

adworacz commented 1 year ago

This is purely cosmetic. There should be no logic changes.

I tested this with 4k, 2k, and 540p video. I could find no discernible difference in performance between the unrolled and the looped versions, and the looped version takes up a lot less code.
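To illustrate what "purely cosmetic" means here, below is a minimal sketch in plain Python, not the plugin's actual SSE2 code. It assumes a degrain-style weighted average in which the weights sum to 256 and the result is rounded with +128 before a shift by 8. The function names and sample values are hypothetical. The unrolled form names every reference explicitly, while the looped form iterates over a list of references and produces the same result.

```python
def degrain_unrolled(src, ref0, ref1, w_src, w0, w1):
    """Weighted average with every reference written out by hand."""
    return [
        (s * w_src + a * w0 + b * w1 + 128) >> 8
        for s, a, b in zip(src, ref0, ref1)
    ]


def degrain_looped(src, refs, w_src, w_refs):
    """Same arithmetic, but the references are accumulated in a loop."""
    out = []
    for x, s in enumerate(src):
        acc = 128 + s * w_src          # rounding term plus source contribution
        for ref, w in zip(refs, w_refs):
            acc += ref[x] * w          # one reference per iteration
        out.append(acc >> 8)           # weights are assumed to sum to 256
    return out


# Hypothetical sample row of 8-bit pixels and weights (128 + 64 + 64 = 256).
src = [10, 200, 255]
ref0 = [12, 198, 255]
ref1 = [8, 202, 250]

unrolled = degrain_unrolled(src, ref0, ref1, 128, 64, 64)
looped = degrain_looped(src, [ref0, ref1], 128, [64, 64])
print(unrolled == looped)
```

The looped form scales to any number of references without adding code, which is where the reduction in code size comes from.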

Results:

Degrain6 SSE2 WITHOUT loops on 8-bit, 1000 frames
540p = 77.5433
1080p = 50.5460
4k = 12.0173

Degrain6 SSE2 WITH loops on 8-bit, 1000 frames
540p = 79.3693
1080p = 49.8574
4k = 11.9895
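As a quick check on "no discernible difference", the relative change between the two sets of figures can be worked out as follows. This is a small sketch; the helper name `relative_change_percent` is made up for illustration, and the numbers are copied from the results above without assuming a unit.

```python
def relative_change_percent(without_loops, with_loops):
    """Percentage change of the looped figure relative to the unrolled one."""
    return (with_loops - without_loops) / without_loops * 100.0


# (without loops, with loops) for each resolution, as reported above.
results = {
    "540p": (77.5433, 79.3693),
    "1080p": (50.5460, 49.8574),
    "4k": (12.0173, 11.9895),
}

changes = {
    name: relative_change_percent(without, with_)
    for name, (without, with_) in results.items()
}

for name, change in changes.items():
    print(f"{name}: {change:+.2f}%")
```

The three changes come out at roughly +2.4%, -1.4%, and -0.2%, with no consistent direction across resolutions.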

This is with GCC 12.2.0 on Arch Linux.


Edit: Most recent commit is even faster:

Eliminate a loop in Degrain SSE2, providing even better performance.

Small improvement over the previous commit. We're looping one less time and gain a small speed boost.

Interestingly, I tried eliminating even more loops (like the accum loop), eliminating temp variables, and a few other things, and it was always slower. This version of the code is quite a simple change and is the fastest. Who'dda thunk.

Tested with MDegrain6 on 4k, 8-bit footage with block sizes of 8 and 16 (outputs 1 and 2 of tester.vpy, respectively). I forced the SSE2 path by commenting out all other code paths.

GCC 12.2.0

Slower runs are blksize 8, faster runs are blksize 16.

Before:

/mnt/.../Footage/test >>> hyperfine --show-output -r 3 -P output 1 2 'vspipe -p -e 500 -o {output} tester.vpy /dev/null'
Benchmark 1: vspipe -p -e 500 -o 1 tester.vpy /dev/null
Script evaluation done in 3.27 seconds
Output 501 frames in 80.49 seconds (6.22 fps)
Script evaluation done in 3.31 seconds
Output 501 frames in 80.83 seconds (6.20 fps)
Script evaluation done in 3.29 seconds
Output 501 frames in 80.85 seconds (6.20 fps)
  Time (mean ± σ):     84.266 s ±  0.227 s    [User: 3817.761 s, System: 6.674 s]
  Range (min … max):   84.004 s … 84.405 s    3 runs

Benchmark 2: vspipe -p -e 500 -o 2 tester.vpy /dev/null
Script evaluation done in 3.33 seconds
Output 501 frames in 28.30 seconds (17.70 fps)
Script evaluation done in 3.30 seconds
Output 501 frames in 28.25 seconds (17.74 fps)
Script evaluation done in 3.33 seconds
Output 501 frames in 28.16 seconds (17.79 fps)
  Time (mean ± σ):     31.790 s ±  0.072 s    [User: 1331.485 s, System: 5.238 s]
  Range (min … max):   31.722 s … 31.866 s    3 runs

Summary
  'vspipe -p -e 500 -o 2 tester.vpy /dev/null' ran
    2.65 ± 0.01 times faster than 'vspipe -p -e 500 -o 1 tester.vpy /dev/null'

After:

/mnt/.../Footage/test >>> hyperfine --show-output -r 3 -P output 1 2 'vspipe -p -e 500 -o {output} tester.vpy /dev/null'
Benchmark 1: vspipe -p -e 500 -o 1 tester.vpy /dev/null
Script evaluation done in 3.24 seconds
Output 501 frames in 79.45 seconds (6.31 fps)
Script evaluation done in 3.29 seconds
Output 501 frames in 79.88 seconds (6.27 fps)
Script evaluation done in 3.30 seconds
Output 501 frames in 79.77 seconds (6.28 fps)
  Time (mean ± σ):     83.235 s ±  0.255 s    [User: 3772.476 s, System: 7.349 s]
  Range (min … max):   82.948 s … 83.434 s    3 runs

Benchmark 2: vspipe -p -e 500 -o 2 tester.vpy /dev/null
Script evaluation done in 3.31 seconds
Output 501 frames in 27.66 seconds (18.12 fps)
Script evaluation done in 3.28 seconds
Output 501 frames in 27.71 seconds (18.08 fps)
Script evaluation done in 3.30 seconds
Output 501 frames in 27.65 seconds (18.12 fps)
  Time (mean ± σ):     31.199 s ±  0.023 s    [User: 1303.617 s, System: 5.921 s]
  Range (min … max):   31.181 s … 31.224 s    3 runs

Summary
  'vspipe -p -e 500 -o 2 tester.vpy /dev/null' ran
    2.67 ± 0.01 times faster than 'vspipe -p -e 500 -o 1 tester.vpy /dev/null'
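
To put a number on the before/after difference, the fps figures that vspipe prints can be averaged and compared. The sketch below is only an illustration: the regular expression and the helper names (`mean_fps`, `gain_percent`) are assumptions, and the log lines are copied from the runs above.

```python
import re
from statistics import mean

# Matches vspipe summary lines such as
# "Output 501 frames in 80.49 seconds (6.22 fps)".
FPS_LINE = re.compile(r"Output \d+ frames in [\d.]+ seconds \(([\d.]+) fps\)")


def mean_fps(log_text):
    """Average the fps values found in a chunk of vspipe output."""
    return mean(float(m.group(1)) for m in FPS_LINE.finditer(log_text))


def gain_percent(before, after):
    """Percentage gain of `after` over `before`."""
    return (after - before) / before * 100.0


before_blk8 = """Output 501 frames in 80.49 seconds (6.22 fps)
Output 501 frames in 80.83 seconds (6.20 fps)
Output 501 frames in 80.85 seconds (6.20 fps)"""
after_blk8 = """Output 501 frames in 79.45 seconds (6.31 fps)
Output 501 frames in 79.88 seconds (6.27 fps)
Output 501 frames in 79.77 seconds (6.28 fps)"""
before_blk16 = """Output 501 frames in 28.30 seconds (17.70 fps)
Output 501 frames in 28.25 seconds (17.74 fps)
Output 501 frames in 28.16 seconds (17.79 fps)"""
after_blk16 = """Output 501 frames in 27.66 seconds (18.12 fps)
Output 501 frames in 27.71 seconds (18.08 fps)
Output 501 frames in 27.65 seconds (18.12 fps)"""

gain_blk8 = gain_percent(mean_fps(before_blk8), mean_fps(after_blk8))
gain_blk16 = gain_percent(mean_fps(before_blk16), mean_fps(after_blk16))

print(f"blksize 8:  {gain_blk8:+.2f}% fps")
print(f"blksize 16: {gain_blk16:+.2f}% fps")
```

On these runs the gain works out to roughly 1.3% for blksize 8 and 2.0% for blksize 16.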

dubhater commented 1 year ago

Thanks!