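This is purely cosmetic: the hand-written, loop-free Degrain SSE2 code is
replaced with loops. There should be no logic changes.
I tested this with 4k, 1080p, and 540p video. I could find no discernible
difference in performance between the looped and loop-free versions, and
the looped version takes up a lot less code.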
Results:
Degrain6 SSE2 WITHOUT loops on 8-bit, 1000 frames
540p = 77.5433
1080p = 50.5460
4k = 12.0173
Degrain6 SSE2 WITH loops on 8-bit, 1000 frames
540p = 79.3693
1080p = 49.8574
4k = 11.9895
This is with GCC 12.2.0 on Arch Linux.
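For anyone who has not looked at the diff, the two forms being compared look roughly like the sketch below. This is not the code from this PR: the helper names (`load8_widen`, `weighted`, `store8`), the two-reference example, and the assumption that the weights sum to 256 are all made up for illustration. It only shows the shape of "one line per reference" versus "a loop over references".

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (x86 / x86-64 only)
#include <cstdint>

// Hypothetical helper: load 8 pixels and widen them to eight 16-bit lanes.
static inline __m128i load8_widen(const uint8_t* p) {
    const __m128i px = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(p));
    return _mm_unpacklo_epi8(px, _mm_setzero_si128());
}

// Hypothetical helper: multiply 8 widened pixels by one weight.
static inline __m128i weighted(const uint8_t* p, int w) {
    return _mm_mullo_epi16(load8_widen(p), _mm_set1_epi16(static_cast<short>(w)));
}

// Round, divide by 256, and store 8 output pixels.
// Assumes the weights sum to 256, so the accumulator fits in 16 bits.
static inline void store8(uint8_t* dst, __m128i acc) {
    acc = _mm_srli_epi16(_mm_add_epi16(acc, _mm_set1_epi16(128)), 8);
    _mm_storel_epi64(reinterpret_cast<__m128i*>(dst), _mm_packus_epi16(acc, acc));
}

// "Without loops": one statement per reference, written out by hand.
inline void degrain2_unrolled(uint8_t* dst, const uint8_t* src, int wsrc,
                              const uint8_t* const* refs, const int* wrefs) {
    __m128i acc = weighted(src, wsrc);
    acc = _mm_add_epi16(acc, weighted(refs[0], wrefs[0]));
    acc = _mm_add_epi16(acc, weighted(refs[1], wrefs[1]));
    store8(dst, acc);
}

// "With loops": same arithmetic. N is a compile-time constant, so the
// compiler is free to unroll the loop on its own.
template <int N>
inline void degrain_looped(uint8_t* dst, const uint8_t* src, int wsrc,
                           const uint8_t* const* refs, const int* wrefs) {
    __m128i acc = weighted(src, wsrc);
    for (int i = 0; i < N; ++i)
        acc = _mm_add_epi16(acc, weighted(refs[i], wrefs[i]));
    store8(dst, acc);
}
```

Because the trip count is known at compile time, it is plausible that the compiler generates much the same code for both forms, which would be consistent with the numbers above being so close.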
Edit: Most recent commit is even faster:
Eliminate a loop in Degrain SSE2, providing even better performance.
Small improvement over the previous commit. We loop one fewer time
and gain a small speed boost.
Interestingly, I tried eliminating even more loops (like the accum
loop), eliminating temp variables, and a few other things, and it was
always slower. This version of the code is quite a simple change and is
the fastest. Who'da thunk.
Tested with MDegrain6 on 4k, 8-bit footage with block sizes 8 and 16
(outputs 1 and 2, respectively). Forced to use SSE2 by commenting out all
other code paths.
GCC 12.2.0
Slower runs are blksize 8, faster runs are blksize 16.
Before:
/mnt/.../Footage/test >>> hyperfine --show-output -r 3 -P output 1 2 'vspipe -p -e 500 -o {output} tester.vpy /dev/null'
Benchmark 1: vspipe -p -e 500 -o 1 tester.vpy /dev/null
Script evaluation done in 3.27 seconds
Output 501 frames in 80.49 seconds (6.22 fps)
Script evaluation done in 3.31 seconds
Output 501 frames in 80.83 seconds (6.20 fps)
Script evaluation done in 3.29 seconds
Output 501 frames in 80.85 seconds (6.20 fps)
Time (mean ± σ): 84.266 s ± 0.227 s [User: 3817.761 s, System: 6.674 s]
Range (min … max): 84.004 s … 84.405 s 3 runs
Benchmark 2: vspipe -p -e 500 -o 2 tester.vpy /dev/null
Script evaluation done in 3.33 seconds
Output 501 frames in 28.30 seconds (17.70 fps)
Script evaluation done in 3.30 seconds
Output 501 frames in 28.25 seconds (17.74 fps)
Script evaluation done in 3.33 seconds
Output 501 frames in 28.16 seconds (17.79 fps)
Time (mean ± σ): 31.790 s ± 0.072 s [User: 1331.485 s, System: 5.238 s]
Range (min … max): 31.722 s … 31.866 s 3 runs
Summary
'vspipe -p -e 500 -o 2 tester.vpy /dev/null' ran
2.65 ± 0.01 times faster than 'vspipe -p -e 500 -o 1 tester.vpy /dev/null'
After:
/mnt/.../Footage/test >>> hyperfine --show-output -r 3 -P output 1 2 'vspipe -p -e 500 -o {output} tester.vpy /dev/null' [130]
Benchmark 1: vspipe -p -e 500 -o 1 tester.vpy /dev/null
Script evaluation done in 3.24 seconds
Output 501 frames in 79.45 seconds (6.31 fps)
Script evaluation done in 3.29 seconds
Output 501 frames in 79.88 seconds (6.27 fps)
Script evaluation done in 3.30 seconds
Output 501 frames in 79.77 seconds (6.28 fps)
Time (mean ± σ): 83.235 s ± 0.255 s [User: 3772.476 s, System: 7.349 s]
Range (min … max): 82.948 s … 83.434 s 3 runs
Benchmark 2: vspipe -p -e 500 -o 2 tester.vpy /dev/null
Script evaluation done in 3.31 seconds
Output 501 frames in 27.66 seconds (18.12 fps)
Script evaluation done in 3.28 seconds
Output 501 frames in 27.71 seconds (18.08 fps)
Script evaluation done in 3.30 seconds
Output 501 frames in 27.65 seconds (18.12 fps)
Time (mean ± σ): 31.199 s ± 0.023 s [User: 1303.617 s, System: 5.921 s]
Range (min … max): 31.181 s … 31.224 s 3 runs
Summary
'vspipe -p -e 500 -o 2 tester.vpy /dev/null' ran
2.67 ± 0.01 times faster than 'vspipe -p -e 500 -o 1 tester.vpy /dev/null'
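To put a number on "small speed boost", the hyperfine means above can be compared directly. The helper below is just a throwaway sketch (`percent_faster` and `print_gains` are made-up names, not part of the plugin); the inputs are the mean wall-clock times copied from the Before and After runs.

```cpp
#include <cstdio>

// Percentage reduction in wall-clock time going from `before_s` to `after_s`.
constexpr double percent_faster(double before_s, double after_s) {
    return (before_s - after_s) / before_s * 100.0;
}

// Print the gain for each block size using the hyperfine mean times.
inline void print_gains() {
    std::printf("blksize 8:  %.2f%% less time\n", percent_faster(84.266, 83.235));
    std::printf("blksize 16: %.2f%% less time\n", percent_faster(31.790, 31.199));
}
```

That works out to roughly 1.2% for blksize 8 and 1.9% for blksize 16. The gain is small, but it is several times the run-to-run standard deviation hyperfine reports, over only 3 runs each.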