Open NSQY opened 2 years ago
fvsfunc is also slightly faster:
Output 240 frames in 26.83 seconds (8.94 fps)
Here is my input file, but any YUV420P16 clip should do I guess.
Why is core.num_threads = 1 specified?
Because the AVS script is single threaded. It would not be a fair comparison otherwise.
What I am trying to say is that both VapourSynth versions of GradFun3 use more CPU time than the Avisynth version to perform a similar action.
Here is the performance scaling with PreFetch. Avisynth seems to start misbehaving if PreFetch is set too high.
4 threads:
Output 240 frames in 8.32 seconds (28.84 fps)
FPS (min | max | average): 43.55 | 91.40 | 78.81
8 threads:
Output 240 frames in 4.58 seconds (52.42 fps)
FPS (min | max | average): 32.86 | 1289 | 97.96
12 threads:
Output 240 frames in 3.59 seconds (66.87 fps)
FPS (min | max | average): 21.68 | 692.2 | 79.80
16 threads:
Output 240 frames in 3.26 seconds (73.56 fps)
FPS (min | max | average): 1.562 | 833333 | 41.05
Replaced RGB input with BlankClip and ColorBarsHD()
1 thread:
Output 240 frames in 26.24 seconds (9.15 fps)
FPS (min | max | average): 17.37 | 31.17 | 24.21
4 threads:
Output 240 frames in 7.08 seconds (33.88 fps)
FPS (min | max | average): 22.15 | 277778 | 85.89
8 threads:
Output 240 frames in 3.99 seconds (60.08 fps)
FPS (min | max | average): 12.07 | 476191 | 122.3
12 threads:
Output 240 frames in 3.21 seconds (74.78 fps)
FPS (min | max | average): 7.720 | 909092 | 119.4
16 threads:
Output 240 frames in 2.97 seconds (80.69 fps)
FPS (min | max | average): 4.151 | 769231 | 78.37
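To put the vspipe figures above in perspective, here is a rough sketch that computes speedup and parallel efficiency relative to the 1-thread run. It only uses the wall-clock times reported above for the BlankClip input; the helper name `scaling_table` is made up for this illustration and is not part of vspipe or any other tool.

```python
# Wall-clock seconds for 240 frames, copied from the vspipe lines above
# (BlankClip input). Keys are thread counts.
VSPIPE_SECONDS = {1: 26.24, 4: 7.08, 8: 3.99, 12: 3.21, 16: 2.97}


def scaling_table(seconds_by_threads):
    """Return (threads, speedup, efficiency) tuples relative to the 1-thread run."""
    baseline = seconds_by_threads[1]
    rows = []
    for threads in sorted(seconds_by_threads):
        speedup = baseline / seconds_by_threads[threads]
        # Efficiency is the fraction of ideal linear scaling that was achieved.
        rows.append((threads, speedup, speedup / threads))
    return rows


if __name__ == "__main__":
    for threads, speedup, efficiency in scaling_table(VSPIPE_SECONDS):
        print(f"{threads:>2} threads: {speedup:5.2f}x speedup, {efficiency:6.1%} efficiency")
```

On these numbers, efficiency falls from roughly 93% at 4 threads to roughly 55% at 16 threads, so the extra threads buy progressively less.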
One thing I noticed when looking at --filter-time output is that this function uses a lot of BoxBlur calls. std.Convolution is much faster when BoxBlur is using a radius of 1. I do not know about other radii.
import vapoursynth as vs
core = vs.core
clip = core.std.BlankClip(width=1920, height=1080, format=vs.YUV420P16, length=240)
box = core.std.BoxBlur(clip, hradius=1, hpasses=1, vradius=1, vpasses=1, planes=[0, 1, 2])
con = core.std.Convolution(clip, matrix=[1]*9, planes=[0, 1, 2])
box.set_output(1)
con.set_output(2)
vspipe -p boxblur.vpy -o 1 .
Script evaluation done in 0.07 seconds
Output 240 frames in 0.54 seconds (443.65 fps)
vspipe -p boxblur.vpy -o 2 .
Script evaluation done in 0.09 seconds
Output 240 frames in 0.23 seconds (1031.13 fps)
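For reference, a radius-1 box blur and a 3x3 convolution with an all-ones matrix compute the same neighbourhood sum, so the gap above should come from the implementations rather than from the result. The pure-Python sketch below illustrates this on a tiny integer grid. The clamped edge handling and the helper names are assumptions for illustration only; they do not mirror either filter's internals, and the real filters may differ in rounding and border handling.

```python
def clamp(i, n):
    """Clamp index i into [0, n - 1] (edge pixels are repeated)."""
    return max(0, min(n - 1, i))


def conv3x3_ones(img):
    """Sum of the 3x3 neighbourhood around every pixel (all-ones kernel)."""
    h, w = len(img), len(img[0])
    return [[sum(img[clamp(y + dy, h)][clamp(x + dx, w)]
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1))
             for x in range(w)] for y in range(h)]


def box_sum_radius1(img):
    """Same sum computed separably: horizontal pass, then vertical pass."""
    h, w = len(img), len(img[0])
    horiz = [[sum(row[clamp(x + dx, w)] for dx in (-1, 0, 1)) for x in range(w)]
             for row in img]
    return [[sum(horiz[clamp(y + dy, h)][x] for dy in (-1, 0, 1)) for x in range(w)]
            for y in range(h)]


if __name__ == "__main__":
    grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    # Both approaches yield identical sums; dividing by 9 gives the mean.
    assert conv3x3_ones(grid) == box_sum_radius1(grid)
    print(conv3x3_ones(grid))
```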
fvsfunc raises ValueError: GradFun3: "radius" must be in 2-9 for smode=0 or 3 !.
muf.BoxBlur() is already optimized for small radius: https://github.com/WolframRhodium/muvsfunc/blob/0f9b43e6cadd05d3e617231e6b8d6e27cb24a059/muvsfunc.py#L1888-L1889
I did notice a bug in my implementation; it will be fixed later.
As for the single-threaded performance, I don't think VapourSynth is designed to optimize for latency rather than throughput.
> As for the single-threaded performance, I don't think VapourSynth is designed to optimize for latency rather than throughput.
You make a salient point. I had forgotten about PreFetch(). However, as per my updated comment muvsfunc.GradFun3 with 16 threads is slower than Avisynth with 4.
> muf.BoxBlur() is already optimized for small radius
I missed this. Thanks.
> fvsfunc raises ValueError: GradFun3: "radius" must be in 2-9 for smode=0 or 3 !.
We can just set radius to 9 for this, but the difference between muvsfunc and fvsfunc seems to be an fmtc.bitdepth call.
for x in {1..2}
time vspipe -p gradfun.vpy -o $x .
Script evaluation done in 0.10 seconds
Output 240 frames in 2.54 seconds (94.65 fps)
vspipe -p gradfun.vpy -o $x . 53.70s user 1.48s system 1930% cpu 2.858 total
avg shared (code): 0 KB
avg unshared (data/stack): 0 KB
total (sum): 0 KB
max memory: 1631 MB
page faults from disk: 0
other page faults: 690306
Script evaluation done in 0.08 seconds
Output 240 frames in 2.58 seconds (92.85 fps)
vspipe -p gradfun.vpy -o $x . 52.82s user 1.66s system 1869% cpu 2.914 total
avg shared (code): 0 KB
avg unshared (data/stack): 0 KB
total (sum): 0 KB
max memory: 1708 MB
page faults from disk: 0
other page faults: 735205
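To make the CPU-time point concrete, the sketch below turns the two `time` summary lines above into CPU-seconds per frame and the average number of busy cores. The regular expression assumes the zsh-style summary format shown above, and the helper name `cpu_cost` is hypothetical. The runs are keyed only by vspipe output index, since the thread does not show which index maps to which function.

```python
import re

# `time` summary lines copied from the two runs above, keyed by output index.
RUNS = {
    1: "53.70s user 1.48s system 1930% cpu 2.858 total",
    2: "52.82s user 1.66s system 1869% cpu 2.914 total",
}
FRAMES = 240

TIME_RE = re.compile(
    r"(?P<user>[\d.]+)s user (?P<sys>[\d.]+)s system \d+% cpu (?P<total>[\d.]+) total"
)


def cpu_cost(line, frames):
    """Return (cpu_seconds_per_frame, average_busy_cores) for one `time` line."""
    m = TIME_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognised time output: {line!r}")
    cpu_seconds = float(m["user"]) + float(m["sys"])
    return cpu_seconds / frames, cpu_seconds / float(m["total"])


if __name__ == "__main__":
    for index, line in RUNS.items():
        per_frame, cores = cpu_cost(line, FRAMES)
        print(f"output {index}: {per_frame:.3f} CPU-s/frame, ~{cores:.1f} cores busy")
```

Comparing CPU-seconds per frame rather than fps keeps the comparison fair when the two frameworks run with different thread counts.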
Thanks for the details. It seems that fmtc.bitdepth is a bottleneck. The performance should be improved after a later commit.
Anyway, std.BoxBlur is also a bottleneck, and ex_GradFun3 utilizes a different implementation (GaussianBlur).
Gaussian blur can be implemented using tcanny.TCanny (which is what ex_GradFun3 seems to use), but I want to keep the function in its original form as much as possible.
In terms of fvsfunc's implementation, it should be equivalent to the current implementation:
import vapoursynth as vs  # needed for vs.YUV420P16 below
import muvs
from muvs import core
import muvsfunc
import fvsfunc
muvs.pollute()
img = core.imwri.Read(r'banding.png')
img = core.resize.Bicubic(img, format=vs.YUV420P16, matrix_s='709').std.Loop(240)
# Record the calls made by muvsfunc's GradFun3
with open("test1.vpy", "w") as f:
    with muvs.record(f):
        gradfun = muvsfunc.GradFun3(img, thr=0.35, thrc=0.35, radius=9, elast=3.0, elastc=3.0, mask=2, smode=0)
# Record the calls made by fvsfunc's GradFun3
with open("test2.vpy", "w") as f:
    with muvs.record(f):
        gradfun2 = fvsfunc.GradFun3(img, thr=0.35, thrc=0.35, radius=9, elast=3.0, elastc=3.0, mask=2, smode=0)
gradfun.set_output()
Just run this script and check the two output files, which record the actual function calls.
It's a shame --filter-time in vspipe isn't more developed.
Apologies if I'm being obtuse, but:
Your code sample returns:
AttributeError: module 'muvs' has no attribute 'recorder'. Did you mean: 'Recorder'?'
After taking a look at your wiki, running my own function works:
from functools import partial
import vapoursynth as vs
import muvs
from muvs import core
import muvsfunc
import fvsfunc
muvs.pollute()
def test(clip: vs.VideoNode) -> vs.VideoNode:
    from vsutil import iterate
    return iterate(clip, partial(core.std.Invert), 5)
img = core.imwri.Read(r'banding.png')
img = core.resize.Bicubic(img, format=vs.YUV420P16, matrix_s='709').std.Loop(240)
with muvs.record("transformed_code.vpy") as recorder:
    muvsfunc.GradFun3(img)
    recorder.write("# label")
However if I replace test(img) with muvsfunc.GradFun3 or similar functions:
muvsfunc.GradFun3(img)
  File "/usr/lib/python3.10/site-packages/muvsfunc.py", line 602, in GradFun3
    src_8 = core.fmtc.bitdepth(src, bits=8, dmode=1, planes=[0]) if bits != 8 else src
  File "/usr/lib/python3.10/site-packages/muvs.py", line 276, in closure
    recorder.buffer.append(self._get_str(func, args, kwargs, output) + '\n')
  File "/usr/lib/python3.10/site-packages/muvs.py", line 322, in _get_str
    call_args = ', '.join(f"{k}={_repr(v)}" for k, v in args_dict.items() if v is not None)
  File "/usr/lib/python3.10/site-packages/muvs.py", line 322, in <genexpr>
    call_args = ', '.join(f"{k}={_repr(v)}" for k, v in args_dict.items() if v is not None)
  File "/usr/lib/python3.10/site-packages/muvs.py", line 182, in closure
    elif isinstance(obj, vs.Format):
AttributeError: module 'vapoursynth' has no attribute 'Format'
Ah, this should be vs.VideoFormat, I think. After changing this, I can run muvsfunc.GradFun3(img) in the example I've given. But your snippet still errors.
https://github.com/WolframRhodium/muvsfunc/blob/8221fbcc98a1b212623d5d53501cbafb922c7846/muvs.py#L182
Edit: Yes I was being obtuse. It's record not recorder. You must have unpushed changes. What a useful function!
Oh, I made a typo. Thanks for the correction.
> Gaussian blur can be implemented using tcanny.TCanny (which is what ex_GradFun3 seems to use)
Small correction: it appears that ex_GradFun3 is using ex_GaussianBlur, which is an approximate Gaussian blur and therefore faster than tcanny.
https://github.com/Dogway/Avisynth-Scripts/blob/master/ExTools.avsi#L1631
8 threads:
Output 1000 frames in 1.31 seconds (766.22 fps)
FPS (min | max | average): 102.4 | 909091 | 1145
Pay no mind to the 'max' speed of these results; as far as I can tell, the average is accurate.
Also, for some reason tcanny is 10% faster when using 8 threads instead of my CPU's native thread count (24). Tracking: https://github.com/HomeOfVapourSynthEvolution/VapourSynth-TCanny/issues/12
Actually, a friend of mine has pointed out that the Dither tools GradFun3 is even faster when using ConvertToStacked(). All testing is done with radius=9 for parity with fvsfunc unless otherwise stated.
ColorBars(width=1920, height=1080, pixel_type="YUV420P16")
Trim(0, 240)
dithertools = ConvertToStacked()
dithertools = GradFun3(dithertools, thr=0.35, thrc=0.35, radius=9, elast=3.0, elastc=3.0, mask=2, smode=0, lsb_in=true, lsb=true)
dithertools = ConvertFromStacked(dithertools)
exMod = ex_GradFun3(thr=0.35, thrc=0.35, radius=9, elast=3.0, elastc=3.0, mask=2, smode=0)
dithertools
PreFetch(8)
8 threads:
muvsfunc git master:
Output 240 frames in 3.66 seconds (65.58 fps)
dithertools
FPS (min | max | average): 5.833 | 909091 | 174.5
ex_GradFun3
FPS (min | max | average): 11.07 | 833333 | 121.4
Hi, please take a look at this box blur implementation.
It is a drop-in replacement for std.BoxBlur (=> box.Blur) and runs faster on Intel platforms (slower on Zen 3, sadly).
I have a Zen 2 CPU, so I may not be an ideal candidate. I'm not sure which of the modes at https://github.com/Dogway/Avisynth-Scripts/blob/master/ExTools.avsi#L957 are most equivalent. Please advise.
I have no idea how ex_boxblur could possibly be faster than std.Convolution?
Using the windows artifact for this.
import vapoursynth as vs
core = vs.core
core.num_threads = 16
img = core.std.BlankClip(width=1920, height=1080, format=vs.YUV420P16, fpsnum=30000, fpsden=1001, length=2000)
std = core.std.BoxBlur(img, planes=[0, 1, 2])
con = core.std.Convolution(img, matrix=[1]*9, planes=[0, 1, 2])
box = core.box.Blur(img, planes=[0, 1, 2])
std.set_output(1)
con.set_output(2)
box.set_output(3)
ColorBars(width=1920, height=1080, pixel_type="YUV420P16")
Trim(0, 2000)
ex_boxblur(1,mode="weighted",UV=3)
PreFetch(8)
FPS (min | max | average): 83.41 | 434783 | 1560
vspipe -p .\vs_boxblur.vpy -o 1 .
Script evaluation done in 0.04 seconds
Output 2000 frames in 5.11 seconds (391.64 fps)
vspipe -p .\vs_boxblur.vpy -o 2 .
Script evaluation done in 0.08 seconds
Output 2000 frames in 1.84 seconds (1085.66 fps)
vspipe -p .\vs_boxblur.vpy -o 3 .
Script evaluation done in 0.04 seconds
Output 2000 frames in 5.22 seconds (382.99 fps)
Thanks. Will replacing all std.BoxBlur with box.Blur in muvsfunc make your original script run faster? And what about smode=1?
No, it does not appear that it would. Please re-evaluate my post.
I am facing the same issue that I have described here: https://github.com/HomeOfVapourSynthEvolution/VapourSynth-TCanny/issues/12
Increasing thread counts past 6 causes CPU and memory usage to increase, but performance stays the same. Is this an upstream issue with Vapoursynth? Or is my methodology flawed?
core.box.Blur(img, planes=[0, 1, 2]):
core.num_threads = 6, approx 25% CPU usage and 250 MB of memory as reported by task manager
Output 2000 frames in 5.46 seconds (366.60 fps)
core.num_threads = 16, approx 50% CPU usage and 500 MB.
Output 2000 frames in 5.22 seconds (383.17 fps)
VapourSynth Classic (https://github.com/AmusementClub/VapourSynth-Portable-Maker) is faster, but the issue persists.
core.num_threads = 6
.\VSPipe.exe -p .\boxblur.vpy -o 1 .
Script evaluation done in 0.03 seconds
Output 2000 frames in 3.83 seconds (521.67 fps)
.\VSPipe.exe -p .\boxblur.vpy -o 2 .
Script evaluation done in 0.03 seconds
Output 2000 frames in 1.37 seconds (1459.02 fps)
.\VSPipe.exe -p .\boxblur.vpy -o 3 .
Script evaluation done in 0.03 seconds
Output 2000 frames in 4.11 seconds (486.77 fps)
core.num_threads = 16
.\VSPipe.exe -p .\boxblur.vpy -o 1 .
Script evaluation done in 0.03 seconds
Output 2000 frames in 4.10 seconds (487.37 fps)
.\VSPipe.exe -p .\boxblur.vpy -o 2 .
Script evaluation done in 0.04 seconds
Output 2000 frames in 1.63 seconds (1228.84 fps)
.\VSPipe.exe -p .\boxblur.vpy -o 3 .
Script evaluation done in 0.03 seconds
Output 2000 frames in 4.58 seconds (436.73 fps)
> And what about smode=1?
muvsfunc.GradFun3 using smode=1 is faster than both Avisynth versions. Using upstream VS here (not VS-Classic)
smode=1:
dithertools: FPS (min | max | average): 0.824 | 357143 | 12.56
ex_GradFun3: FPS (min | max | average): 2.179 | 333333 | 19.85
muvsfunc.GradFun3
8 threads: Output 240 frames in 9.37 seconds (25.61 fps)
16 threads: Output 240 frames in 6.16 seconds (38.99 fps)
smode=2:
dithertools: FAIL
ex_GradFun3: FAIL
ex_bilateral: Radius should be 6 or below
muvsfunc.GradFun3
8 threads: Output 240 frames in 4.25 seconds (56.51 fps)
16 threads: Output 240 frames in 3.03 seconds (79.23 fps)
smode=3:
dithertools: FPS (min | max | average): 9.336 | 370370 | 178.9
ex_GradFun3: FPS (min | max | average): 11.78 | 454545 | 135.8
muvsfunc.GradFun3
8 threads: Output 240 frames in 3.09 seconds (77.74 fps)
16 threads: Output 240 frames in 2.55 seconds (94.06 fps)
> No, it does not appear that it would. Please re-evaluate my post.
> I am facing the same issue that I have described here: HomeOfVapourSynthEvolution/VapourSynth-TCanny#12
> Increasing thread counts past 6 causes CPU and memory usage to increase, but performance stays the same. Is this an upstream issue with Vapoursynth? Or is my methodology flawed?
> core.box.Blur(img, planes=[0, 1, 2]):
> core.num_threads = 6, approx 25% CPU usage and 250 MB of memory as reported by task manager
> Output 2000 frames in 5.46 seconds (366.60 fps)
> core.num_threads = 16, approx 50% CPU usage and 500 MB.
> Output 2000 frames in 5.22 seconds (383.17 fps)
Maybe core.max_cache_size should be set to a larger threshold?
Above comment updated with other smodes.
> Maybe core.max_cache_size should be set to a larger threshold?
Please advise, what is an acceptable 'larger' threshold? Using a 3900X.
I see no appreciable difference between setting it to 1, 50, or 200 when using 16 threads.
> Please advise, what is an acceptable 'larger' threshold? Using a 3900X. I see no appreciable difference between setting it to 1, 50, or 200 when using 16 threads.
Something like 8000 or higher? Its unit is megabytes.
Performance with 2000, 4000, 6000, 8000, and 12000 is within the expected deviation of the results with 1, 50, or 200.
> > Please advise, what is an acceptable 'larger' threshold? Using a 3900X. I see no appreciable difference between setting it to 1, 50, or 200 when using 16 threads.
> Something like 8000 or higher? Its unit is megabytes.
After some discussion, I have been told that it is not an issue of memory but of CPU cache, which I suppose is why this did nothing. Calling std.BlankClip(length=n) is also slower than std.BlankClip(length=1)*n, or std.Loop(n). Would looping with core.std.BlankClip(clip, length=1).std.Loop(clip.num_frames) result in a minor improvement across the board?
When running light filters on my system (std.Convolution, std.BoxBlur, box.Blur), the ideal thread count is somewhere between 3 and 6 depending on the input depth (BoxBlur is faster than box.Blur with 12 threads, but slower with 4). Beyond 6 threads, performance degrades. It is only when using compute-heavy filters (bm3dcpu.BM3D) or otherwise complex functions that I see a consistent correlation between thread count and processing speed.
I do not have an ideal way of testing filter performance between std.BoxBlur and box.Blur independently, so it might be best to just replace all the BoxBlur calls in relevant functions and run the numbers on those.
Similar discussion https://github.com/Asd-g/AviSynth-SmoothUV2/issues/2#issuecomment-877823281
muvsfunc.GradFun3 is between 5 and 10 fps faster in synthetic testing (depending on threads, difference is more noticeable at higher thread counts) when using std.BlankClip(length=1).std.Loop(n) over std.BlankClip(length=n).
> Thanks. Will replacing all std.BoxBlur with box.Blur in muvsfunc make your original script run faster?
After replacing std.BoxBlur in BoxFilter, performance is significantly improved at low thread counts. This may be a more realistic use case, as other filters will be eating CPU resources in real scripts.
muvsfunc.GradFun3(clip, thr=0.35, thrc=0.35, radius=9, elast=3.0, elastc=3.0, mask=2, smode=0):
std.BoxBlur
2 threads: Output 240 frames in 7.12 seconds (33.69 fps)
4 threads: Output 240 frames in 3.97 seconds (60.51 fps)
box.Blur
2 threads: Output 240 frames in 2.79 seconds (85.91 fps)
4 threads: Output 240 frames in 2.20 seconds (109.03 fps)
> Would looping with core.std.BlankClip(clip, length=1).std.Loop(clip.num_frames) result in a minor improvement across the board?
It should, because only a single frame is accessed.
> After replacing std.BoxBlur in BoxFilter, performance is significantly improved at low thread counts. This may be a more realistic use case, as other filters will be eating CPU resources in real scripts.
> muvsfunc.GradFun3(clip, thr=0.35, thrc=0.35, radius=9, elast=3.0, elastc=3.0, mask=2, smode=0):
> std.BoxBlur
> 2 threads: Output 240 frames in 7.12 seconds (33.69 fps)
> 4 threads: Output 240 frames in 3.97 seconds (60.51 fps)
> box.Blur
> 2 threads: Output 240 frames in 2.79 seconds (85.91 fps)
> 4 threads: Output 240 frames in 2.20 seconds (109.03 fps)
This result is better than I expected. Thanks for the detailed benchmark.
Make of this what you will.
Box blur with a radius of 1 or 2 is expected to be faster on std.Convolution than the other two implementations. Which value of radius is tested?
import vapoursynth as vs
core = vs.core
clip = core.std.BlankClip(width=1920, height=1080, format=vs.YUV420P8, length=1)*2400
con = core.std.Convolution(clip, matrix=[1]*9, planes=[0, 1, 2])
std = core.std.BoxBlur(clip, planes=[0, 1, 2])
box = core.box.Blur(clip, planes=[0, 1, 2])
con.set_output(0)
std.set_output(1)
box.set_output(2)
This result is expected, which is why the python module should dispatch to the appropriate plugin:
> muf.BoxBlur() is already optimized for small radius
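As an illustration of that kind of dispatch, here is a minimal sketch that only decides which filter to call and does not import VapourSynth. The function name `pick_box_blur` and the returned tuple format are hypothetical. The radius follows the std.BoxBlur convention (window size 2 * radius + 1), and the cut-off at radius 2 follows the expectation stated in this thread; it is not copied from muvsfunc's actual code.

```python
def pick_box_blur(radius):
    """Choose a backend for a square box blur of the given radius.

    Returns (filter_name, kwargs). Hypothetical helper: the threshold is an
    assumption based on the discussion, not muvsfunc's real logic.
    """
    if radius < 1:
        raise ValueError("radius must be >= 1")
    if radius <= 2:
        # std.Convolution takes a 3x3 (9-element) or 5x5 (25-element) matrix,
        # which covers radius 1 and 2 with an all-ones kernel.
        size = 2 * radius + 1
        return "std.Convolution", {"matrix": [1] * (size * size)}
    # Larger radii do not fit a single square convolution matrix.
    return "std.BoxBlur", {"hradius": radius, "vradius": radius}


if __name__ == "__main__":
    for r in (1, 2, 9):
        name, kwargs = pick_box_blur(r)
        print(r, name, sorted(kwargs))
```

A real implementation would also have to consider the thread count and CPU, since the benchmarks in this thread suggest the fastest filter changes with both.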
I understand this, but I am just providing more information about how different filters behave at n thread counts (on my CPU) - which is relevant for comparison with AVS filters.
It's strange that the performance of box.Blur drops as the thread count increases. Maybe that's related to cache misses.
You can try vapoursynth-zboxblur, which uses the same boxblur as FFmpeg, thus avoiding the need for transpose.
Hi Wolfram,
It has come to my attention that your port of GradFun3 is significantly slower than the Avisynth version and the modified version from https://github.com/Dogway/Avisynth-Scripts/tree/master/EX%20mods.
To match the speed of the Avisynth version, significantly more CPU cycles must be used. Testing done on my Windows 10 VM (AVS on Linux is a nightmare!)
muvsfunc.GradFun3(img, thr=0.35, thrc=0.35, radius=17, elast=3.0, elastc=3.0, mask=2, smode=0)
Output 240 frames in 29.42 seconds (8.16 fps)
Measured with https://forum.doom9.org/showthread.php?t=174797
AviSynth+ 3.7.1 (r3593, master, x86_64) (3.7.1.0):
GradFun3 (8-bit input)
FPS (min | max | average): 4.694 | 29.31 | 26.97
ex_GradFun3 (16-bit input)
FPS (min | max | average): 15.34 | 31.50 | 23.37