Potential for optimization: Gradfun3

NSQY commented 2 years ago

Hi Wolfram,

It has come to my attention that your port of Gradfun3 is significantly slower than the Avisynth version and the modified version from https://github.com/Dogway/Avisynth-Scripts/tree/master/EX%20mods

To match the speed of the Avisynth version, significantly more CPU cycles must be used. Testing done on my Windows 10 VM (AVS on Linux is a nightmare!)

Python 3.9.5 (tags/v3.9.5:0a7dcbd, May  3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> import vapoursynth
>>> print(vapoursynth.core.version())
VapourSynth Video Processing Library
Copyright (c) 2012-2021 Fredrik Mellbin
Core R57
API R4.0
API R3.6
Options: -

>>>

muvsfunc.GradFun3(img, thr=0.35, thrc=0.35, radius=17, elast=3.0, elastc=3.0, mask=2, smode=0)
Output 240 frames in 29.42 seconds (8.16 fps)

Measured with https://forum.doom9.org/showthread.php?t=174797 AviSynth+ 3.7.1 (r3593, master, x86_64) (3.7.1.0):

GradFun3 (8-bit input) FPS (min | max | average): 4.694 | 29.31 | 26.97
ex_GradFun3 (16-bit input) FPS (min | max | average): 15.34 | 31.50 | 23.37

import vapoursynth as vs
core = vs.core
core.num_threads = 1

import muvsfunc

img = core.imwri.Read(r'banding.png')
img = core.resize.Bicubic(img, format=vs.YUV420P16, matrix_s='709')*240
gradfun = muvsfunc.GradFun3(img, thr=0.35, thrc=0.35, radius=17, elast=3.0, elastc=3.0, mask=2, smode=0)
gradfun.set_output()

ImageSource("banding.png")
Trim(0, 240)
ConvertToYUV420(matrix="709")

exMod = ConvertBits(16)
exMod = ex_GradFun3(exMod, thr=0.35, thrc=0.35, radius=17, elast=3.0, elastc=3.0, mask=2, smode=0)
exMod
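To put a number on the gap reported above, the vspipe summary line can be parsed and compared with the AVSMeter average. This is only a rough sketch: the helper name `parse_vspipe` is made up for illustration, and the two tools do not necessarily measure throughput in the same way, so the ratio is indicative at best.

```python
import re

# Pattern for the summary line that vspipe prints with -p.
VSPIPE_LINE = re.compile(r"Output (\d+) frames in ([\d.]+) seconds \(([\d.]+) fps\)")

def parse_vspipe(line: str) -> tuple[int, float, float]:
    """Return (frames, seconds, reported_fps) from a vspipe summary line."""
    match = VSPIPE_LINE.search(line)
    if match is None:
        raise ValueError(f"not a vspipe summary line: {line!r}")
    return int(match.group(1)), float(match.group(2)), float(match.group(3))

frames, seconds, reported_fps = parse_vspipe(
    "Output 240 frames in 29.42 seconds (8.16 fps)"
)
computed_fps = frames / seconds   # 240 / 29.42
avs_average_fps = 26.97           # GradFun3 (8-bit input) average quoted above
ratio = avs_average_fps / computed_fps

print(f"computed fps: {computed_fps:.2f}")
print(f"AviSynth average is {ratio:.1f}x the VapourSynth rate")
```

With the single-threaded figures from this comment, the AviSynth average comes out at roughly 3.3 times the muvsfunc rate.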

NSQY commented 2 years ago

fvsfunc is also slightly faster: Output 240 frames in 26.83 seconds (8.94 fps)

Here is my input file, but any YUV420P16 clip should do I guess.

[Attached image: oXK3]

WolframRhodium commented 2 years ago

Why is core.num_threads = 1 specified?

NSQY commented 2 years ago

Because the AVS script is single threaded. It would not be a fair comparison otherwise.

What I am trying to say is that both VapourSynth versions of GradFun3 use more CPU time than the Avisynth version to perform a similar action.

Here is performance scaling with PreFetch. Avisynth seems to start misbehaving if PreFetch is set too high.

4 threads:
Output 240 frames in 8.32 seconds (28.84 fps)
FPS (min | max | average): 43.55 | 91.40 | 78.81

8 threads:
Output 240 frames in 4.58 seconds (52.42 fps)
FPS (min | max | average): 32.86 | 1289 | 97.96

12 threads:
Output 240 frames in 3.59 seconds (66.87 fps)
FPS (min | max | average): 21.68 | 692.2 | 79.80

16 threads:
Output 240 frames in 3.26 seconds (73.56 fps)
FPS (min | max | average): 1.562 | 833333 | 41.05

Replaced RGB input with BlankClip and ColorBarsHD()

1 thread:
Output 240 frames in 26.24 seconds (9.15 fps)
FPS (min | max | average): 17.37 | 31.17 | 24.21

4 threads:
Output 240 frames in 7.08 seconds (33.88 fps)
FPS (min | max | average): 22.15 | 277778 | 85.89

8 threads:
Output 240 frames in 3.99 seconds (60.08 fps)
FPS (min | max | average): 12.07 | 476191 | 122.3

12 threads:
Output 240 frames in 3.21 seconds (74.78 fps)
FPS (min | max | average): 7.720 | 909092 | 119.4

16 threads:
Output 240 frames in 2.97 seconds (80.69 fps)
FPS (min | max | average): 4.151 | 769231 | 78.37
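One way to read the thread-scaling figures above is as speedup and parallel efficiency relative to the single-threaded run. The sketch below uses only the vspipe fps values from the BlankClip / ColorBarsHD runs; the function name `scaling` is an illustrative choice and not part of any tool.

```python
# fps values copied from the BlankClip / ColorBarsHD vspipe runs above
vs_fps = {1: 9.15, 4: 33.88, 8: 60.08, 12: 74.78, 16: 80.69}

def scaling(fps_by_threads: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Map thread count -> (speedup over 1 thread, parallel efficiency)."""
    base = fps_by_threads[1]
    return {
        n: (fps / base, fps / (base * n))
        for n, fps in sorted(fps_by_threads.items())
    }

for threads, (speedup, efficiency) in scaling(vs_fps).items():
    print(f"{threads:2d} threads: {speedup:5.2f}x speedup, {efficiency:6.1%} efficiency")
```

On these numbers, efficiency falls from roughly 93% at 4 threads to roughly 55% at 16 threads, so extra threads still help but with diminishing returns.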

NSQY commented 2 years ago

One thing I noticed when looking at --filter-time output is that this function uses a lot of BoxBlur calls. std.Convolution is much faster when BoxBlur is using a radius of 1. I do not know about other radii.

import vapoursynth as vs
core = vs.core

clip = core.std.BlankClip(width=1920, height=1080, format=vs.YUV420P16, length=240)

box = core.std.BoxBlur(clip, hradius=1, hpasses=1, vradius=1, vpasses=1, planes=[0, 1, 2])
con = core.std.Convolution(clip, matrix=[1]*9, planes=[0, 1, 2])

box.set_output(1)
con.set_output(2)

vspipe -p boxblur.vpy -o 1 .
Script evaluation done in 0.07 seconds
Output 240 frames in 0.54 seconds (443.65 fps)

vspipe -p boxblur.vpy -o 2 .
Script evaluation done in 0.09 seconds
Output 240 frames in 0.23 seconds (1031.13 fps)
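The reason the two calls above can be compared at all is that a radius-1 box blur and a 3x3 convolution with matrix=[1]*9 compute the same neighbourhood sum. The pure-Python sketch below checks this on interior pixels only, using integer sums. It ignores edge handling and intermediate rounding, both of which may differ between the real filters, so it illustrates the arithmetic rather than reproducing either plugin.

```python
def box_sum_2d(img, y, x):
    """Sum of the 3x3 neighbourhood: a convolution with matrix=[1]*9 before dividing by 9."""
    return sum(img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1))

def box_sum_separable(img):
    """Horizontal 3-tap sum followed by a vertical 3-tap sum (interior pixels only)."""
    h, w = len(img), len(img[0])
    horiz = [[sum(row[x - 1:x + 2]) for x in range(1, w - 1)] for row in img]
    return [
        [horiz[y - 1][x] + horiz[y][x] + horiz[y + 1][x] for x in range(w - 2)]
        for y in range(1, h - 1)
    ]

# Small deterministic test image (values are arbitrary).
image = [[(7 * y + 3 * x * x) % 256 for x in range(6)] for y in range(5)]

separable = box_sum_separable(image)
direct = [[box_sum_2d(image, y, x) for x in range(1, 5)] for y in range(1, 4)]
print(separable == direct)
```

The separable form needs 3 + 3 additions per pixel instead of 9, which is why box blurs tend to win at large radii, while a fixed 3x3 kernel can be cheaper in practice at radius 1.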

WolframRhodium commented 2 years ago

fvsfunc raises ValueError: GradFun3: "radius" must be in 2-9 for smode=0 or 3 !.

muf.BoxBlur() is already optimized for small radius https://github.com/WolframRhodium/muvsfunc/blob/0f9b43e6cadd05d3e617231e6b8d6e27cb24a059/muvsfunc.py#L1888-L1889

I do notice a bug in my implementation and it will be fixed later.

As for the single-threaded performance, I don't think VapourSynth is designed to optimize for latency rather than throughput.

NSQY commented 2 years ago

As for the single-threaded performance, I don't think VapourSynth is designed to optimize for latency rather than throughput.

You make a salient point. I had forgotten about PreFetch(). However, as per my updated comment muvsfunc.GradFun3 with 16 threads is slower than Avisynth with 4.

muf.BoxBlur() is already optimized for small radius

I missed this. Thanks.

fvsfunc raises ValueError: GradFun3: "radius" must be in 2-9 for smode=0 or 3 !.

We can just set radius to 9 for this, but the difference between muvsfunc and fvsfunc seems to be an fmtc.bitdepth call.

for x in {1..2}; do
    time vspipe -p gradfun.vpy -o $x .
done
Script evaluation done in 0.10 seconds
Output 240 frames in 2.54 seconds (94.65 fps)
vspipe -p gradfun.vpy -o $x .   53.70s  user 1.48s system 1930% cpu 2.858 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                1631 MB
page faults from disk:     0
other page faults:         690306

Script evaluation done in 0.08 seconds
Output 240 frames in 2.58 seconds (92.85 fps)
vspipe -p gradfun.vpy -o $x .   52.82s  user 1.66s system 1869% cpu 2.914 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                1708 MB
page faults from disk:     0
other page faults:         735205
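The `time` figures above also show how much CPU time each run consumed, which is the quantity under discussion. A small sketch of the arithmetic follows. The labels "output 1" and "output 2" simply follow the loop order, and `busy_cores` is a hypothetical helper name.

```python
def busy_cores(user_s: float, system_s: float, wall_s: float) -> float:
    """Average number of cores kept busy: CPU seconds divided by wall-clock seconds."""
    return (user_s + system_s) / wall_s

# (user, system, total) seconds copied from the two runs above
runs = {
    "output 1": (53.70, 1.48, 2.858),
    "output 2": (52.82, 1.66, 2.914),
}
for name, (user_s, system_s, wall_s) in runs.items():
    cores = busy_cores(user_s, system_s, wall_s)
    print(f"{name}: {user_s + system_s:.2f} CPU seconds, about {cores:.1f} cores busy")
```

Both runs keep about 19 cores busy on average, which is consistent with the 1930% and 1869% CPU figures reported by the shell.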

WolframRhodium commented 2 years ago

Thanks for the details. It seems that fmtc.bitdepth is a bottleneck. The performance should be improved after a later commit.

Anyway, std.BoxBlur is also a bottleneck, and ex_GradFun3 utilizes a different implementation (GaussianBlur).

NSQY commented 2 years ago

Relevant https://github.com/vapoursynth/vapoursynth/issues/787

WolframRhodium commented 2 years ago

Gaussian blur can be implemented using tcanny.TCanny (which is what ex_GradFun3 seems to use), but I want to keep the function in its original form as much as possible.

In terms of fvsfunc's implementation, it should be equivalent to the current implementation:

import vapoursynth as vs
import muvs
from muvs import core

import muvsfunc
import fvsfunc
muvs.pollute()

img = core.imwri.Read(r'banding.png')
img = core.resize.Bicubic(img, format=vs.YUV420P16, matrix_s='709').std.Loop(240)

with open("test1.vpy", "w") as f:
    with muvs.record(f):
        gradfun = muvsfunc.GradFun3(img, thr=0.35, thrc=0.35, radius=9, elast=3.0, elastc=3.0, mask=2, smode=0)

with open("test2.vpy", "w") as f:
    with muvs.record(f):
        gradfun2 = fvsfunc.GradFun3(img, thr=0.35, thrc=0.35, radius=9, elast=3.0, elastc=3.0, mask=2, smode=0)

gradfun.set_output()

Just run this script and check the two output files, which record the actual function calls.

NSQY commented 2 years ago

It's a shame --filter-time in vspipe isn't more developed.

Apologies if I'm being obtuse, but:

Your code sample returns: AttributeError: module 'muvs' has no attribute 'recorder'. Did you mean: 'Recorder'?

After taking a look at your wiki, running my own function works:

from functools import partial
import vapoursynth as vs

import muvs
from muvs import core

import muvsfunc
import fvsfunc

muvs.pollute()

def test(clip: vs.VideoNode) -> vs.VideoNode:
    from vsutil import iterate

    return iterate(clip, partial(core.std.Invert), 5)

img = core.imwri.Read(r'banding.png')
img = core.resize.Bicubic(img, format=vs.YUV420P16, matrix_s='709').std.Loop(240)

with muvs.record("transformed_code.vpy") as recorder:
    test(img)
    recorder.write("# label")

However if I replace test(img) with muvsfunc.GradFun3 or similar functions:

    muvsfunc.GradFun3(img)
  File "/usr/lib/python3.10/site-packages/muvsfunc.py", line 602, in GradFun3
    src_8 = core.fmtc.bitdepth(src, bits=8, dmode=1, planes=[0]) if bits != 8 else src
  File "/usr/lib/python3.10/site-packages/muvs.py", line 276, in closure
    recorder.buffer.append(self._get_str(func, args, kwargs, output) + '\n')
  File "/usr/lib/python3.10/site-packages/muvs.py", line 322, in _get_str
    call_args = ', '.join(f"{k}={_repr(v)}" for k, v in args_dict.items() if v is not None)
  File "/usr/lib/python3.10/site-packages/muvs.py", line 322, in <genexpr>
    call_args = ', '.join(f"{k}={_repr(v)}" for k, v in args_dict.items() if v is not None)
  File "/usr/lib/python3.10/site-packages/muvs.py", line 182, in closure
    elif isinstance(obj, vs.Format):
AttributeError: module 'vapoursynth' has no attribute 'Format'

NSQY commented 2 years ago

Ah, this should be vs.VideoFormat I think. After changing this, I can run muvsfunc.GradFun3(img) in the example I've given. ~~But your snippet still errors.~~ https://github.com/WolframRhodium/muvsfunc/blob/8221fbcc98a1b212623d5d53501cbafb922c7846/muvs.py#L182

Edit: Yes I was being obtuse. It's record not recorder. You must have unpushed changes. What a useful function!

WolframRhodium commented 2 years ago

Oh, I made a typo. Thanks for your correction.

NSQY commented 2 years ago

Gaussian blur can be implemented using tcanny.TCanny (which is what ex_GradFun3 seems to use)

Small correction: it appears that ex_GradFun3 is using ex_GaussianBlur, which is an approximate Gaussian blur and therefore faster than tcanny.

https://github.com/Dogway/Avisynth-Scripts/blob/master/ExTools.avsi#L1631

8 threads:
Output 1000 frames in 1.31 seconds (766.22 fps)
FPS (min | max | average): 102.4 | 909091 | 1145

Pay no mind to the 'max' speed of these results; as far as I can tell the average is accurate.

Also, for some reason tcanny is 10% faster when using 8 threads instead of my CPU's native thread count (24)? Tracking: https://github.com/HomeOfVapourSynthEvolution/VapourSynth-TCanny/issues/12

NSQY commented 2 years ago

Actually, a friend of mine has pointed out that the Dither tools GradFun3 is even faster when using ConvertToStacked(). All testing is done with radius=9 for parity with fvsfunc unless otherwise stated.

ColorBars(width=1920, height=1080, pixel_type="YUV420P16")
Trim(0, 240)

dithertools = ConvertToStacked()
dithertools = GradFun3(dithertools, thr=0.35, thrc=0.35, radius=9, elast=3.0, elastc=3.0, mask=2, smode=0, lsb_in=true, lsb=true)
dithertools = ConvertFromStacked(dithertools)

exMod = ex_GradFun3(thr=0.35, thrc=0.35, radius=9, elast=3.0, elastc=3.0, mask=2, smode=0)

dithertools
PreFetch(8)

8 threads:

muvsfunc git master: Output 240 frames in 3.66 seconds (65.58 fps)

dithertools FPS (min | max | average): 5.833 | 909091 | 174.5

ex_GradFun3 FPS (min | max | average): 11.07 | 833333 | 121.4

WolframRhodium commented 2 years ago

Hi, please take a look at this box blur implementation.

It is a plugin replacement for std.BoxBlur (=> box.Blur) and runs faster on Intel platforms. (Slower on Zen 3, sadly.)

NSQY commented 2 years ago

I have a Zen 2 CPU, so I may not be an ideal candidate. I'm not sure which of the modes at https://github.com/Dogway/Avisynth-Scripts/blob/master/ExTools.avsi#L957 are most equivalent. Please advise.

I have no idea how ex_boxblur could possibly be faster than std.Convolution?

Using the Windows artifact for this.

import vapoursynth as vs
core = vs.core
core.num_threads = 16

img = core.std.BlankClip(width=1920, height=1080, format=vs.YUV420P16, fpsnum=30000, fpsden=1001, length=2000)

std = core.std.BoxBlur(img, planes=[0, 1, 2])
con = core.std.Convolution(img, matrix=[1]*9, planes=[0, 1, 2])
box = core.box.Blur(img, planes=[0, 1, 2])

std.set_output(1)
con.set_output(2)
box.set_output(3)

ColorBars(width=1920, height=1080, pixel_type="YUV420P16")
Trim(0, 2000)

ex_boxblur(1,mode="weighted",UV=3) 
PreFetch(8)

FPS (min | max | average):          83.41 | 434783 | 1560

vspipe -p .\vs_boxblur.vpy -o 1 .
Script evaluation done in 0.04 seconds
Output 2000 frames in 5.11 seconds (391.64 fps)

vspipe -p .\vs_boxblur.vpy -o 2 .
Script evaluation done in 0.08 seconds
Output 2000 frames in 1.84 seconds (1085.66 fps)

vspipe -p .\vs_boxblur.vpy -o 3 .
Script evaluation done in 0.04 seconds
Output 2000 frames in 5.22 seconds (382.99 fps)

WolframRhodium commented 2 years ago

Thanks. Will replacing all std.BoxBlur with box.Blur in muvsfunc make your original script run faster? And what about smode=1?

NSQY commented 2 years ago

No, it does not appear that it would. Please re-evaluate my post.

I am facing the same issue that I have described here: https://github.com/HomeOfVapourSynthEvolution/VapourSynth-TCanny/issues/12

Increasing thread counts past 6 causes CPU and memory usage to increase, but performance stays the same. Is this an upstream issue with Vapoursynth? Or is my methodology flawed?

core.box.Blur(img, planes=[0, 1, 2]):

core.num_threads = 6, approx. 25% CPU usage and 250 MB of memory as reported by Task Manager.
Output 2000 frames in 5.46 seconds (366.60 fps)

core.num_threads = 16, approx. 50% CPU usage and 500 MB.
Output 2000 frames in 5.22 seconds (383.17 fps)

NSQY commented 2 years ago

VapourSynth Classic is faster (https://github.com/AmusementClub/VapourSynth-Portable-Maker), but the issue persists.

core.num_threads = 6

.\VSPipe.exe -p .\boxblur.vpy -o 1 .
Script evaluation done in 0.03 seconds
Output 2000 frames in 3.83 seconds (521.67 fps)
.\VSPipe.exe -p .\boxblur.vpy -o 2 .
Script evaluation done in 0.03 seconds
Output 2000 frames in 1.37 seconds (1459.02 fps)
 .\VSPipe.exe -p .\boxblur.vpy -o 3 .
Script evaluation done in 0.03 seconds
Output 2000 frames in 4.11 seconds (486.77 fps)

core.num_threads = 16

 .\VSPipe.exe -p .\boxblur.vpy -o 1 .
Script evaluation done in 0.03 seconds
Output 2000 frames in 4.10 seconds (487.37 fps)
 .\VSPipe.exe -p .\boxblur.vpy -o 2 .
Script evaluation done in 0.04 seconds
Output 2000 frames in 1.63 seconds (1228.84 fps)
 .\VSPipe.exe -p .\boxblur.vpy -o 3 .
Script evaluation done in 0.03 seconds
Output 2000 frames in 4.58 seconds (436.73 fps)

NSQY commented 2 years ago

And what about smode=1?

muvsfunc.GradFun3 using smode=1 is faster than both Avisynth versions. Using upstream VS here (not VS-Classic)

smode=1:

dithertools: FPS (min | max | average): 0.824 | 357143 | 12.56
ex_GradFun3: FPS (min | max | average): 2.179 | 333333 | 19.85

muvsfunc.GradFun3
8 threads: Output 240 frames in 9.37 seconds (25.61 fps)
16 threads: Output 240 frames in 6.16 seconds (38.99 fps)

smode=2:

dithertools: FAIL
ex_GradFun3: FAIL ex_bilateral: Radius should be 6 or below

muvsfunc.GradFun3
8 threads: Output 240 frames in 4.25 seconds (56.51 fps)
16 threads: Output 240 frames in 3.03 seconds (79.23 fps)

smode=3:

dithertools: FPS (min | max | average): 9.336 | 370370 | 178.9
ex_GradFun3: FPS (min | max | average): 11.78 | 454545 | 135.8

muvsfunc.GradFun3
8 threads: Output 240 frames in 3.09 seconds (77.74 fps)
16 threads: Output 240 frames in 2.55 seconds (94.06 fps)

WolframRhodium commented 2 years ago

No, it does not appear that it would. Please re-evaluate my post.

I am facing the same issue that I have described here: HomeOfVapourSynthEvolution/VapourSynth-TCanny#12

Increasing thread counts past 6 causes CPU and memory usage to increase, but performance stays the same. Is this an upstream issue with Vapoursynth? Or is my methodology flawed?

core.box.Blur(img, planes=[0, 1, 2]):

core.num_threads = 6, approx. 25% CPU usage and 250 MB of memory as reported by Task Manager.
Output 2000 frames in 5.46 seconds (366.60 fps)

core.num_threads = 16, approx. 50% CPU usage and 500 MB.
Output 2000 frames in 5.22 seconds (383.17 fps)

Maybe core.max_cache_size should be set to a larger threshold?

NSQY commented 2 years ago

Above comment updated with other smodes.

Maybe core.max_cache_size should be set to a larger threshold?

Please advise, what is an acceptable 'larger' threshold? Using a 3900X. I see no applicable difference between setting it to 1, 50, or 200 when using 16 threads.

WolframRhodium commented 2 years ago

Please advise, what is an acceptable 'larger' threshold? Using a 3900X. I see no applicable difference between setting it to 1, 50, or 200 when using 16 threads.

Something like 8000 or higher? The unit of it is megabyte.

NSQY commented 2 years ago

Performance with 2000, 4000, 6000, 8000, and 12000 is within the expected deviation of the results at 1, 50, or 200.

NSQY commented 2 years ago

Please advise, what is an acceptable 'larger' threshold? Using a 3900X. I see no applicable difference between setting it to 1, 50, or 200 when using 16 threads.

Something like 8000 or higher? The unit of it is megabyte.

After some discussion, from what I have been told it is not an issue of memory but CPU cache, which is why this did nothing I suppose. Calling std.BlankClip(length=n) is also slower than std.BlankClip(length=1)*n, or std.Loop(n). Would looping with core.std.BlankClip(clip, length=1).std.Loop(clip.num_frames) result in a minor improvement across the board?

When running light filters on my system (std.Convolution, std.BoxBlur, box.Blur), the ideal thread count is somewhere between 3 and 6 depending on the input depth (BoxBlur is faster than box.Blur with 12 threads, but slower with 4). Beyond 6 threads, performance degrades. It is only when using compute-heavy filters (bm3dcpu.BM3D) or otherwise complex functions that I see a consistent correlation between thread count and processing speed.

I do not have an ideal way of testing filter performance between std.BoxBlur and box.Blur independently, so it might be best to just replace all the BoxBlur calls in relevant functions and run the numbers on those.

muvsfunc.GradFun3 is between 5 and 10 fps faster in synthetic testing (depending on threads, difference is more noticeable at higher thread counts) when using std.BlankClip(length=1).std.Loop(n) over std.BlankClip(length=n).

NSQY commented 2 years ago

Thanks. Will replacing all std.BoxBlur with box.Blur in muvsfunc make your original script run faster?

After replacing std.BoxBlur in BoxFilter, performance is significantly improved at low thread counts. This may be a more realistic use case, as other filters will be eating CPU resources in real scripts.

muvsfunc.GradFun3(clip, thr=0.35, thrc=0.35, radius=9, elast=3.0, elastc=3.0, mask=2, smode=0):

std.BoxBlur
2 threads: Output 240 frames in 7.12 seconds (33.69 fps)
4 threads: Output 240 frames in 3.97 seconds (60.51 fps)

box.Blur
2 threads: Output 240 frames in 2.79 seconds (85.91 fps)
4 threads: Output 240 frames in 2.20 seconds (109.03 fps)

WolframRhodium commented 2 years ago

Would looping with core.std.BlankClip(clip, length=1).std.Loop(clip.num_frames) result in a minor improvement across the board?

It should be, because only a single frame is accessed.
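The "only a single frame is accessed" point can be illustrated with a toy model of a cached frame source. This is not VapourSynth code. It assumes, for illustration only, that a cache keyed on frame number sits in front of the source, and it counts how many frames actually have to be generated in each case.

```python
from functools import lru_cache

generated = 0

@lru_cache(maxsize=None)
def source_frame(n: int) -> bytes:
    """Toy frame source: counts how often a frame is actually produced."""
    global generated
    generated += 1
    return bytes(16)  # placeholder payload, not a real frame

def render(length: int, loop_single: bool) -> int:
    """Request `length` frames and return how many had to be generated."""
    global generated
    source_frame.cache_clear()
    generated = 0
    for n in range(length):
        # A looped one-frame clip always maps the request back to frame 0.
        source_frame(0 if loop_single else n)
    return generated

print(render(240, loop_single=False), render(240, loop_single=True))
```

In this model, a 240-frame source generates 240 frames, while a looped single frame generates one and serves the rest from cache.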

After replacing std.BoxBlur in BoxFilter, performance is significantly improved at low thread counts. This may be a more realistic use case, as other filters will be eating CPU resources in real scripts.

muvsfunc.GradFun3(clip, thr=0.35, thrc=0.35, radius=9, elast=3.0, elastc=3.0, mask=2, smode=0):

std.BoxBlur
2 threads: Output 240 frames in 7.12 seconds (33.69 fps)
4 threads: Output 240 frames in 3.97 seconds (60.51 fps)

box.Blur
2 threads: Output 240 frames in 2.79 seconds (85.91 fps)
4 threads: Output 240 frames in 2.20 seconds (109.03 fps)

This result is better than I expected. Thanks for the detailed benchmark.

NSQY commented 2 years ago

Make of this what you will.

[Attached image: chart2]

WolframRhodium commented 2 years ago

Box blur with a radius of 1 or 2 is expected to be faster on std.Convolution than the other two implementations. Which value of radius is tested?

NSQY commented 2 years ago

import vapoursynth as vs
core = vs.core

clip = core.std.BlankClip(width=1920, height=1080, format=vs.YUV420P8, length=1)*2400

con = core.std.Convolution(clip, matrix=[1]*9, planes=[0, 1, 2])
std = core.std.BoxBlur(clip, planes=[0, 1, 2])
box = core.box.Blur(clip, planes=[0, 1, 2])

con.set_output(0)
std.set_output(1)
box.set_output(2)

WolframRhodium commented 2 years ago

This result is expected, which is why the Python module should dispatch to the appropriate plugin:

muf.BoxBlur() is already optimized for small radius

https://github.com/WolframRhodium/muvsfunc/blob/0f9b43e6cadd05d3e617231e6b8d6e27cb24a059/muvsfunc.py#L1888-L1889
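As a rough idea of what dispatching by radius could look like, the sketch below picks a backend name and arguments from the radius. It is a hypothetical helper, not the code at the link above. The radius <= 2 threshold follows the earlier remark that std.Convolution is expected to win for a radius of 1 or 2, and those two cases correspond to the 3x3 and 5x5 square matrices that std.Convolution accepts.

```python
def pick_box_blur_backend(radius: int):
    """Hypothetical dispatcher: returns (backend name, keyword arguments).

    Not muvsfunc's actual code; the threshold is an assumption taken
    from the discussion in this thread.
    """
    if radius < 1:
        raise ValueError("radius must be >= 1")
    if radius <= 2:
        side = 2 * radius + 1  # 3x3 or 5x5 kernel
        return "std.Convolution", {"matrix": [1] * (side * side)}
    return "std.BoxBlur", {"hradius": radius, "vradius": radius}

for r in (1, 2, 9):
    name, kwargs = pick_box_blur_backend(r)
    print(r, name, sorted(kwargs))
```

A real wrapper would call the chosen plugin function directly, and could add box.Blur as a third branch where benchmarks on the target CPU justify it.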

NSQY commented 2 years ago

I understand this, but I am just providing more information about how different filters behave at n thread counts (on my CPU) - which is relevant for comparison with AVS filters.

WolframRhodium commented 2 years ago

It's strange that the performance of box.Blur drops as the thread count increases. Maybe that's related to cache misses.

dnjulek commented 11 months ago

You can try vapoursynth-zboxblur, which uses the same boxblur as FFmpeg, thus avoiding the need for transpose.

WolframRhodium / muvsfunc

Potential for optimization: Gradfun3 #42