Open Boulder08 opened 4 years ago
Only if AMD changed the way they signal AVX2 support. I don't think they did?
Because of the parameters you used, neither Super nor Degrain1 is using the new AVX2 code, which means it's Analyse that got slower.
Do you see a difference between v21 and v22 when you run Analyse on a 16 bit clip?
The difference seems to be consistent.
Analyse 16-bits, v22: 26.22 fps
Analyse 16-bits, v21: 28.07 fps
Analyse 8-bits, v22: 55.74 fps
Analyse 8-bits, v21: 57.96 fps
Which functionalities in MSuper or MDegrainx should be optimized? I could test them as well.
Degrain with 8 bit clips, Super with sharp=0 or 2.
Same thing with those, v21 is faster.
sharp=2, v22: 55.97 fps
sharp=2, v21: 58.17 fps
sharp=2, Degrain 8-bits, v22: 64.51 fps
sharp=2, Degrain 8-bits, v21: 66.67 fps
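To put the figures reported so far side by side, the relative change from v21 to v22 can be worked out directly. This is only a small sketch using the fps numbers quoted in this thread; the helper name `relative_change` is made up for illustration.

```python
# fps figures quoted in this thread, stored as (v21, v22)
results = {
    "Analyse 16-bits": (28.07, 26.22),
    "Analyse 8-bits": (57.96, 55.74),
    "sharp=2": (58.17, 55.97),
    "sharp=2, Degrain 8-bits": (66.67, 64.51),
}


def relative_change(old_fps, new_fps):
    """Return the change from old to new as a percentage of old."""
    return (new_fps - old_fps) / old_fps * 100.0


changes = {name: relative_change(v21, v22) for name, (v21, v22) in results.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}% (v22 vs v21)")
```

A negative percentage means v22 is slower than v21 for that test.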
Just for fun, I checked what x264 shows: x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
So at least it's working properly.
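For anyone who wants to script the same sanity check, here is a rough sketch that pulls the capability tokens out of an x264 log line like the one above. The function name `parse_x264_capabilities` is hypothetical, and this only tells you what x264 detected, not what the mvtools plugin detects at runtime.

```python
def parse_x264_capabilities(line):
    """Extract the capability tokens from an x264 'using cpu capabilities' line."""
    marker = "using cpu capabilities:"
    _, found, tail = line.partition(marker)
    if not found:
        # The marker is absent, so this is not a capabilities line.
        return set()
    return set(tail.split())


log_line = "x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2"
caps = parse_x264_capabilities(log_line)
print("AVX2 reported:", "AVX2" in caps)
```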
Is v22 compiled with Visual Studio faster than v21? See attached.
Yes, it seems to be faster. Compared to those first tests with 8-bit Analyse and 16-bit degraining, I got 60.43 fps as the result.
Tried compiling with GCC 9 on Linux. v22 is running faster than v21 for me. Maybe the issue is related to MinGW and cross-compilation.
Script from Doom9 thread:
import vapoursynth as vs
core = vs.core
core.num_threads = 1
core.std.LoadPlugin("/home/user/src/vapoursynth-mvtools/.libs/libmvtools.so")
c = core.std.BlankClip(format=vs.YUV420P8) * 100
s = core.mv.Super(c, pel=2, chroma=True, rfilter=4, sharp=1)
kwargs = {"blksize": 16, "overlap": 8, "search": 5, "searchparam": 8, "pelsearch": 8, "truemotion": False}
b1v = core.mv.Analyse(s, isb=True, delta=1, **kwargs)
f1v = core.mv.Analyse(s, isb=False, delta=1, **kwargs)
kwargs = {"thsad": 200, "thsadc": 100, "limit": 1, "limitc": 2, "thscd1": 300, "thscd2": 80}
c = core.mv.Degrain1(c, s, b1v, f1v, **kwargs)
c.set_output()
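One possible way to get an fps number out of a script like this without an external tool is to time frame requests from Python. The sketch below is an assumption about how one could measure it, not how the numbers in this thread were produced; `measure_fps` and `fake_get_frame` are illustrative names. With VapourSynth installed, passing the clip's `get_frame` method (for example `c.get_frame`) in place of the stand-in should work, since it takes a frame number and blocks until that frame is ready.

```python
import time


def measure_fps(get_frame, num_frames):
    """Request frames 0..num_frames-1 one at a time and return frames per second."""
    start = time.perf_counter()
    for n in range(num_frames):
        get_frame(n)
    elapsed = time.perf_counter() - start
    return num_frames / elapsed


# Stand-in for a real clip so the sketch runs without VapourSynth installed.
def fake_get_frame(n):
    return sum(range(10_000))


fps = measure_fps(fake_get_frame, 50)
print(f"{fps:.2f} fps")
```

Requesting frames one by one keeps the measurement single-threaded, which matches the `core.num_threads = 1` setting in the script.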
Profiler results. Units are perf "cycles" events, which is a proxy for time. In this script, the AVX2 code is offering negligible speedup, because the bulk of the compute is not in SIMD code anyway, due to the mv.Super mode. The fps gains are instead coming from templating and specializing the control flow for the motion estimation.
Kernels

sym | v21 | v23 | v23/v21 |
---|---|---|---|
HorizontalBicubic | 34449 | 43948 | 1.275741 |
VerticalBicubic | 17062 | 17427 | 1.021393 |
ToPixels_uint16_t_uint8_t | 11051 | 12583 | 1.13863 |
SADWrapperU8_AVX2<16u, 16u>::sad_u8_avx2 | 6707 | 8806 | 1.312957 |
__memset_avx2_erms | 7171 | 7903 | 1.102078 |
SADWrapperU8<8u, 8u>::sad_u8_sse2 | 12621 | 6507 | 0.515569 |
Degrain_avx2<1, 16, 16> | 9277 | 6028 | 0.649779 |
Degrain_avx2<1, 8, 8> | 5395 | 4100 | 0.759963 |
RB2Cubic | 4513 | 3595 | 0.796588 |
copyBlock<16u, 16u> | 3974 | 3351 | 0.843231 |
overlaps_avx2<16, 16> | 5026 | 3166 | 0.629924 |
overlaps_avx2<8, 8> | 2753 | 2484 | 0.902288 |
copyBlock<8u, 8u> | 2755 | 2294 | 0.832668 |
__memmove_avx_unaligned_erms | 1934 | 1923 | 0.994312 |
PadReferenceFrame | 895 | 1206 | 1.347486 |
LimitChanges_sse2 | 930 | 914 | 0.982796 |
SUM | 126513 | 126235 | 0.997803 |

Control Flow, v21

sym | cycles |
---|---|
pobExpandingSearch | 41311 |
pobSearchMVs | 32305 |
pobUMHSearch | 25482 |
mvdegrainGetFrame<1> | 17290 |
pobInterpolatePrediction | 11247 |
mvpGetAbsolutePointerPel2 | 3681 |
pobHex2Search | 2951 |
pobLumaSAD | 2006 |
mvpGetAbsolutePointerPel1 | 1989 |
mvpGetAbsolutePointer | 1455 |
pobRefine | 1331 |
SUM | 141048 |

Control Flow, v23

sym | cycles |
---|---|
pobExpandingSearch<0, 0> | 36834 |
pobUMHSearch<0, 1> | 28107 |
mvdegrainGetFrame<1> | 13456 |
doPobSearchMVs<0, 1> | 11239 |
pobFetchPredictors | 6858 |
pobInterpolatePrediction | 5472 |
pobExpandingSearch<0, 1> | 4954 |
doPobSearchMVs<0, 0> | 3511 |
mvpGetAbsolutePointerPel2 | 2903 |
pobHex2Search<0, 1> | 2792 |
pobGetRefBlockU<1> | 1970 |
mvpGetAbsolutePointerPel1 | 1938 |
pobGetRefBlockV<1> | 1883 |
mvpGetAbsolutePointer | 1606 |
pobRefine<0, 1> | 802 |
SUM | 124325 |
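The claim that the gain comes from control flow rather than from the kernels can be checked from the totals alone. A quick sketch using the three total figures from the profile above (126513 and 126235 for the kernels, 141048 and 124325 for control flow); the `ratio` helper is just for illustration:

```python
# perf "cycles" totals copied from the profiler tables above
kernels = {"v21": 126513, "v23": 126235}
control_flow = {"v21": 141048, "v23": 124325}


def ratio(counts):
    """v23 cycles divided by v21 cycles; below 1.0 means v23 spent fewer cycles."""
    return counts["v23"] / counts["v21"]


# Combined cycles per version across both groups of symbols.
total = {v: kernels[v] + control_flow[v] for v in ("v21", "v23")}

print(f"kernels:      {ratio(kernels):.4f}")
print(f"control flow: {ratio(control_flow):.4f}")
print(f"combined:     {ratio(total):.4f}")
```

The kernel total barely moves between versions, while the control-flow total drops noticeably, so nearly all of the combined reduction comes from the control-flow side.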
Which compiler flags did you use? (And Autotools or Meson?)
Default autotools build (./configure && make).
Hmm. The default with Makefile.am is -O2. Meson defaults to -O3. I compiled the v22 and v23 DLLs using Meson. (I don't know about the older ones.) Perhaps that's what makes it slower?
I ran some tests with the above script, and for me r22 and r23 are slightly faster than r21 (~4%).
GCC 10 builds are ~10% bigger than GCC 9 but just a tiny bit faster (~2%).
On my Zen 2 CPU I used -march=native -O2 -ftree-vectorize -fdevirtualize-at-ltrans -flto=16 -pipe, but -O2 for GCC 10 is slightly different (it includes -finline-functions now).
@Boulder08 Here is v23 compiled with -O2 instead of -O3. That's the only difference. Please test again. vapoursynth-mvtools-v23-O2-win64.zip
As I measured here: https://forum.doom9.org/showthread.php?p=1910541#post1910541 , the new version with speed improvements seems to be slower than the previous one. Are the CPU instruction sets properly detected? I noticed that the part doing the job is quite old and may not be up to it with these new-gen AMD Ryzens (I'm running a 3900X).