dubhater / vapoursynth-mvtools

Motion compensation and stuff

v22 slower than v21? #47

Open Boulder08 opened 4 years ago

Boulder08 commented 4 years ago

As I measured here: https://forum.doom9.org/showthread.php?p=1910541#post1910541, the new version with speed improvements seems to be slower than the previous one. Are the CPU instruction sets properly detected? I noticed that the code doing the detection is quite old and may not handle these new-gen AMD Ryzens correctly (I'm running a 3900X).

dubhater commented 4 years ago

Only if AMD changed the way they signal AVX2 support. I don't think they did?

Because of the parameters you used, neither Super nor Degrain1 are using the new AVX2 code, which means it's Analyse that got slower.

Do you see a difference between v21 and v22 when you run Analyse on a 16 bit clip?

Boulder08 commented 4 years ago

The difference seems to be consistent.

Analyse 16-bits, v22: 26.22 fps
Analyse 16-bits, v21: 28.07 fps
Analyse 8-bits, v22: 55.74 fps
Analyse 8-bits, v21: 57.96 fps
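
The relative gap between the two builds follows directly from the fps figures above. A minimal Python sketch of that arithmetic, where the helper name `percent_change` is made up for illustration and the numbers are the ones reported in this comment:

```python
# Relative speed change of v22 versus v21, using the fps figures reported above.
# `percent_change` is a hypothetical helper, not part of any tool in this thread.

def percent_change(new_fps: float, old_fps: float) -> float:
    """Return the change from old_fps to new_fps as a percentage."""
    return (new_fps / old_fps - 1.0) * 100.0

results = {
    "Analyse 16-bit": {"v21": 28.07, "v22": 26.22},
    "Analyse 8-bit": {"v21": 57.96, "v22": 55.74},
}

changes = {
    name: percent_change(fps["v22"], fps["v21"]) for name, fps in results.items()
}

for name, change in changes.items():
    print(f"{name}: v22 is {change:+.2f}% relative to v21")
```

Negative values mean v22 is slower than v21 for that case.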

Which functionalities in MSuper or MDegrainx should be optimized? I could test them as well.

dubhater commented 4 years ago

Degrain with 8 bit clips, Super with sharp=0 or 2.

Boulder08 commented 4 years ago

Same thing with those, v21 is faster.

sharp=2, v22: 55.97 fps
sharp=2, v21: 58.17 fps
sharp=2, Degrain 8-bits, v22: 64.51 fps
sharp=2, Degrain 8-bits, v21: 66.67 fps

Boulder08 commented 4 years ago

Just for fun, I checked what x264 shows:

x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2

So at least it's working properly.
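
For anyone who wants to check such a log line programmatically, here is a small Python sketch that extracts the capability list from the x264 output quoted above. The helper name `parse_x264_caps` is hypothetical, and the result only reflects what x264 itself detects, not what mvtools detects at runtime.

```python
# Parse the capability list from an x264 "[info]" line like the one quoted above.
# `parse_x264_caps` is a hypothetical helper written for illustration only.

def parse_x264_caps(line: str) -> set[str]:
    """Return the set of capability names listed after the marker text."""
    marker = "using cpu capabilities:"
    _, found, tail = line.partition(marker)
    if not found:
        # The marker is absent, so there is nothing to parse.
        return set()
    return set(tail.split())

line = "x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2"
caps = parse_x264_caps(line)
print("AVX2 reported by x264:", "AVX2" in caps)
```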

sekrit-twc commented 4 years ago

Is v22 compiled with Visual Studio faster than v21? See attached.

vapoursynth-mvtools.zip

Boulder08 commented 4 years ago

Yes, it seems to be faster. Compared to those first tests with 8-bit Analyse and 16-bit degraining, I got 60.43 fps as the result.

sekrit-twc commented 4 years ago

Tried compiling with GCC 9 on Linux. v22 is running faster than v21 for me. Maybe the issue is related to MinGW and cross-compilation.

Script from Doom9 thread:

import vapoursynth as vs

core = vs.core
core.num_threads = 1

core.std.LoadPlugin("/home/user/src/vapoursynth-mvtools/.libs/libmvtools.so")

c = core.std.BlankClip(format=vs.YUV420P8) * 100
s = core.mv.Super(c, pel=2, chroma=True, rfilter=4, sharp=1)

kwargs = {"blksize": 16, "overlap": 8, "search": 5, "searchparam": 8, "pelsearch": 8, "truemotion": False}
b1v = core.mv.Analyse(s, isb=True, delta=1, **kwargs)
f1v = core.mv.Analyse(s, isb=False, delta=1, **kwargs)

kwargs = {"thsad": 200, "thsadc": 100, "limit": 1, "limitc": 2, "thscd1": 300, "thscd2": 80}
c = core.mv.Degrain1(c, s, b1v, f1v, **kwargs)
c.set_output()

Profiler results. Units are perf "cycles" events, which is a proxy for time. In this script, the AVX2 code is offering negligible speedup, because the bulk of the compute is not in SIMD code anyway, due to the mv.Super mode. The fps gains are instead coming from templating and specializing the control flow for the motion estimation.

Kernels

sym                                          v21     v23  v23/v21
HorizontalBicubic                          34449   43948  1.275741
VerticalBicubic                            17062   17427  1.021393
ToPixels_uint16_t_uint8_t                  11051   12583  1.13863
SADWrapperU8_AVX2<16u, 16u>::sad_u8_avx2    6707    8806  1.312957
__memset_avx2_erms                          7171    7903  1.102078
SADWrapperU8<8u, 8u>::sad_u8_sse2          12621    6507  0.515569
Degrain_avx2<1, 16, 16>                     9277    6028  0.649779
Degrain_avx2<1, 8, 8>                       5395    4100  0.759963
RB2Cubic                                    4513    3595  0.796588
copyBlock<16u, 16u>                         3974    3351  0.843231
overlaps_avx2<16, 16>                       5026    3166  0.629924
overlaps_avx2<8, 8>                         2753    2484  0.902288
copyBlock<8u, 8u>                           2755    2294  0.832668
__memmove_avx_unaligned_erms                1934    1923  0.994312
PadReferenceFrame                            895    1206  1.347486
LimitChanges_sse2                            930     914  0.982796
SUM                                       126513  126235  0.997803

Control Flow

v21
pobExpandingSearch                         41311
pobSearchMVs                               32305
pobUMHSearch                               25482
mvdegrainGetFrame<1>                       17290
pobInterpolatePrediction                   11247
mvpGetAbsolutePointerPel2                   3681
pobHex2Search                               2951
pobLumaSAD                                  2006
mvpGetAbsolutePointerPel1                   1989
mvpGetAbsolutePointer                       1455
pobRefine                                   1331
SUM                                       141048

v23
pobExpandingSearch<0, 0>                   36834
pobUMHSearch<0, 1>                         28107
mvdegrainGetFrame<1>                       13456
doPobSearchMVs<0, 1>                       11239
pobFetchPredictors                          6858
pobInterpolatePrediction                    5472
pobExpandingSearch<0, 1>                    4954
doPobSearchMVs<0, 0>                        3511
mvpGetAbsolutePointerPel2                   2903
pobHex2Search<0, 1>                         2792
pobGetRefBlockU<1>                          1970
mvpGetAbsolutePointerPel1                   1938
pobGetRefBlockV<1>                          1883
mvpGetAbsolutePointer                       1606
pobRefine<0, 1>                              802
SUM                                       124325
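
The ratio column and the totals can be recomputed from the cycle counts. A minimal Python sketch using a few rows copied from the profile above; the dictionary layout and variable names are just for illustration:

```python
# Recompute the ratio column and totals from a few rows of the profile above.
# Cycle counts are copied from the tables; the data layout is an assumption
# made for this sketch.

kernels = {
    # symbol: (v21 cycles, v23 cycles)
    "HorizontalBicubic": (34449, 43948),
    "SADWrapperU8<8u, 8u>::sad_u8_sse2": (12621, 6507),
    "Degrain_avx2<1, 16, 16>": (9277, 6028),
}

# A ratio below 1.0 means v23 spent fewer cycles in that symbol than v21.
ratios = {sym: v23 / v21 for sym, (v21, v23) in kernels.items()}

kernel_total_ratio = 126235 / 126513   # SUM row of the kernel table
control_flow_ratio = 124325 / 141048   # SUM rows of the two control-flow tables

for sym, ratio in ratios.items():
    print(f"{sym}: {ratio:.6f}")
print(f"kernels total: {kernel_total_ratio:.6f}")
print(f"control flow total: {control_flow_ratio:.6f}")
```

The kernel total is nearly unchanged between the two versions, while the control-flow total drops by roughly 12%, which matches the statement that the gains come from the specialized control flow rather than from the SIMD kernels.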

dubhater commented 4 years ago

Which compiler flags did you use? (And Autotools or Meson?)

sekrit-twc commented 4 years ago

Default autotools build (./configure && make).

dubhater commented 4 years ago

Hmm. The default with Makefile.am is -O2. Meson defaults to -O3. I compiled the v22 and v23 DLLs using Meson. (I don't know about the older ones.) Perhaps that's what makes it slower?

4re commented 4 years ago

I ran some tests with the above script, and for me r22 and r23 are slightly faster than r21 (~4%).

GCC 10 builds are ~10% bigger than GCC 9 but just a tiny bit faster (~2%).

On my zen2 CPU I used -march=native -O2 -ftree-vectorize -fdevirtualize-at-ltrans -flto=16 -pipe but -O2 for GCC 10 is slightly different (it includes -finline-functions now).

dubhater commented 4 years ago

@Boulder08 Here is v23 compiled with -O2 instead of -O3. That's the only difference. Please test again.

vapoursynth-mvtools-v23-O2-win64.zip

Boulder08 commented 4 years ago

2500 frames of a test script of analysis and degraining in 16 bits:

v23-normal: 12.62 fps
v23-O2: 12.07 fps
v23-clang build from doom9: 13.32 fps

So the -O2 build was definitely slower.