FlowInter: Artifacts at left borders when opt=True

HolyWu commented 5 years ago

super = core.mv.Super(clip)
mvbw = core.mv.Analyse(super, blksize=16, isb=True, overlap=8)
mvfw = core.mv.Analyse(super, blksize=16, isb=False, overlap=8)
clip = core.std.Interleave([core.mv.FlowInter(clip, super, mvbw, mvfw, opt=False), core.mv.FlowInter(clip, super, mvbw, mvfw, opt=True)])

opt=False

opt=True

dubhater commented 5 years ago

Is the clip 8 bit or 16 bit?

Does the same happen with v20?

HolyWu commented 5 years ago

Both 8 and 16 bit have the artifacts. I just tried v20 and it works fine.

BTW, the pixels at right borders are also different.

dubhater commented 5 years ago

Byte order remains difficult to figure out.

I think these two variables are initialised backwards: https://github.com/dubhater/vapoursynth-mvtools/blob/master/src/SimpleResize_AVX2.cpp#L150-L181

If the first parameter to _mm256_set_epi16 sets the least significant word, and the first word in the 16 words loaded by _mm256_loadu_si256 ends up in the least significant position, then those two above definitely backwards. The first column of the "image" needs a minimum of 0 and maximum of simple->limit_width - 0) * pel - 1.

What do you think?

HolyWu commented 5 years ago

I think the first parameter to _mm256_set_epi16 actually sets the most significant word, while that to _mm256_setr_epi16 sets the least significant word.

Here using 128-bit vector for simplicity.

#include <cstdio>
#include <emmintrin.h>

int main(int argc, char ** argv) {
    const __m128i vec = _mm_set_epi16(0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef);

    short ret1[8];
    _mm_storeu_si128((__m128i *)ret1, vec);
    for (int i = 0; i < 8; i++)
        printf("%#.4x\n", ret1[i]);
    printf("\n");

    int * ret2 = (int *)ret1;
    for (int i = 0; i < 4; i++)
        printf("%#.8x\n", ret2[i]);
    printf("\n");

    const short foo[8] = { 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7 };
    const __m128i bar = _mm_loadu_si128((const __m128i *)foo);
    for (int i = 0; i < 4; i++)
        printf("%#.8x\n", bar.m128i_i32[i]);

    return 0;
}

Output:

dubhater commented 5 years ago

Hmm, then those two variables are not the problem.

dubhater commented 5 years ago

I should mention that I don't have AVX2, so I can only read the code.

HolyWu commented 5 years ago

The initial_horizontal_minimum variable was indeed the problem. It was initialized with wrong values. The values should be negative (0, -1, -2, -3, etc.) rather than positive. Fixing that eliminates the artifacts at left borders, but the pixel differences at right borders still persist. No clue what's the culprit at the moment...

dubhater commented 5 years ago

Oh, yeah, they're supposed to be negative.

dubhater commented 4 years ago

In today's episode of "What was I thinking when I wrote that code??"...

Does the problem at the right border go away with this patch? I had a look in gdb and things seem to be correct now, but more testing can't hurt.

diff --git a/src/SimpleResize_AVX2.cpp b/src/SimpleResize_AVX2.cpp
index 39cb01f..86fd295 100644
--- a/src/SimpleResize_AVX2.cpp
+++ b/src/SimpleResize_AVX2.cpp
@@ -121,12 +121,14 @@ static FORCE_INLINE void simpleResize_int16_t_horizontal_8px_avx2(const SimpleRe
     pixels = _mm256_madd_epi16(pixels, dwords_weights_h);
     pixels = _mm256_add_epi32(pixels, _mm256_set1_epi32(simple_resize_weight_half));
     pixels = _mm256_srai_epi32(pixels, simple_resize_weight_shift);
+
+    pixels = _mm256_max_epi32(minimum,
+                              _mm256_min_epi32(pixels, maximum));
+
     pixels = _mm256_packs_epi32(pixels, pixels);

-    pixels = _mm256_max_epi16(minimum,
-                              _mm256_min_epi16(pixels, maximum));
-    minimum = _mm256_sub_epi16(minimum, horizontal_step);
-    maximum = _mm256_sub_epi16(maximum, horizontal_step);
+    minimum = _mm256_sub_epi32(minimum, horizontal_step);
+    maximum = _mm256_sub_epi32(maximum, horizontal_step);

     _mm_storeu_si128((__m128i *)&dstp[x], _mm256_castsi256_si128(_mm256_permute4x64_epi64(pixels, 0xe8))); // 0b11101000
 }
@@ -143,19 +145,11 @@ void simpleResize_int16_t_avx2(const SimpleResize *simple,

     int pel = simple->pel;
     __m256i minimum = _mm256_setzero_si256();
-    __m256i maximum = _mm256_set1_epi16(simple->limit_height * pel - 1);
-    __m256i horizontal_step = _mm256_set1_epi16(horizontal_vectors ? pel * pixels_per_iteration : 0);
-    __m256i vertical_step = _mm256_set1_epi16(horizontal_vectors ? 0 : pel);
-
-    __m256i initial_horizontal_minimum = _mm256_set_epi16(-15 * pel,
-                                                          -14 * pel,
-                                                          -13 * pel,
-                                                          -12 * pel,
-                                                          -11 * pel,
-                                                          -10 * pel,
-                                                          -9 * pel,
-                                                          -8 * pel,
-                                                          -7 * pel,
+    __m256i maximum = _mm256_set1_epi32(simple->limit_height * pel - 1);
+    __m256i horizontal_step = _mm256_set1_epi32(horizontal_vectors ? pel * pixels_per_iteration : 0);
+    __m256i vertical_step = _mm256_set1_epi32(horizontal_vectors ? 0 : pel);
+
+    __m256i initial_horizontal_minimum = _mm256_set_epi32(-7 * pel,
                                                           -6 * pel,
                                                           -5 * pel,
                                                           -4 * pel,
@@ -163,15 +157,7 @@ void simpleResize_int16_t_avx2(const SimpleResize *simple,
                                                           -2 * pel,
                                                           -1 * pel,
                                                           0 * pel);
-    __m256i initial_horizontal_maximum = _mm256_set_epi16((simple->limit_width - 15) * pel - 1,
-                                                          (simple->limit_width - 14) * pel - 1,
-                                                          (simple->limit_width - 13) * pel - 1,
-                                                          (simple->limit_width - 12) * pel - 1,
-                                                          (simple->limit_width - 11) * pel - 1,
-                                                          (simple->limit_width - 10) * pel - 1,
-                                                          (simple->limit_width - 9) * pel - 1,
-                                                          (simple->limit_width - 8) * pel - 1,
-                                                          (simple->limit_width - 7) * pel - 1,
+    __m256i initial_horizontal_maximum = _mm256_set_epi32((simple->limit_width - 7) * pel - 1,
                                                           (simple->limit_width - 6) * pel - 1,
                                                           (simple->limit_width - 5) * pel - 1,
                                                           (simple->limit_width - 4) * pel - 1,
@@ -212,16 +198,16 @@ void simpleResize_int16_t_avx2(const SimpleResize *simple,
             simpleResize_int16_t_horizontal_8px_avx2(simple, dstp, workp, x, minimum, maximum, horizontal_step);

         if (dst_width_avx2 < simple->dst_width) {
-            __m256i step_back = _mm256_set1_epi16((pixels_per_iteration - (simple->dst_width - dst_width_avx2)) * pel);
-            minimum = _mm256_add_epi16(minimum, step_back);
-            maximum = _mm256_add_epi16(maximum, step_back);
+            __m256i step_back = _mm256_set1_epi32((pixels_per_iteration - (simple->dst_width - dst_width_avx2)) * pel);
+            minimum = _mm256_add_epi32(minimum, step_back);
+            maximum = _mm256_add_epi32(maximum, step_back);
             simpleResize_int16_t_horizontal_8px_avx2(simple, dstp, workp, simple->dst_width - pixels_per_iteration, minimum, maximum, horizontal_step);
         }

         dstp += dst_stride;

-        minimum = _mm256_sub_epi16(minimum, vertical_step);
-        maximum = _mm256_sub_epi16(maximum, vertical_step);
+        minimum = _mm256_sub_epi32(minimum, vertical_step);
+        maximum = _mm256_sub_epi32(maximum, vertical_step);
     }

     free(workp);

HolyWu commented 4 years ago

Something weird here. The right border issue is not reproducible now after sekrit-twc's AVX2 optimizations being merged, no matter your patch is applied or not.

dubhater commented 4 years ago

The meaning of the opt parameter changed. There are three possible values now:

0 enables only scalar functions.
1 enables up to SSE2 functions.
2 enables up to AVX2 functions. Or it enables all optimised functions. I'm not quite sure.

False evaluates to 0. True evaluates to 1.

HolyWu commented 4 years ago

Okay, I found the culprit. Now the opt parameter is intended to be used as integer rather than boolean, but all the initialization of d.opt still use !! which perfectly turn all integers into 0 and 1. After fixing that, your patch does prove to work.

dubhater commented 4 years ago

Thanks for testing. I'll fix that too, soon.

Boulder08 commented 4 years ago

Could we have a new release with the AVX2 fixes, please? I think many of us MVTools users would appreciate the increase in performance :)

dubhater commented 4 years ago

It might happen faster if you keep asking about it every few days. It's very easy to forget about it.

Boulder08 commented 4 years ago

I'm lousy at harassing people, especially if they don't get paid for the work. That's why I never made into a good project manager, I'm much more like a project mangler I think..

I'll try to remind you at some point now that you've given me permission to do so.

dubhater commented 4 years ago

Fixed in v22, I guess.

dubhater / vapoursynth-mvtools

FlowInter: Artifacts at left borders when opt=True #38