Beep6581 / RawTherapee

A powerful cross-platform raw photo processing program
https://rawtherapee.com
GNU General Public License v3.0

SSE: Review usage of vminf, vmaxf functions #4942

Open heckflosse opened 5 years ago

heckflosse commented 5 years ago

The SSE intrinsics _mm_min_ps(a, b) and _mm_max_ps(a, b) which are used in vminf(a, b) and vmaxf(a, b) return the second argument (b) if one of the arguments is a NaN.
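
For illustration only (not part of any patch), a tiny standalone check of that asymmetry:

#include <emmintrin.h>
#include <cmath>
#include <cstdio>

int main()
{
    const __m128 nanv = _mm_set1_ps(std::nanf(""));
    const __m128 onev = _mm_set1_ps(1.f);

    // MINPS: if either operand is NaN, the second operand is returned unchanged.
    const float r1 = _mm_cvtss_f32(_mm_min_ps(nanv, onev)); // 1   (second argument)
    const float r2 = _mm_cvtss_f32(_mm_min_ps(onev, nanv)); // nan (second argument)
    std::printf("%g %g\n", r1, r2);
}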

We can use this fact to make, for example, the vectorized LUT access methods a bit more robust without impacting performance:

diff --git a/rtengine/LUT.h b/rtengine/LUT.h
index d2f758689..9641f0e9e 100644
--- a/rtengine/LUT.h
+++ b/rtengine/LUT.h
@@ -320,7 +320,7 @@ public:

         // Clamp and convert to integer values. Extract out of SSE register because all
         // lookup operations use regular addresses.
-        vfloat clampedIndexes = vmaxf(ZEROV, vminf(maxsv, indexv));
+        vfloat clampedIndexes = vmaxf(vminf(indexv, maxsv), ZEROV); // this automagically uses maxsv in case indexv is a nan
         vint indexes = _mm_cvttps_epi32(clampedIndexes);
         int indexArray[4];
         _mm_storeu_si128(reinterpret_cast<__m128i*>(&indexArray[0]), indexes);
@@ -352,7 +352,7 @@ public:

         // Clamp and convert to integer values. Extract out of SSE register because all
         // lookup operations use regular addresses.
-        vfloat clampedIndexes = vmaxf(ZEROV, vminf(maxsv, indexv));
+        vfloat clampedIndexes = vmaxf(vminf(indexv, maxsv), ZEROV); // this automagically uses maxsv in case indexv is a nan
         vint indexes = _mm_cvttps_epi32(clampedIndexes);
         int indexArray[4];
         _mm_storeu_si128(reinterpret_cast<__m128i*>(&indexArray[0]), indexes);
@@ -372,7 +372,7 @@ public:
         vfloat lower = _mm_castsi128_ps(_mm_unpacklo_epi64(temp0, temp1));
         vfloat upper = _mm_castsi128_ps(_mm_unpackhi_epi64(temp0, temp1));

-        vfloat diff = vmaxf(ZEROV, vminf(sizev, indexv)) - _mm_cvtepi32_ps(indexes);
+        vfloat diff = vmaxf(vminf(indexv, sizev), ZEROV) - _mm_cvtepi32_ps(indexes); // this automagically uses sizev in case indexv is a nan
         return vintpf(diff, upper, lower);
     }

@@ -383,7 +383,7 @@ public:

         // Clamp and convert to integer values. Extract out of SSE register because all
         // lookup operations use regular addresses.
-        vfloat clampedIndexes = vmaxf(ZEROV, vminf(maxsv, indexv));
+        vfloat clampedIndexes = vmaxf(vminf(indexv, maxsv), ZEROV); // this automagically uses maxsv in case indexv is a nan
         vint indexes = _mm_cvttps_epi32(clampedIndexes);
         int indexArray[4];
         _mm_storeu_si128(reinterpret_cast<__m128i*>(&indexArray[0]), indexes);
@@ -420,7 +420,8 @@ public:
     template<typename U = T, typename = typename std::enable_if<std::is_same<U, float>::value>::type>
     vfloat operator[](vint idxv) const
     {
-        vfloat tempv = vmaxf(ZEROV, vminf(sizev, _mm_cvtepi32_ps(idxv))); // convert to float because SSE2 has no min/max for 32bit integers
+        // convert to float because SSE2 has no min/max for 32bit integers
+        vfloat tempv = vmaxf(vminf(_mm_cvtepi32_ps(idxv), sizev), ZEROV); // this automagically uses sizev in case indexv is a nan
         idxv = _mm_cvttps_epi32(tempv);
         // access the LUT 4 times. Trust the compiler. It generates good code here, better than hand written SSE code
         return _mm_setr_ps(data[_mm_cvtsi128_si32(idxv)],

@Floessie Thoughts?

Floessie commented 5 years ago

@heckflosse Good idea. :+1: vminf() and vmaxf() should also be well documented for the NaN case; I don't know if that was already done.

-        vfloat clampedIndexes = vmaxf(ZEROV, vminf(maxsv, indexv));
+        vfloat clampedIndexes = vmaxf(vminf(indexv, maxsv), ZEROV); // this automagically uses maxsv in case indexv is a nan

The real trick is changing indexv and maxsv here, isn't it? The other change is just a safety net if maxsv is also NaN.
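
Traced through (just an illustration of the quoted line):

    // vminf(indexv, maxsv): if indexv is NaN, _mm_min_ps returns its second argument, i.e. maxsv
    // vmaxf(<that>, ZEROV): only sees a NaN if maxsv itself was NaN; in that case it returns ZEROV (the safety net)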

Best, Flössie

heckflosse commented 5 years ago

@Floessie Yes, though we could also keep the order of indexv and maxsv and do just the other change, in case we want to get element 0 instead of element maxs for NaN indices. Should we access lut[0] or lut[maxs] in case of NaN indices?

Floessie commented 5 years ago

lut[0] seems more "default" to me.

heckflosse commented 5 years ago

OK, I will go for lut[0] then and also document the NaN case for vminf and vmaxf.

heckflosse commented 5 years ago

@Floessie here's the new patch:

diff --git a/rtengine/LUT.h b/rtengine/LUT.h
index d2f758689..fedc20ca2 100644
--- a/rtengine/LUT.h
+++ b/rtengine/LUT.h
@@ -320,7 +320,7 @@ public:

         // Clamp and convert to integer values. Extract out of SSE register because all
         // lookup operations use regular addresses.
-        vfloat clampedIndexes = vmaxf(ZEROV, vminf(maxsv, indexv));
+        vfloat clampedIndexes = vmaxf(vminf(maxsv, indexv), ZEROV); // this automagically uses ZEROV in case indexv is NaN
         vint indexes = _mm_cvttps_epi32(clampedIndexes);
         int indexArray[4];
         _mm_storeu_si128(reinterpret_cast<__m128i*>(&indexArray[0]), indexes);
@@ -352,7 +352,7 @@ public:

         // Clamp and convert to integer values. Extract out of SSE register because all
         // lookup operations use regular addresses.
-        vfloat clampedIndexes = vmaxf(ZEROV, vminf(maxsv, indexv));
+        vfloat clampedIndexes = vmaxf(vminf(maxsv, indexv), ZEROV); // this automagically uses ZEROV in case indexv is NaN
         vint indexes = _mm_cvttps_epi32(clampedIndexes);
         int indexArray[4];
         _mm_storeu_si128(reinterpret_cast<__m128i*>(&indexArray[0]), indexes);
@@ -372,7 +372,7 @@ public:
         vfloat lower = _mm_castsi128_ps(_mm_unpacklo_epi64(temp0, temp1));
         vfloat upper = _mm_castsi128_ps(_mm_unpackhi_epi64(temp0, temp1));

-        vfloat diff = vmaxf(ZEROV, vminf(sizev, indexv)) - _mm_cvtepi32_ps(indexes);
+        vfloat diff = vmaxf(vminf(sizev, indexv), ZEROV) - _mm_cvtepi32_ps(indexes); // this automagically uses ZEROV in case indexv is NaN
         return vintpf(diff, upper, lower);
     }

@@ -383,7 +383,7 @@ public:

         // Clamp and convert to integer values. Extract out of SSE register because all
         // lookup operations use regular addresses.
-        vfloat clampedIndexes = vmaxf(ZEROV, vminf(maxsv, indexv));
+        vfloat clampedIndexes = vmaxf(vminf(maxsv, indexv), ZEROV); // this automagically uses ZEROV in case indexv is NaN
         vint indexes = _mm_cvttps_epi32(clampedIndexes);
         int indexArray[4];
         _mm_storeu_si128(reinterpret_cast<__m128i*>(&indexArray[0]), indexes);
@@ -420,7 +420,8 @@ public:
     template<typename U = T, typename = typename std::enable_if<std::is_same<U, float>::value>::type>
     vfloat operator[](vint idxv) const
     {
-        vfloat tempv = vmaxf(ZEROV, vminf(sizev, _mm_cvtepi32_ps(idxv))); // convert to float because SSE2 has no min/max for 32bit integers
+        // convert to float because SSE2 has no min/max for 32bit integers
+        vfloat tempv = vmaxf(vminf(sizev, _mm_cvtepi32_ps(idxv)), ZEROV); // this automagically uses ZEROV in case idxv is NaN (which will never happen because it is a vector of int)
         idxv = _mm_cvttps_epi32(tempv);
         // access the LUT 4 times. Trust the compiler. It generates good code here, better than hand written SSE code
         return _mm_setr_ps(data[_mm_cvtsi128_si32(idxv)],
diff --git a/rtengine/helpersse2.h b/rtengine/helpersse2.h
index 46af3aa89..74780cf48 100644
--- a/rtengine/helpersse2.h
+++ b/rtengine/helpersse2.h
@@ -157,10 +157,14 @@ static INLINE vfloat vsqrtf(vfloat x)
 }
 static INLINE vfloat vmaxf(vfloat x, vfloat y)
 {
+    // _mm_max_ps(x, y) returns y if x is NaN
+    // don't change the order of the parameters
     return _mm_max_ps(x, y);
 }
 static INLINE vfloat vminf(vfloat x, vfloat y)
 {
+    // _mm_min_ps(x, y) returns y if x is NaN
+    // don't change the order of the parameters
     return _mm_min_ps(x, y);
 }
Floessie commented 5 years ago

@heckflosse Doesn't the vmaxf(vminf(...), ...) call for a vclampf()?

heckflosse commented 5 years ago

@Floessie There is no vclampf() atm. Let's complete this one first. Then I will look for all vectorized clamp cases in a different issue (or at least with a different patch).

Floessie commented 5 years ago

@heckflosse

there is no vclampf() atm

That's what I meant. :)

Let's complete this one first. Then I will look for all vectorized clamp cases in a different issue (or at least with a different patch)

:+1:

heckflosse commented 5 years ago

@Floessie Now we have the situation that vminf(a, b) returns b if a is NaN, while in rt_math.h

template<typename T>
constexpr const T& min(const T& a, const T& b)
{
    return b < a ? b : a;
}

returns a if a is NaN.

Shall I make it consistent?

heckflosse commented 5 years ago

@Floessie Here's the vclampf patch

diff --git a/rtengine/CA_correct_RT.cc b/rtengine/CA_correct_RT.cc
index 22ad77e63..2fa589110 100644
--- a/rtengine/CA_correct_RT.cc
+++ b/rtengine/CA_correct_RT.cc
@@ -1252,7 +1252,7 @@ float* RawImageSource::CA_correct_RT(
                         vfloat factors = oldvals / newvals;
                         factors = vself(vmaskf_le(newvals, onev), onev, factors);
                         factors = vself(vmaskf_le(oldvals, onev), onev, factors);
-                        STVFU((*nonGreen)[i/2][j/2], LIMV(factors, zd5v, twov));
+                        STVFU((*nonGreen)[i/2][j/2], vclampf(factors, zd5v, twov));
                     }
 #endif
                     for (; j < W - 2 * cb; j += 2) {
diff --git a/rtengine/LUT.h b/rtengine/LUT.h
index fedc20ca2..de668cca8 100644
--- a/rtengine/LUT.h
+++ b/rtengine/LUT.h
@@ -320,7 +320,7 @@ public:

         // Clamp and convert to integer values. Extract out of SSE register because all
         // lookup operations use regular addresses.
-        vfloat clampedIndexes = vmaxf(vminf(maxsv, indexv), ZEROV); // this automagically uses ZEROV in case indexv is NaN
+        vfloat clampedIndexes = vclampf(indexv, ZEROV, maxsv); // this automagically uses ZEROV in case indexv is NaN
         vint indexes = _mm_cvttps_epi32(clampedIndexes);
         int indexArray[4];
         _mm_storeu_si128(reinterpret_cast<__m128i*>(&indexArray[0]), indexes);
@@ -352,7 +352,7 @@ public:

         // Clamp and convert to integer values. Extract out of SSE register because all
         // lookup operations use regular addresses.
-        vfloat clampedIndexes = vmaxf(vminf(maxsv, indexv), ZEROV); // this automagically uses ZEROV in case indexv is NaN
+        vfloat clampedIndexes = vclampf(indexv, ZEROV, maxsv); // this automagically uses ZEROV in case indexv is NaN
         vint indexes = _mm_cvttps_epi32(clampedIndexes);
         int indexArray[4];
         _mm_storeu_si128(reinterpret_cast<__m128i*>(&indexArray[0]), indexes);
@@ -372,7 +372,7 @@ public:
         vfloat lower = _mm_castsi128_ps(_mm_unpacklo_epi64(temp0, temp1));
         vfloat upper = _mm_castsi128_ps(_mm_unpackhi_epi64(temp0, temp1));

-        vfloat diff = vmaxf(vminf(sizev, indexv), ZEROV) - _mm_cvtepi32_ps(indexes); // this automagically uses ZEROV in case indexv is NaN
+        vfloat diff = vclampf(indexv, ZEROV, sizev) - _mm_cvtepi32_ps(indexes); // this automagically uses ZEROV in case indexv is NaN
         return vintpf(diff, upper, lower);
     }

@@ -383,7 +383,7 @@ public:

         // Clamp and convert to integer values. Extract out of SSE register because all
         // lookup operations use regular addresses.
-        vfloat clampedIndexes = vmaxf(vminf(maxsv, indexv), ZEROV); // this automagically uses ZEROV in case indexv is NaN
+        vfloat clampedIndexes = vclampf(indexv, ZEROV, maxsv); // this automagically uses ZEROV in case indexv is NaN
         vint indexes = _mm_cvttps_epi32(clampedIndexes);
         int indexArray[4];
         _mm_storeu_si128(reinterpret_cast<__m128i*>(&indexArray[0]), indexes);
@@ -421,7 +421,7 @@ public:
     vfloat operator[](vint idxv) const
     {
         // convert to float because SSE2 has no min/max for 32bit integers
-        vfloat tempv = vmaxf(vminf(sizev, _mm_cvtepi32_ps(idxv)), ZEROV); // this automagically uses ZEROV in case idxv is NaN (which will never happen because it is a vector of int)
+        vfloat tempv = vclampf(_mm_cvtepi32_ps(idxv), ZEROV, sizev); // this automagically uses ZEROV in case idxv is NaN (which will never happen because it is a vector of int)
         idxv = _mm_cvttps_epi32(tempv);
         // access the LUT 4 times. Trust the compiler. It generates good code here, better than hand written SSE code
         return _mm_setr_ps(data[_mm_cvtsi128_si32(idxv)],
diff --git a/rtengine/curves.h b/rtengine/curves.h
index 30fb01102..c66c19a27 100644
--- a/rtengine/curves.h
+++ b/rtengine/curves.h
@@ -1157,9 +1157,9 @@ inline void WeightedStdToneCurve::BatchApply(const size_t start, const size_t en
     float tmpb[4] ALIGNED16;

     for (; i + 3 < end; i += 4) {
-        vfloat r_val = LIMV(LVF(r[i]), ZEROV, c65535v);
-        vfloat g_val = LIMV(LVF(g[i]), ZEROV, c65535v);
-        vfloat b_val = LIMV(LVF(b[i]), ZEROV, c65535v);
+        vfloat r_val = vclampf(LVF(r[i]), ZEROV, c65535v);
+        vfloat g_val = vclampf(LVF(g[i]), ZEROV, c65535v);
+        vfloat b_val = vclampf(LVF(b[i]), ZEROV, c65535v);
         vfloat r1 = lutToneCurve[r_val];
         vfloat g1 = Triangle(r_val, r1, g_val);
         vfloat b1 = Triangle(r_val, r1, b_val);
@@ -1172,9 +1172,9 @@ inline void WeightedStdToneCurve::BatchApply(const size_t start, const size_t en
         vfloat r3 = Triangle(b_val, b3, r_val);
         vfloat g3 = Triangle(b_val, b3, g_val);

-        STVF(tmpr[0], LIMV(r1 * zd5v + r2 * zd25v + r3 * zd25v, ZEROV, c65535v));
-        STVF(tmpg[0], LIMV(g1 * zd25v + g2 * zd5v + g3 * zd25v, ZEROV, c65535v));
-        STVF(tmpb[0], LIMV(b1 * zd25v + b2 * zd25v + b3 * zd5v, ZEROV, c65535v));
+        STVF(tmpr[0], vclampf(r1 * zd5v + r2 * zd25v + r3 * zd25v, ZEROV, c65535v));
+        STVF(tmpg[0], vclampf(g1 * zd25v + g2 * zd5v + g3 * zd25v, ZEROV, c65535v));
+        STVF(tmpb[0], vclampf(b1 * zd25v + b2 * zd25v + b3 * zd5v, ZEROV, c65535v));
         for (int j = 0; j < 4; ++j) {
             setUnlessOOG(r[i+j], g[i+j], b[i+j], tmpr[j], tmpg[j], tmpb[j]);
         }
diff --git a/rtengine/demosaic_algos.cc b/rtengine/demosaic_algos.cc
index 10ef0dd0e..a324a7ca6 100644
--- a/rtengine/demosaic_algos.cc
+++ b/rtengine/demosaic_algos.cc
@@ -1399,7 +1399,7 @@ void RawImageSource::lmmse_interpolate_omp(int winw, int winh, array2D<float> &r
 // Adapted to RawTherapee by Jacques Desmis 3/2013
 // SSE version by Ingo Weyrich 5/2013
 #ifdef __SSE2__
-#define CLIPV(a) LIMV(a,zerov,c65535v)
+#define CLIPV(a) vclampf(a,zerov,c65535v)
 void RawImageSource::igv_interpolate(int winw, int winh)
 {
     static const float eps = 1e-5f, epssq = 1e-5f; //mod epssq -10f =>-5f Jacques 3/2013 to prevent artifact (divide by zero)
@@ -1513,10 +1513,10 @@ void RawImageSource::igv_interpolate(int winw, int winh)
                 //N,E,W,S Hamilton Adams Interpolation
                 // (48.f * 65535.f) = 3145680.f
                 tempv = c40v * LVFU(rgb[0][indx1]);
-                nvv = LIMV(((c23v * LVFU(rgb[1][(indx - v1) >> 1]) + c23v * LVFU(rgb[1][(indx - v3) >> 1]) + LVFU(rgb[1][(indx - v5) >> 1]) + LVFU(rgb[1][(indx + v1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 - v1)]) - c8v * LVFU(rgb[0][(indx1 - v2)]))) / c3145680v, zerov, onev);
-                evv = LIMV(((c23v * LVFU(rgb[1][(indx + h1) >> 1]) + c23v * LVFU(rgb[1][(indx + h3) >> 1]) + LVFU(rgb[1][(indx + h5) >> 1]) + LVFU(rgb[1][(indx - h1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 + h1)]) - c8v * LVFU(rgb[0][(indx1 + h2)]))) / c3145680v, zerov, onev);
-                wvv = LIMV(((c23v * LVFU(rgb[1][(indx - h1) >> 1]) + c23v * LVFU(rgb[1][(indx - h3) >> 1]) + LVFU(rgb[1][(indx - h5) >> 1]) + LVFU(rgb[1][(indx + h1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 - h1)]) - c8v * LVFU(rgb[0][(indx1 - h2)]))) / c3145680v, zerov, onev);
-                svv = LIMV(((c23v * LVFU(rgb[1][(indx + v1) >> 1]) + c23v * LVFU(rgb[1][(indx + v3) >> 1]) + LVFU(rgb[1][(indx + v5) >> 1]) + LVFU(rgb[1][(indx - v1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 + v1)]) - c8v * LVFU(rgb[0][(indx1 + v2)]))) / c3145680v, zerov, onev);
+                nvv = vclampf(((c23v * LVFU(rgb[1][(indx - v1) >> 1]) + c23v * LVFU(rgb[1][(indx - v3) >> 1]) + LVFU(rgb[1][(indx - v5) >> 1]) + LVFU(rgb[1][(indx + v1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 - v1)]) - c8v * LVFU(rgb[0][(indx1 - v2)]))) / c3145680v, zerov, onev);
+                evv = vclampf(((c23v * LVFU(rgb[1][(indx + h1) >> 1]) + c23v * LVFU(rgb[1][(indx + h3) >> 1]) + LVFU(rgb[1][(indx + h5) >> 1]) + LVFU(rgb[1][(indx - h1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 + h1)]) - c8v * LVFU(rgb[0][(indx1 + h2)]))) / c3145680v, zerov, onev);
+                wvv = vclampf(((c23v * LVFU(rgb[1][(indx - h1) >> 1]) + c23v * LVFU(rgb[1][(indx - h3) >> 1]) + LVFU(rgb[1][(indx - h5) >> 1]) + LVFU(rgb[1][(indx + h1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 - h1)]) - c8v * LVFU(rgb[0][(indx1 - h2)]))) / c3145680v, zerov, onev);
+                svv = vclampf(((c23v * LVFU(rgb[1][(indx + v1) >> 1]) + c23v * LVFU(rgb[1][(indx + v3) >> 1]) + LVFU(rgb[1][(indx + v5) >> 1]) + LVFU(rgb[1][(indx - v1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 + v1)]) - c8v * LVFU(rgb[0][(indx1 + v2)]))) / c3145680v, zerov, onev);
                 //Horizontal and vertical color differences
                 tempv = LVFU( rgb[0][indx1] ) / c65535v;
                 _mm_storeu_ps( &vdif[indx1], (sgv * nvv + ngv * svv) / (ngv + sgv) - tempv );
@@ -1561,9 +1561,9 @@ void RawImageSource::igv_interpolate(int winw, int winh)
             for (col = 7 + (FC(row, 1) & 1), indx1 = (row * width + col) >> 1, d = FC(row, col) / 2; col < width - 14; col += 8, indx1 += 4) {
                 //H&V integrated gaussian vector over variance on color differences
                 //Mod Jacques 3/2013
-                ngv = LIMV(epssqv + c78v * SQRV(LVFU(vdif[indx1])) + c69v * (SQRV(LVFU(vdif[indx1 - v1])) + SQRV(LVFU(vdif[indx1 + v1]))) + c51v * (SQRV(LVFU(vdif[indx1 - v2])) + SQRV(LVFU(vdif[indx1 + v2]))) + c21v * (SQRV(LVFU(vdif[indx1 - v3])) + SQRV(LVFU(vdif[indx1 + v3]))) - c6v * SQRV(LVFU(vdif[indx1 - v1]) + LVFU(vdif[indx1]) + LVFU(vdif[indx1 + v1]))
+                ngv = vclampf(epssqv + c78v * SQRV(LVFU(vdif[indx1])) + c69v * (SQRV(LVFU(vdif[indx1 - v1])) + SQRV(LVFU(vdif[indx1 + v1]))) + c51v * (SQRV(LVFU(vdif[indx1 - v2])) + SQRV(LVFU(vdif[indx1 + v2]))) + c21v * (SQRV(LVFU(vdif[indx1 - v3])) + SQRV(LVFU(vdif[indx1 + v3]))) - c6v * SQRV(LVFU(vdif[indx1 - v1]) + LVFU(vdif[indx1]) + LVFU(vdif[indx1 + v1]))
                            - c10v * (SQRV(LVFU(vdif[indx1 - v2]) + LVFU(vdif[indx1 - v1]) + LVFU(vdif[indx1])) + SQRV(LVFU(vdif[indx1]) + LVFU(vdif[indx1 + v1]) + LVFU(vdif[indx1 + v2]))) - c7v * (SQRV(LVFU(vdif[indx1 - v3]) + LVFU(vdif[indx1 - v2]) + LVFU(vdif[indx1 - v1])) + SQRV(LVFU(vdif[indx1 + v1]) + LVFU(vdif[indx1 + v2]) + LVFU(vdif[indx1 + v3]))), zerov, onev);
-                egv = LIMV(epssqv + c78v * SQRV(LVFU(hdif[indx1])) + c69v * (SQRV(LVFU(hdif[indx1 - h1])) + SQRV(LVFU(hdif[indx1 + h1]))) + c51v * (SQRV(LVFU(hdif[indx1 - h2])) + SQRV(LVFU(hdif[indx1 + h2]))) + c21v * (SQRV(LVFU(hdif[indx1 - h3])) + SQRV(LVFU(hdif[indx1 + h3]))) - c6v * SQRV(LVFU(hdif[indx1 - h1]) + LVFU(hdif[indx1]) + LVFU(hdif[indx1 + h1]))
+                egv = vclampf(epssqv + c78v * SQRV(LVFU(hdif[indx1])) + c69v * (SQRV(LVFU(hdif[indx1 - h1])) + SQRV(LVFU(hdif[indx1 + h1]))) + c51v * (SQRV(LVFU(hdif[indx1 - h2])) + SQRV(LVFU(hdif[indx1 + h2]))) + c21v * (SQRV(LVFU(hdif[indx1 - h3])) + SQRV(LVFU(hdif[indx1 + h3]))) - c6v * SQRV(LVFU(hdif[indx1 - h1]) + LVFU(hdif[indx1]) + LVFU(hdif[indx1 + h1]))
                            - c10v * (SQRV(LVFU(hdif[indx1 - h2]) + LVFU(hdif[indx1 - h1]) + LVFU(hdif[indx1])) + SQRV(LVFU(hdif[indx1]) + LVFU(hdif[indx1 + h1]) + LVFU(hdif[indx1 + h2]))) - c7v * (SQRV(LVFU(hdif[indx1 - h3]) + LVFU(hdif[indx1 - h2]) + LVFU(hdif[indx1 - h1])) + SQRV(LVFU(hdif[indx1 + h1]) + LVFU(hdif[indx1 + h2]) + LVFU(hdif[indx1 + h3]))), zerov, onev);
                 //Limit chrominance using H/V neighbourhood
                 nvv = median(d725v * LVFU(vdif[indx1]) + d1375v * LVFU(vdif[indx1 - v1]) + d1375v * LVFU(vdif[indx1 + v1]), LVFU(vdif[indx1 - v1]), LVFU(vdif[indx1 + v1]));
@@ -2114,7 +2114,7 @@ void RawImageSource::nodemosaic(bool bw)
 */

 #ifdef __SSE2__
-#define CLIPV(a) LIMV(a,ZEROV,c65535v)
+#define CLIPV(a) vclampf(a,ZEROV,c65535v)
 #endif
 void RawImageSource::refinement(int PassCount)
 {
diff --git a/rtengine/iplabregions.cc b/rtengine/iplabregions.cc
index d5bbd6302..d2380494a 100644
--- a/rtengine/iplabregions.cc
+++ b/rtengine/iplabregions.cc
@@ -204,9 +204,9 @@ BENCHFUN
                 for (int i = 0; i < n; ++i) {
                     vfloat blendv = LVFU(abmask[i][y][x]);
                     vfloat sv = F2V(rs[i]);
-                    vfloat a_newv = LIMV(sv * (av + F2V(abca[i])), cm42000v, c42000v);
-                    vfloat b_newv = LIMV(sv * (bv + F2V(abcb[i])), cm42000v, c42000v);
-                    vfloat l_newv = LIMV(lv * F2V(rl[i]), ZEROV, c32768v);
+                    vfloat a_newv = vclampf(sv * (av + F2V(abca[i])), cm42000v, c42000v);
+                    vfloat b_newv = vclampf(sv * (bv + F2V(abcb[i])), cm42000v, c42000v);
+                    vfloat l_newv = vclampf(lv * F2V(rl[i]), ZEROV, c32768v);
                     lv = vintpf(LVFU(Lmask[i][y][x]), l_newv, lv);
                     av = vintpf(blendv, a_newv, av);
                     bv = vintpf(blendv, b_newv, bv);
diff --git a/rtengine/ipretinex.cc b/rtengine/ipretinex.cc
index e79ee52a3..fa961fbbb 100644
--- a/rtengine/ipretinex.cc
+++ b/rtengine/ipretinex.cc
@@ -504,11 +504,11 @@ void RawImageSource::MSR(float** luminance, float** originalLuminance, float **e

                     if(useHslLin) {
                         for (; j < W_L - 3; j += 4) {
-                            _mm_storeu_ps(&luminance[i][j], LVFU(luminance[i][j]) + pondv *  (LIMV(LVFU(src[i][j]) / LVFU(out[i][j]), limMinv, limMaxv) ));
+                            _mm_storeu_ps(&luminance[i][j], LVFU(luminance[i][j]) + pondv *  (vclampf(LVFU(src[i][j]) / LVFU(out[i][j]), limMinv, limMaxv) ));
                         }
                     } else {
                         for (; j < W_L - 3; j += 4) {
-                            _mm_storeu_ps(&luminance[i][j], LVFU(luminance[i][j]) + pondv *  xlogf(LIMV(LVFU(src[i][j]) / LVFU(out[i][j]), limMinv, limMaxv) ));
+                            _mm_storeu_ps(&luminance[i][j], LVFU(luminance[i][j]) + pondv *  xlogf(vclampf(LVFU(src[i][j]) / LVFU(out[i][j]), limMinv, limMaxv) ));
                         }
                     }

diff --git a/rtengine/sleefsseavx.c b/rtengine/sleefsseavx.c
index c68f11ae0..83d937bd1 100644
--- a/rtengine/sleefsseavx.c
+++ b/rtengine/sleefsseavx.c
@@ -1368,8 +1368,9 @@ static INLINE vfloat xcbrtf(vfloat d) {
   return y;
 }

-static INLINE vfloat LIMV( vfloat a, vfloat b, vfloat c ) {
-return vmaxf( b, vminf(a,c));
+static INLINE vfloat vclampf(vfloat value, vfloat low, vfloat high) {
+    // clamps value in [low;high], returns low if value is NaN
+    return vmaxf(vminf(high, value), low);
 }

 static INLINE vfloat SQRV(vfloat a){
Floessie commented 5 years ago

@heckflosse

Shall I make it consistent?

That would be beneficial. But please remember that we implemented rtengine::min() like this for a reason (better code by the compiler, on par with std::min()), so rtengine::min() shouldn't be modified.

As always, excellent work, Ingo! :+1:

heckflosse commented 5 years ago

@Floessie hmm

rtengine::min could also be

template<typename T>
constexpr const T& min(const T& a, const T& b)
{
    return a < b ? a : b;
}

which should give the same performance as the current

template<typename T>
constexpr const T& min(const T& a, const T& b)
{
    return b < a ? b : a;
}

but would be consistent with vminf.

Same for rtengine::max

Floessie commented 5 years ago

AFAIK, the a < b ? a : b was the initial implementation and we changed it for the better after studying the assembly.

heckflosse commented 5 years ago

I will delay that until after the 5.5 release. For now I will commit vclampf and close the issue. OK?

Thanatomanic commented 5 years ago

Honestly? Layman optimization person here. But does the order really matter for efficiency?

heckflosse commented 5 years ago

@Thanatomanic I don't think the order should matter for efficiency in this case. But it matters for the result, which differs between the two implementations when a is NaN.
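
A small scalar sketch (not taken from the RT sources) that shows the divergence:

#include <cmath>
#include <cstdio>

// current rt_math.h ordering
template<typename T> constexpr T min_current(T a, T b) { return b < a ? b : a; }
// ordering consistent with vminf()/_mm_min_ps when the first argument is NaN
template<typename T> constexpr T min_swapped(T a, T b) { return a < b ? a : b; }

int main()
{
    const float nanvalue = std::nanf("");
    // every comparison with NaN is false, so each variant falls through to a different argument
    std::printf("%g %g\n", min_current(nanvalue, 0.f), min_swapped(nanvalue, 0.f)); // prints: nan 0
}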

Floessie commented 5 years ago

@heckflosse @Thanatomanic Hm, I only find #3742. I'm pretty sure I've checked the assembly and GCC optimizes for the b < a case, but I can re-check it tomorrow.

heckflosse commented 5 years ago

@Floessie Maybe you checked the >= case?

heckflosse commented 5 years ago

https://godbolt.org/ will tell us ;-)

heckflosse commented 5 years ago

[screenshot: godbolt output]

Somehow the screenshot of the SSE2-only version got lost: [screenshot]

heckflosse commented 5 years ago

So, one more reason to prefer

template<typename T>
constexpr const T& min(const T& a, const T& b)
{
    return a < b ? a : b;
}

over

template<typename T>
constexpr const T& min(const T& a, const T& b)
{
    return b < a ? b : a;
}
Floessie commented 5 years ago

@heckflosse Yeah, you're right. My mind plays tricks on me. I must have mixed it up with something different.

Floessie commented 5 years ago

Still interesting that the STL implements it the way we currently do...

heckflosse commented 5 years ago

@Floessie I guess the STL follows the rule: if one argument is NaN, then the result will be NaN.

Thanatomanic commented 5 years ago

@heckflosse .. which to me makes sense. If you're comparing to a non-number, still getting a number out is a bit silly.

heckflosse commented 5 years ago

@Thanatomanic Yes, but that's what _mm_min_ps and _mm_max_ps do. It's just a matter of the order of the arguments. For that reason I want to make it consistent, so we get the same behaviour in the SSE code and in the scalar code which processes the remaining pixels when the input length % 4 != 0.
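
A self-contained schematic (hypothetical clamp_buffer function, plain intrinsics instead of RT's wrappers) of where that inconsistency shows up:

#include <emmintrin.h>
#include <algorithm>

void clamp_buffer(float* dst, const float* src, int n, float maxval)
{
    int i = 0;
#ifdef __SSE2__
    const __m128 zerov = _mm_setzero_ps();
    const __m128 maxv = _mm_set1_ps(maxval);
    for (; i + 3 < n; i += 4) {
        // vclampf-style argument order: a NaN input pixel ends up as 0.f
        _mm_storeu_ps(&dst[i], _mm_max_ps(_mm_min_ps(maxv, _mm_loadu_ps(&src[i])), zerov));
    }
#endif
    for (; i < n; ++i) {
        // scalar tail for the remaining n % 4 pixels: with the comparison-based
        // min/max quoted above, a NaN input is not mapped to 0.f, so these pixels
        // behave differently from the vectorized ones
        dst[i] = std::max(std::min(src[i], maxval), 0.f);
    }
}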

heckflosse commented 5 years ago

@Thanatomanic Well, whether that's silly is a different topic ;-)

Floessie commented 5 years ago

@heckflosse Found it: The reason is, the standard says "Returns the first argument when the arguments are equivalent".

heckflosse commented 5 years ago

@Floessie Isn't that also true for the changed implementation (because the first argument is equal to the second argument), or am I missing something?

Floessie commented 5 years ago

If a == b it takes the false branch, so b is returned.

Floessie commented 5 years ago

It's N3242 25.4.7 remark 3 on page 887.

heckflosse commented 5 years ago

@Thanatomanic The STL does not do what you describe; try this:

    float nanvalue = 0.f / 0.f;
    std::cout << "std::min(nanvalue, 0.f) : " << std::min(nanvalue, 0.f) << std::endl;
    std::cout << "std::min(0.f, nanvalue) : " << std::min(0.f, nanvalue) << std::endl;

output on my machine:

std::min(nanvalue, 0.f) : nan
std::min(0.f, nanvalue) : 0
heckflosse commented 5 years ago

OK, let's focus on the desired behaviour. I'm for returning the value of b in rtengine::min/max(a, b) if a is NaN, for consistency reasons.

heckflosse commented 5 years ago

In the STL, std::min/max with one or two arguments being NaN is undefined behaviour. I would like to get defined behaviour for rtengine::min/max.

On machines with SSE that can simply be done like this:


#ifdef __SSE2__
    float rtengine::min(float a, float b) {
        return _mm_cvtss_f32(_mm_min_ss(_mm_set_ss(a), _mm_set_ss(b)));
    }
#else
    // something different for non SSE2
#endif
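
A compilable sketch of that idea (the free-function name and the fallback are placeholders, not the final RT code):

#include <emmintrin.h>

static inline float nan_safe_min(float a, float b)
{
#ifdef __SSE2__
    // MINSS returns its second operand if either operand is NaN, so a NaN 'a' yields 'b'
    return _mm_cvtss_f32(_mm_min_ss(_mm_set_ss(a), _mm_set_ss(b)));
#else
    // same behaviour without SSE2: any comparison with NaN is false, so a NaN 'a' falls through to 'b'
    return a < b ? a : b;
#endif
}
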
Thanatomanic commented 5 years ago

I strongly feel for this reasoning regarding NaN: https://stackoverflow.com/questions/1565164/what-is-the-rationale-for-all-comparisons-returning-false-for-ieee754-nan-values/1573715

Then again, I have no clue where and how often a NaN comparison might occur in RT code.

heckflosse commented 5 years ago

There are only a few NaN comparisons in RT code. But we often ran into NaN problems, especially when accessing lookup tables using NaNs. There has been more than one request to avoid NaN access in lookup tables (rtengine/LUT.h), and I always voted against solving this in LUT.h for two reasons:

1) degrading performance of the LUT.h access methods (that's not an issue with the current change)
2) crashing on LUT NaN access leads to earlier fixes (still valid)

Floessie commented 5 years ago

@heckflosse How about this?

diff --git a/rtengine/rt_math.h b/rtengine/rt_math.h
index 9342f5430..b21d26564 100644
--- a/rtengine/rt_math.h
+++ b/rtengine/rt_math.h
@@ -56,6 +56,13 @@ constexpr const T& min(const T& a, const T& b)
     return b < a ? b : a;
 }

+template<>
+constexpr const float& min(const float& a, const float& b)
+{
+    // For consistency reasons we return b instead of a if a == b or a == NaN
+    return a < b ? a : b;
+}
+
 template<typename T, typename... ARGS>
 constexpr const T& min(const T& a, const T& b, const ARGS&... args)
 {
@@ -74,6 +81,13 @@ constexpr const T& max(const T& a, const T& b)
     return a < b ? b : a;
 }

+template<>
+constexpr const float& max(const float& a, const float& b)
+{
+    // For consistency reasons we return b instead of a if a == b or a == NaN
+    return b < a ? a : b;
+}
+
 template<typename T, typename... ARGS>
 constexpr const T& max(const T& a, const T& b, const ARGS&... args)
 {

I tried to integrate your SSE2 code, but that would mean losing constexpr and even the const T& return types on all templates. :unamused:

heckflosse commented 5 years ago

@Floessie Looks good :+1: Though I think we should not do that for 5.5, because it's a bit risky.

Take, for example, cases of max(0.f, some_variable): the current implementation returns 0.f if some_variable is NaN, while the new version would return NaN. We have to inspect all such cases when we change this.
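
Worked through with the two variants above (illustration only):

    // current generic template:      max(0.f, NaN) -> a < b -> 0.f < NaN -> false -> returns a = 0.f
    // proposed float specialization: max(0.f, NaN) -> b < a -> NaN < 0.f -> false -> returns b = NaN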