Open heckflosse opened 5 years ago
@heckflosse Good idea. :+1: vminf() and vmaxf() should also be well documented for the NaN case; I don't know whether that was already done.
- vfloat clampedIndexes = vmaxf(ZEROV, vminf(maxsv, indexv));
+ vfloat clampedIndexes = vmaxf(vminf(indexv, maxsv), ZEROV); // this automagically uses maxsv in case indexv is NaN
The real trick is changing indexv and maxsv here, isn't it? The other change is just a safety net in case maxsv is also NaN.
Best, Flössie
@Floessie Yes, though we could also keep indexv and maxsv as they are and make just the other change, in case we want to get element 0 instead of element maxs for NaN indices. Should we access lut[0] or lut[maxs] in case of NaN indices?
lut[0] seems more "default" to me.
Ok, I will go for lut[0] then, and also document the NaN case for vminf and vmaxf.
@Floessie here's the new patch:
diff --git a/rtengine/LUT.h b/rtengine/LUT.h
index d2f758689..fedc20ca2 100644
--- a/rtengine/LUT.h
+++ b/rtengine/LUT.h
@@ -320,7 +320,7 @@ public:
// Clamp and convert to integer values. Extract out of SSE register because all
// lookup operations use regular addresses.
- vfloat clampedIndexes = vmaxf(ZEROV, vminf(maxsv, indexv));
+ vfloat clampedIndexes = vmaxf(vminf(maxsv, indexv), ZEROV); // this automagically uses ZEROV in case indexv is NaN
vint indexes = _mm_cvttps_epi32(clampedIndexes);
int indexArray[4];
_mm_storeu_si128(reinterpret_cast<__m128i*>(&indexArray[0]), indexes);
@@ -352,7 +352,7 @@ public:
// Clamp and convert to integer values. Extract out of SSE register because all
// lookup operations use regular addresses.
- vfloat clampedIndexes = vmaxf(ZEROV, vminf(maxsv, indexv));
+ vfloat clampedIndexes = vmaxf(vminf(maxsv, indexv), ZEROV); // this automagically uses ZEROV in case indexv is NaN
vint indexes = _mm_cvttps_epi32(clampedIndexes);
int indexArray[4];
_mm_storeu_si128(reinterpret_cast<__m128i*>(&indexArray[0]), indexes);
@@ -372,7 +372,7 @@ public:
vfloat lower = _mm_castsi128_ps(_mm_unpacklo_epi64(temp0, temp1));
vfloat upper = _mm_castsi128_ps(_mm_unpackhi_epi64(temp0, temp1));
- vfloat diff = vmaxf(ZEROV, vminf(sizev, indexv)) - _mm_cvtepi32_ps(indexes);
+ vfloat diff = vmaxf(vminf(sizev, indexv), ZEROV) - _mm_cvtepi32_ps(indexes); // this automagically uses ZEROV in case indexv is NaN
return vintpf(diff, upper, lower);
}
@@ -383,7 +383,7 @@ public:
// Clamp and convert to integer values. Extract out of SSE register because all
// lookup operations use regular addresses.
- vfloat clampedIndexes = vmaxf(ZEROV, vminf(maxsv, indexv));
+ vfloat clampedIndexes = vmaxf(vminf(maxsv, indexv), ZEROV); // this automagically uses ZEROV in case indexv is NaN
vint indexes = _mm_cvttps_epi32(clampedIndexes);
int indexArray[4];
_mm_storeu_si128(reinterpret_cast<__m128i*>(&indexArray[0]), indexes);
@@ -420,7 +420,8 @@ public:
template<typename U = T, typename = typename std::enable_if<std::is_same<U, float>::value>::type>
vfloat operator[](vint idxv) const
{
- vfloat tempv = vmaxf(ZEROV, vminf(sizev, _mm_cvtepi32_ps(idxv))); // convert to float because SSE2 has no min/max for 32bit integers
+ // convert to float because SSE2 has no min/max for 32bit integers
+ vfloat tempv = vmaxf(vminf(sizev, _mm_cvtepi32_ps(idxv)), ZEROV); // this automagically uses ZEROV in case idxv is NaN (which will never happen because it is a vector of int)
idxv = _mm_cvttps_epi32(tempv);
// access the LUT 4 times. Trust the compiler. It generates good code here, better than hand written SSE code
return _mm_setr_ps(data[_mm_cvtsi128_si32(idxv)],
diff --git a/rtengine/helpersse2.h b/rtengine/helpersse2.h
index 46af3aa89..74780cf48 100644
--- a/rtengine/helpersse2.h
+++ b/rtengine/helpersse2.h
@@ -157,10 +157,14 @@ static INLINE vfloat vsqrtf(vfloat x)
}
static INLINE vfloat vmaxf(vfloat x, vfloat y)
{
+ // _mm_max_ps(x, y) returns y if x is NaN
+ // don't change the order of the parameters
return _mm_max_ps(x, y);
}
static INLINE vfloat vminf(vfloat x, vfloat y)
{
+ // _mm_min_ps(x, y) returns y if x is NaN
+ // don't change the order of the parameters
return _mm_min_ps(x, y);
}
@heckflosse Doesn't the vmaxf(vminf(...), ...) call for a vclampf()?
@Floessie there is no vclampf() atm. Let's complete this one first. Then I will look for all vectorized clamp cases in a different issue (or at least with a different patch)
@heckflosse

"there is no vclampf() atm"

That's what I meant. :)

"Let's complete this one first. Then I will look for all vectorized clamp cases in a different issue (or at least with a different patch)"

:+1:
@Floessie Now we have the situation that vminf(a, b) returns b if a is NaN, while in rt_math.h

template<typename T>
constexpr const T& min(const T& a, const T& b)
{
    return b < a ? b : a;
}

returns a if a is NaN.
Shall I make it consistent?
@Floessie Here's the vclampf patch:
diff --git a/rtengine/CA_correct_RT.cc b/rtengine/CA_correct_RT.cc
index 22ad77e63..2fa589110 100644
--- a/rtengine/CA_correct_RT.cc
+++ b/rtengine/CA_correct_RT.cc
@@ -1252,7 +1252,7 @@ float* RawImageSource::CA_correct_RT(
vfloat factors = oldvals / newvals;
factors = vself(vmaskf_le(newvals, onev), onev, factors);
factors = vself(vmaskf_le(oldvals, onev), onev, factors);
- STVFU((*nonGreen)[i/2][j/2], LIMV(factors, zd5v, twov));
+ STVFU((*nonGreen)[i/2][j/2], vclampf(factors, zd5v, twov));
}
#endif
for (; j < W - 2 * cb; j += 2) {
diff --git a/rtengine/LUT.h b/rtengine/LUT.h
index fedc20ca2..de668cca8 100644
--- a/rtengine/LUT.h
+++ b/rtengine/LUT.h
@@ -320,7 +320,7 @@ public:
// Clamp and convert to integer values. Extract out of SSE register because all
// lookup operations use regular addresses.
- vfloat clampedIndexes = vmaxf(vminf(maxsv, indexv), ZEROV); // this automagically uses ZEROV in case indexv is NaN
+ vfloat clampedIndexes = vclampf(indexv, ZEROV, maxsv); // this automagically uses ZEROV in case indexv is NaN
vint indexes = _mm_cvttps_epi32(clampedIndexes);
int indexArray[4];
_mm_storeu_si128(reinterpret_cast<__m128i*>(&indexArray[0]), indexes);
@@ -352,7 +352,7 @@ public:
// Clamp and convert to integer values. Extract out of SSE register because all
// lookup operations use regular addresses.
- vfloat clampedIndexes = vmaxf(vminf(maxsv, indexv), ZEROV); // this automagically uses ZEROV in case indexv is NaN
+ vfloat clampedIndexes = vclampf(indexv, ZEROV, maxsv); // this automagically uses ZEROV in case indexv is NaN
vint indexes = _mm_cvttps_epi32(clampedIndexes);
int indexArray[4];
_mm_storeu_si128(reinterpret_cast<__m128i*>(&indexArray[0]), indexes);
@@ -372,7 +372,7 @@ public:
vfloat lower = _mm_castsi128_ps(_mm_unpacklo_epi64(temp0, temp1));
vfloat upper = _mm_castsi128_ps(_mm_unpackhi_epi64(temp0, temp1));
- vfloat diff = vmaxf(vminf(sizev, indexv), ZEROV) - _mm_cvtepi32_ps(indexes); // this automagically uses ZEROV in case indexv is NaN
+ vfloat diff = vclampf(indexv, ZEROV, sizev) - _mm_cvtepi32_ps(indexes); // this automagically uses ZEROV in case indexv is NaN
return vintpf(diff, upper, lower);
}
@@ -383,7 +383,7 @@ public:
// Clamp and convert to integer values. Extract out of SSE register because all
// lookup operations use regular addresses.
- vfloat clampedIndexes = vmaxf(vminf(maxsv, indexv), ZEROV); // this automagically uses ZEROV in case indexv is NaN
+ vfloat clampedIndexes = vclampf(indexv, ZEROV, maxsv); // this automagically uses ZEROV in case indexv is NaN
vint indexes = _mm_cvttps_epi32(clampedIndexes);
int indexArray[4];
_mm_storeu_si128(reinterpret_cast<__m128i*>(&indexArray[0]), indexes);
@@ -421,7 +421,7 @@ public:
vfloat operator[](vint idxv) const
{
// convert to float because SSE2 has no min/max for 32bit integers
- vfloat tempv = vmaxf(vminf(sizev, _mm_cvtepi32_ps(idxv)), ZEROV); // this automagically uses ZEROV in case idxv is NaN (which will never happen because it is a vector of int)
+ vfloat tempv = vclampf(_mm_cvtepi32_ps(idxv), ZEROV, sizev); // this automagically uses ZEROV in case idxv is NaN (which will never happen because it is a vector of int)
idxv = _mm_cvttps_epi32(tempv);
// access the LUT 4 times. Trust the compiler. It generates good code here, better than hand written SSE code
return _mm_setr_ps(data[_mm_cvtsi128_si32(idxv)],
diff --git a/rtengine/curves.h b/rtengine/curves.h
index 30fb01102..c66c19a27 100644
--- a/rtengine/curves.h
+++ b/rtengine/curves.h
@@ -1157,9 +1157,9 @@ inline void WeightedStdToneCurve::BatchApply(const size_t start, const size_t en
float tmpb[4] ALIGNED16;
for (; i + 3 < end; i += 4) {
- vfloat r_val = LIMV(LVF(r[i]), ZEROV, c65535v);
- vfloat g_val = LIMV(LVF(g[i]), ZEROV, c65535v);
- vfloat b_val = LIMV(LVF(b[i]), ZEROV, c65535v);
+ vfloat r_val = vclampf(LVF(r[i]), ZEROV, c65535v);
+ vfloat g_val = vclampf(LVF(g[i]), ZEROV, c65535v);
+ vfloat b_val = vclampf(LVF(b[i]), ZEROV, c65535v);
vfloat r1 = lutToneCurve[r_val];
vfloat g1 = Triangle(r_val, r1, g_val);
vfloat b1 = Triangle(r_val, r1, b_val);
@@ -1172,9 +1172,9 @@ inline void WeightedStdToneCurve::BatchApply(const size_t start, const size_t en
vfloat r3 = Triangle(b_val, b3, r_val);
vfloat g3 = Triangle(b_val, b3, g_val);
- STVF(tmpr[0], LIMV(r1 * zd5v + r2 * zd25v + r3 * zd25v, ZEROV, c65535v));
- STVF(tmpg[0], LIMV(g1 * zd25v + g2 * zd5v + g3 * zd25v, ZEROV, c65535v));
- STVF(tmpb[0], LIMV(b1 * zd25v + b2 * zd25v + b3 * zd5v, ZEROV, c65535v));
+ STVF(tmpr[0], vclampf(r1 * zd5v + r2 * zd25v + r3 * zd25v, ZEROV, c65535v));
+ STVF(tmpg[0], vclampf(g1 * zd25v + g2 * zd5v + g3 * zd25v, ZEROV, c65535v));
+ STVF(tmpb[0], vclampf(b1 * zd25v + b2 * zd25v + b3 * zd5v, ZEROV, c65535v));
for (int j = 0; j < 4; ++j) {
setUnlessOOG(r[i+j], g[i+j], b[i+j], tmpr[j], tmpg[j], tmpb[j]);
}
diff --git a/rtengine/demosaic_algos.cc b/rtengine/demosaic_algos.cc
index 10ef0dd0e..a324a7ca6 100644
--- a/rtengine/demosaic_algos.cc
+++ b/rtengine/demosaic_algos.cc
@@ -1399,7 +1399,7 @@ void RawImageSource::lmmse_interpolate_omp(int winw, int winh, array2D<float> &r
// Adapted to RawTherapee by Jacques Desmis 3/2013
// SSE version by Ingo Weyrich 5/2013
#ifdef __SSE2__
-#define CLIPV(a) LIMV(a,zerov,c65535v)
+#define CLIPV(a) vclampf(a,zerov,c65535v)
void RawImageSource::igv_interpolate(int winw, int winh)
{
static const float eps = 1e-5f, epssq = 1e-5f; //mod epssq -10f =>-5f Jacques 3/2013 to prevent artifact (divide by zero)
@@ -1513,10 +1513,10 @@ void RawImageSource::igv_interpolate(int winw, int winh)
//N,E,W,S Hamilton Adams Interpolation
// (48.f * 65535.f) = 3145680.f
tempv = c40v * LVFU(rgb[0][indx1]);
- nvv = LIMV(((c23v * LVFU(rgb[1][(indx - v1) >> 1]) + c23v * LVFU(rgb[1][(indx - v3) >> 1]) + LVFU(rgb[1][(indx - v5) >> 1]) + LVFU(rgb[1][(indx + v1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 - v1)]) - c8v * LVFU(rgb[0][(indx1 - v2)]))) / c3145680v, zerov, onev);
- evv = LIMV(((c23v * LVFU(rgb[1][(indx + h1) >> 1]) + c23v * LVFU(rgb[1][(indx + h3) >> 1]) + LVFU(rgb[1][(indx + h5) >> 1]) + LVFU(rgb[1][(indx - h1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 + h1)]) - c8v * LVFU(rgb[0][(indx1 + h2)]))) / c3145680v, zerov, onev);
- wvv = LIMV(((c23v * LVFU(rgb[1][(indx - h1) >> 1]) + c23v * LVFU(rgb[1][(indx - h3) >> 1]) + LVFU(rgb[1][(indx - h5) >> 1]) + LVFU(rgb[1][(indx + h1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 - h1)]) - c8v * LVFU(rgb[0][(indx1 - h2)]))) / c3145680v, zerov, onev);
- svv = LIMV(((c23v * LVFU(rgb[1][(indx + v1) >> 1]) + c23v * LVFU(rgb[1][(indx + v3) >> 1]) + LVFU(rgb[1][(indx + v5) >> 1]) + LVFU(rgb[1][(indx - v1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 + v1)]) - c8v * LVFU(rgb[0][(indx1 + v2)]))) / c3145680v, zerov, onev);
+ nvv = vclampf(((c23v * LVFU(rgb[1][(indx - v1) >> 1]) + c23v * LVFU(rgb[1][(indx - v3) >> 1]) + LVFU(rgb[1][(indx - v5) >> 1]) + LVFU(rgb[1][(indx + v1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 - v1)]) - c8v * LVFU(rgb[0][(indx1 - v2)]))) / c3145680v, zerov, onev);
+ evv = vclampf(((c23v * LVFU(rgb[1][(indx + h1) >> 1]) + c23v * LVFU(rgb[1][(indx + h3) >> 1]) + LVFU(rgb[1][(indx + h5) >> 1]) + LVFU(rgb[1][(indx - h1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 + h1)]) - c8v * LVFU(rgb[0][(indx1 + h2)]))) / c3145680v, zerov, onev);
+ wvv = vclampf(((c23v * LVFU(rgb[1][(indx - h1) >> 1]) + c23v * LVFU(rgb[1][(indx - h3) >> 1]) + LVFU(rgb[1][(indx - h5) >> 1]) + LVFU(rgb[1][(indx + h1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 - h1)]) - c8v * LVFU(rgb[0][(indx1 - h2)]))) / c3145680v, zerov, onev);
+ svv = vclampf(((c23v * LVFU(rgb[1][(indx + v1) >> 1]) + c23v * LVFU(rgb[1][(indx + v3) >> 1]) + LVFU(rgb[1][(indx + v5) >> 1]) + LVFU(rgb[1][(indx - v1) >> 1]) + tempv - c32v * LVFU(rgb[0][(indx1 + v1)]) - c8v * LVFU(rgb[0][(indx1 + v2)]))) / c3145680v, zerov, onev);
//Horizontal and vertical color differences
tempv = LVFU( rgb[0][indx1] ) / c65535v;
_mm_storeu_ps( &vdif[indx1], (sgv * nvv + ngv * svv) / (ngv + sgv) - tempv );
@@ -1561,9 +1561,9 @@ void RawImageSource::igv_interpolate(int winw, int winh)
for (col = 7 + (FC(row, 1) & 1), indx1 = (row * width + col) >> 1, d = FC(row, col) / 2; col < width - 14; col += 8, indx1 += 4) {
//H&V integrated gaussian vector over variance on color differences
//Mod Jacques 3/2013
- ngv = LIMV(epssqv + c78v * SQRV(LVFU(vdif[indx1])) + c69v * (SQRV(LVFU(vdif[indx1 - v1])) + SQRV(LVFU(vdif[indx1 + v1]))) + c51v * (SQRV(LVFU(vdif[indx1 - v2])) + SQRV(LVFU(vdif[indx1 + v2]))) + c21v * (SQRV(LVFU(vdif[indx1 - v3])) + SQRV(LVFU(vdif[indx1 + v3]))) - c6v * SQRV(LVFU(vdif[indx1 - v1]) + LVFU(vdif[indx1]) + LVFU(vdif[indx1 + v1]))
+ ngv = vclampf(epssqv + c78v * SQRV(LVFU(vdif[indx1])) + c69v * (SQRV(LVFU(vdif[indx1 - v1])) + SQRV(LVFU(vdif[indx1 + v1]))) + c51v * (SQRV(LVFU(vdif[indx1 - v2])) + SQRV(LVFU(vdif[indx1 + v2]))) + c21v * (SQRV(LVFU(vdif[indx1 - v3])) + SQRV(LVFU(vdif[indx1 + v3]))) - c6v * SQRV(LVFU(vdif[indx1 - v1]) + LVFU(vdif[indx1]) + LVFU(vdif[indx1 + v1]))
- c10v * (SQRV(LVFU(vdif[indx1 - v2]) + LVFU(vdif[indx1 - v1]) + LVFU(vdif[indx1])) + SQRV(LVFU(vdif[indx1]) + LVFU(vdif[indx1 + v1]) + LVFU(vdif[indx1 + v2]))) - c7v * (SQRV(LVFU(vdif[indx1 - v3]) + LVFU(vdif[indx1 - v2]) + LVFU(vdif[indx1 - v1])) + SQRV(LVFU(vdif[indx1 + v1]) + LVFU(vdif[indx1 + v2]) + LVFU(vdif[indx1 + v3]))), zerov, onev);
- egv = LIMV(epssqv + c78v * SQRV(LVFU(hdif[indx1])) + c69v * (SQRV(LVFU(hdif[indx1 - h1])) + SQRV(LVFU(hdif[indx1 + h1]))) + c51v * (SQRV(LVFU(hdif[indx1 - h2])) + SQRV(LVFU(hdif[indx1 + h2]))) + c21v * (SQRV(LVFU(hdif[indx1 - h3])) + SQRV(LVFU(hdif[indx1 + h3]))) - c6v * SQRV(LVFU(hdif[indx1 - h1]) + LVFU(hdif[indx1]) + LVFU(hdif[indx1 + h1]))
+ egv = vclampf(epssqv + c78v * SQRV(LVFU(hdif[indx1])) + c69v * (SQRV(LVFU(hdif[indx1 - h1])) + SQRV(LVFU(hdif[indx1 + h1]))) + c51v * (SQRV(LVFU(hdif[indx1 - h2])) + SQRV(LVFU(hdif[indx1 + h2]))) + c21v * (SQRV(LVFU(hdif[indx1 - h3])) + SQRV(LVFU(hdif[indx1 + h3]))) - c6v * SQRV(LVFU(hdif[indx1 - h1]) + LVFU(hdif[indx1]) + LVFU(hdif[indx1 + h1]))
- c10v * (SQRV(LVFU(hdif[indx1 - h2]) + LVFU(hdif[indx1 - h1]) + LVFU(hdif[indx1])) + SQRV(LVFU(hdif[indx1]) + LVFU(hdif[indx1 + h1]) + LVFU(hdif[indx1 + h2]))) - c7v * (SQRV(LVFU(hdif[indx1 - h3]) + LVFU(hdif[indx1 - h2]) + LVFU(hdif[indx1 - h1])) + SQRV(LVFU(hdif[indx1 + h1]) + LVFU(hdif[indx1 + h2]) + LVFU(hdif[indx1 + h3]))), zerov, onev);
//Limit chrominance using H/V neighbourhood
nvv = median(d725v * LVFU(vdif[indx1]) + d1375v * LVFU(vdif[indx1 - v1]) + d1375v * LVFU(vdif[indx1 + v1]), LVFU(vdif[indx1 - v1]), LVFU(vdif[indx1 + v1]));
@@ -2114,7 +2114,7 @@ void RawImageSource::nodemosaic(bool bw)
*/
#ifdef __SSE2__
-#define CLIPV(a) LIMV(a,ZEROV,c65535v)
+#define CLIPV(a) vclampf(a,ZEROV,c65535v)
#endif
void RawImageSource::refinement(int PassCount)
{
diff --git a/rtengine/iplabregions.cc b/rtengine/iplabregions.cc
index d5bbd6302..d2380494a 100644
--- a/rtengine/iplabregions.cc
+++ b/rtengine/iplabregions.cc
@@ -204,9 +204,9 @@ BENCHFUN
for (int i = 0; i < n; ++i) {
vfloat blendv = LVFU(abmask[i][y][x]);
vfloat sv = F2V(rs[i]);
- vfloat a_newv = LIMV(sv * (av + F2V(abca[i])), cm42000v, c42000v);
- vfloat b_newv = LIMV(sv * (bv + F2V(abcb[i])), cm42000v, c42000v);
- vfloat l_newv = LIMV(lv * F2V(rl[i]), ZEROV, c32768v);
+ vfloat a_newv = vclampf(sv * (av + F2V(abca[i])), cm42000v, c42000v);
+ vfloat b_newv = vclampf(sv * (bv + F2V(abcb[i])), cm42000v, c42000v);
+ vfloat l_newv = vclampf(lv * F2V(rl[i]), ZEROV, c32768v);
lv = vintpf(LVFU(Lmask[i][y][x]), l_newv, lv);
av = vintpf(blendv, a_newv, av);
bv = vintpf(blendv, b_newv, bv);
diff --git a/rtengine/ipretinex.cc b/rtengine/ipretinex.cc
index e79ee52a3..fa961fbbb 100644
--- a/rtengine/ipretinex.cc
+++ b/rtengine/ipretinex.cc
@@ -504,11 +504,11 @@ void RawImageSource::MSR(float** luminance, float** originalLuminance, float **e
if(useHslLin) {
for (; j < W_L - 3; j += 4) {
- _mm_storeu_ps(&luminance[i][j], LVFU(luminance[i][j]) + pondv * (LIMV(LVFU(src[i][j]) / LVFU(out[i][j]), limMinv, limMaxv) ));
+ _mm_storeu_ps(&luminance[i][j], LVFU(luminance[i][j]) + pondv * (vclampf(LVFU(src[i][j]) / LVFU(out[i][j]), limMinv, limMaxv) ));
}
} else {
for (; j < W_L - 3; j += 4) {
- _mm_storeu_ps(&luminance[i][j], LVFU(luminance[i][j]) + pondv * xlogf(LIMV(LVFU(src[i][j]) / LVFU(out[i][j]), limMinv, limMaxv) ));
+ _mm_storeu_ps(&luminance[i][j], LVFU(luminance[i][j]) + pondv * xlogf(vclampf(LVFU(src[i][j]) / LVFU(out[i][j]), limMinv, limMaxv) ));
}
}
diff --git a/rtengine/sleefsseavx.c b/rtengine/sleefsseavx.c
index c68f11ae0..83d937bd1 100644
--- a/rtengine/sleefsseavx.c
+++ b/rtengine/sleefsseavx.c
@@ -1368,8 +1368,9 @@ static INLINE vfloat xcbrtf(vfloat d) {
return y;
}
-static INLINE vfloat LIMV( vfloat a, vfloat b, vfloat c ) {
-return vmaxf( b, vminf(a,c));
+static INLINE vfloat vclampf(vfloat value, vfloat low, vfloat high) {
+ // clamps value in [low;high], returns low if value is NaN
+ return vmaxf(vminf(high, value), low);
}
static INLINE vfloat SQRV(vfloat a){
@heckflosse

"Shall I make it consistent?"
That would be beneficial. But please remember that we implemented rtengine::min() like this for a reason (better code by the compiler, on par with std::min()), so rtengine::min() shouldn't be modified.
As always, excellent work, Ingo! :+1:
@Floessie Hmm, rtengine::min could also be

template<typename T>
constexpr const T& min(const T& a, const T& b)
{
    return a < b ? a : b;
}

which should give the same performance as the current

template<typename T>
constexpr const T& min(const T& a, const T& b)
{
    return b < a ? b : a;
}

but would be consistent with vminf. Same for rtengine::max.
AFAIK, the a < b ? a : b variant was the initial implementation, and we changed it for the better after studying the assembly.
I will delay that for after 5.5 release. For now I will commit vclampf and close the issue. Ok?
Honestly? I'm a layman when it comes to optimization. But does the order really matter for efficiency?
@Thanatomanic I don't think the order matters for efficiency in this case. But it matters for the result, which differs between the two implementations in case a is NaN.
@heckflosse @Thanatomanic Hm, I only find #3742. I'm pretty sure I've checked the assembly and GCC optimizes for the b < a case, but I can re-check it tomorrow.
@Floessie Maybe you checked the >= case?
https://godbolt.org/ will tell us ;-)
Somehow the screenshot of the SSE2-only version got lost.
So, one more reason to prefer

template<typename T>
constexpr const T& min(const T& a, const T& b)
{
    return a < b ? a : b;
}

over

template<typename T>
constexpr const T& min(const T& a, const T& b)
{
    return b < a ? b : a;
}
@heckflosse Yeah, you're right. My mind plays tricks on me. I must have mixed it up with something different.
Still interesting, that the STL implements it like we do currently...
@Floessie I guess the STL follows the rule: if one argument is NaN, then the result will be NaN.
@heckflosse .. which to me makes sense. If you're comparing to a non-number, still getting a number out is a bit silly.
@Thanatomanic yes, but that's what _mm_min_ps and _mm_max_ps do. It's just a matter of the order of the arguments. For that reason I want to make it consistent, so we get the same behaviour in the SSE code and in the scalar code which processes the remaining pixels if the input length % 4 != 0.
@Thanatomanic Well, whether that's silly is a different topic ;-)
@heckflosse Found it: The reason is, the standard says "Returns the first argument when the arguments are equivalent".
@Floessie Isn't that true also for the changed implementation (because the first argument is equal to the second argument), or am I missing something?
If a == b, it takes the false branch, so b is returned.
@Thanatomanic The STL does not do what you describe. Try this:
float nanvalue = 0.f / 0.f;
std::cout << "std::min(nanvalue, 0.f) : " << std::min(nanvalue, 0.f) << std::endl;
std::cout << "std::min(0.f, nanvalue) : " << std::min(0.f, nanvalue) << std::endl;
Output on my machine:
std::min(nanvalue, 0.f) : nan
std::min(0.f, nanvalue) : 0
Ok, let's focus on the desired behaviour. For consistency reasons, I'm for returning the value of b in rtengine::min/max(a, b) if a is NaN.
In the STL, std::min/max with one or two arguments being NaN is undefined behaviour. I would like rtengine::min/max to have defined behaviour.
On machines with SSE2 that can be done simply like this:
#ifdef __SSE2__
float rtengine::min(float a, float b) {
    return _mm_cvtss_f32(_mm_min_ss(_mm_set_ss(a), _mm_set_ss(b)));
}
#else
// something different for non-SSE2
#endif
I strongly feel for this reasoning regarding NaN: https://stackoverflow.com/questions/1565164/what-is-the-rationale-for-all-comparisons-returning-false-for-ieee754-nan-values/1573715
Then again, I have no clue where and how often a NaN comparison might occur in RT code.
There are only a few NaN comparisons in RT code. But we often ran into NaN problems, especially when accessing lookup tables using NaNs. There has been more than one request to avoid NaN access in lookup tables (rtengine/LUT.h), and I always voted against solving this in LUT.h for two reasons:
1) degrading the performance of the LUT.h access methods (not an issue with the current change)
2) crashing on LUT NaN access leads to earlier fixes (still valid)
@heckflosse How about this?
diff --git a/rtengine/rt_math.h b/rtengine/rt_math.h
index 9342f5430..b21d26564 100644
--- a/rtengine/rt_math.h
+++ b/rtengine/rt_math.h
@@ -56,6 +56,13 @@ constexpr const T& min(const T& a, const T& b)
return b < a ? b : a;
}
+template<>
+constexpr const float& min(const float& a, const float& b)
+{
+ // For consistency reasons we return b instead of a if a == b or a is NaN
+ return a < b ? a : b;
+}
+
template<typename T, typename... ARGS>
constexpr const T& min(const T& a, const T& b, const ARGS&... args)
{
@@ -74,6 +81,13 @@ constexpr const T& max(const T& a, const T& b)
return a < b ? b : a;
}
+template<>
+constexpr const float& max(const float& a, const float& b)
+{
+ // For consistency reasons we return b instead of a if a == b or a is NaN
+ return b < a ? a : b;
+}
+
template<typename T, typename... ARGS>
constexpr const T& max(const T& a, const T& b, const ARGS&... args)
{
I tried to integrate your SSE2 code, but that would mean losing constexpr and even the const T& return types on all the templates. :unamused:
@Floessie Looks good. :+1: Though I think we should not do that for 5.5, because it's a bit risky.
Assume, for example, cases of max(0.f, some_variable), where the current implementation returns 0.f if some_variable is NaN while the new version would return NaN. We have to inspect all those cases before we change this.
The SSE intrinsics _mm_min_ps(a, b) and _mm_max_ps(a, b), which are used in vminf(a, b) and vmaxf(a, b), return the second argument (b) if one of the arguments is NaN.
We can use this fact to make, for example, the vectorized LUT access methods a bit more robust without impact on performance:
@Floessie Thoughts?