Open cfriedt opened 4 years ago
Here is a potential candidate for speeding up compile-time const reflections. Of course, runtime reflections would be better suited to machine-specific instructions / inline assembly.
constexpr uint8_t bitrev4(const uint8_t& nibble) {
const uint8_t lut[] = {
/* [0b0000] = */ 0b0000,
/* [0b0001] = */ 0b1000,
/* [0b0010] = */ 0b0100,
/* [0b0011] = */ 0b1100,
/* [0b0100] = */ 0b0010,
/* [0b0101] = */ 0b1010,
/* [0b0110] = */ 0b0110,
/* [0b0111] = */ 0b1110,
/* [0b1000] = */ 0b0001,
/* [0b1001] = */ 0b1001,
/* [0b1010] = */ 0b0101,
/* [0b1011] = */ 0b1101,
/* [0b1100] = */ 0b0011,
/* [0b1101] = */ 0b1011,
/* [0b1110] = */ 0b0111,
/* [0b1111] = */ 0b1111,
};
return lut[nibble & 0xf];
}
constexpr uint8_t bitrev8(const uint8_t& byte) {
return
0
| (bitrev4(byte >> 0) << 4)
| (bitrev4(byte >> 4) << 0)
;
}
template <typename T>
constexpr T bitrev(const T& x)
{
constexpr size_t N = sizeof(x);
T y(0);
for(uint8_t k = 0; k < N; ++k) {
uint8_t a = x >> (k * 8);
uint8_t b = x >> (N - k - 1);
b = bitrev8(b);
a = bitrev8(a);
y |= T(b) << (k * 8);
y |= T(a) << (N - k - 1);
}
return y;
}
Ultimately we would like to avoid reflecting input at runtime.
There is a special case highlighted in #63 - where "reflected tables" are generated instead of specifying "reflect input" and "reflect output".
The generated table is entirely incompatible, but it eliminates any computationally expensive runtime reflection for that special case.
There is a pretty simple way to optimize
reflect()
in the case where N (the number of bits) is a multiple of 4 (or 8). We can use a LUT of length 16 (or 256) where each entry in the LUT is simply the bit-reversed entry of its index. This would mean that allreflect()
calls would beO( log( N ) )
operations rather thanO( N )
. Possibly for embedded systems we may want to use a LUT of length 16, but on most systems 256 bytes is nothing significant.At runtime, reflect could be optimized by using machine-specific instructions. E.g. on ARM this would correspond to
RBIT
and onx86_64
this would correspond to ??. There is a longstanding bug in GCC to provide a compiler builtin specifically for this purpose.https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50481
It would be nice to quantize the speedup as well. That will likely involve pre-generating a significant amount of random data of varying word-lengths and then calling the function repeatedly, then dividing the total execution time by the number of calls made to
reflect()
.Doing this with C++ templates is pretty straight forward. I would like to look into C11 generics (issue #13) before this is done though, since that could simplify the C API.