cfriedt / crcx

LibCRCx
https://cfriedt.github.com/crcx
MIT License

Optimize reflect #31

Open cfriedt opened 4 years ago

cfriedt commented 4 years ago

There is a pretty simple way to optimize reflect() in the case where N (the number of bits) is a multiple of 4 (or 8). We can use a LUT of length 16 (or 256) where each entry is simply the bit-reversed value of its index. Each reflect() call would then take N/4 (or N/8) table lookups rather than N single-bit operations. On embedded systems we may want to use the 16-entry LUT, but on most systems 256 bytes is nothing significant.

At runtime, reflect() could be optimized by using machine-specific instructions. E.g. on ARM this would correspond to RBIT, and on x86_64 this would correspond to ??. There is a longstanding GCC feature request for a compiler builtin specifically for this purpose.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50481
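Until such a builtin lands in GCC, one option is to use Clang's __builtin_bitreverse32 where available and fall back to the classic O(log N) swap-halves technique. A sketch (reflect32_fast is an illustrative name, not part of the library):

```cpp
#include <cstdint>

// Sketch: prefer a compiler builtin when available (Clang provides
// __builtin_bitreverse32; GCC currently has no equivalent), falling
// back to a portable divide-and-conquer bit reversal.
static inline uint32_t reflect32_fast(uint32_t x) {
#if defined(__clang__)
  return __builtin_bitreverse32(x);
#else
  // swap progressively larger bit groups: pairs, nibbles, bytes, words
  x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
  x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
  x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
  x = (x << 24) | ((x & 0xFF00u) << 8) | ((x >> 8) & 0xFF00u) | (x >> 24);
  return x;
#endif
}
```

Both branches produce identical results, so the builtin can be dropped in transparently wherever the compiler supports it.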

It would be nice to quantify the speedup as well. That will likely involve pre-generating a significant amount of random data of varying word lengths, calling the function repeatedly, and then dividing the total execution time by the number of calls made to reflect().

Doing this with C++ templates is pretty straightforward. I would like to look into C11 generics (issue #13) before this is done though, since that could simplify the C API.

cfriedt commented 4 years ago

Here is a potential candidate for speeding up compile-time const reflections. Of course, runtime reflections would be better suited to machine-specific instructions / inline assembly.

#include <cstddef>
#include <cstdint>

/* N.B. a local array in a constexpr function requires C++14 or later */
constexpr uint8_t bitrev4(const uint8_t& nibble) {
  const uint8_t lut[] = {
    /* [0b0000] = */ 0b0000,
    /* [0b0001] = */ 0b1000,
    /* [0b0010] = */ 0b0100,
    /* [0b0011] = */ 0b1100,
    /* [0b0100] = */ 0b0010,
    /* [0b0101] = */ 0b1010,
    /* [0b0110] = */ 0b0110,
    /* [0b0111] = */ 0b1110,
    /* [0b1000] = */ 0b0001,
    /* [0b1001] = */ 0b1001,
    /* [0b1010] = */ 0b0101,
    /* [0b1011] = */ 0b1101,
    /* [0b1100] = */ 0b0011,
    /* [0b1101] = */ 0b1011,
    /* [0b1110] = */ 0b0111,
    /* [0b1111] = */ 0b1111,
  };
  return lut[nibble & 0xf];
}

constexpr uint8_t bitrev8(const uint8_t& byte) {
  return
    0
    | (bitrev4(byte >> 0) << 4)
    | (bitrev4(byte >> 4) << 0)
    ;
}

template <typename T>
constexpr T bitrev(const T& x)
{
  constexpr size_t N = sizeof(x); /* width in bytes */
  T y(0);
  for (size_t k = 0; k < N; ++k) {
    /* reverse byte k and place it at the mirrored byte position */
    uint8_t a = uint8_t(x >> (k * 8));
    y |= T(bitrev8(a)) << ((N - 1 - k) * 8);
  }
  return y;
}

Ultimately we would like to avoid reflecting input at runtime.

There is a special case highlighted in #63 - where "reflected tables" are generated instead of specifying "reflect input" and "reflect output".

The generated table is entirely incompatible with the non-reflected one, but it eliminates any computationally expensive runtime reflection for that special case.