Now that the decoder kernels have been converted to inline functions (#59), the inner decoding loop can be moved into separate inline functions of its own.
Because the number of remaining loop iterations is known, we can split calls to the inner loop into long unrolled stretches. Tests show that this can result in a significant speedup.
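As a rough illustration of the idea, here is a minimal sketch in C. The names `dec_loop_inner` and `dec_loop`, the stretch lengths of 8 and 4, and the 4-byte copy that stands in for a real decoding round are all assumptions made for this example, not the actual code in this branch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for one round of a decoder kernel. A real kernel
// would decode a block of input here; this placeholder just copies 4 bytes
// and advances both pointers.
static inline void
dec_loop_inner(const uint8_t **s, uint8_t **o)
{
	memcpy(*o, *s, 4);
	*s += 4;
	*o += 4;
}

// Hypothetical outer loop. Because the number of remaining rounds is known,
// the calls to the inner function can be grouped into unrolled stretches.
static inline void
dec_loop(const uint8_t **s, uint8_t **o, size_t rounds)
{
	// Long stretches: eight rounds per loop iteration.
	while (rounds >= 8) {
		dec_loop_inner(s, o);
		dec_loop_inner(s, o);
		dec_loop_inner(s, o);
		dec_loop_inner(s, o);
		dec_loop_inner(s, o);
		dec_loop_inner(s, o);
		dec_loop_inner(s, o);
		dec_loop_inner(s, o);
		rounds -= 8;
	}

	// At most one shorter stretch of four rounds can remain.
	if (rounds >= 4) {
		dec_loop_inner(s, o);
		dec_loop_inner(s, o);
		dec_loop_inner(s, o);
		dec_loop_inner(s, o);
		rounds -= 4;
	}

	// Fewer than four rounds are left; handle them one at a time.
	while (rounds > 0) {
		dec_loop_inner(s, o);
		rounds--;
	}
}
```

Since the inner function is inlined, each stretch compiles to straight-line code with a single counter check per stretch instead of one per round. Whether that is a net win depends on the target, so the stretch lengths would need to be tuned per architecture.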
On NEON, however, unrolling the loops appears to cause a significant slowdown rather than a speedup, so this branch should perhaps be held back until the NEON decoders have been made more efficient.