After some testing, the fastest square root for ARM without FPU is:
uint32_t isqrt32_2(uint32_t num32)
{
  // 32-bit square root - thanks to Wilco Dijkstra for this efficient ARM algorithm
  uint32_t res = 0;

  #define iter32(N) \
  { \
    uint32_t temp = res | (1 << N); \
    if (num32 >= temp << N) \
    { \
      num32 -= temp << N; \
      res |= 2 << N; \
    } \
  }

  // We need to do 16 iterations.
  iter32(15); iter32(14); iter32(13); iter32(12);
  iter32(11); iter32(10); iter32(9);  iter32(8);
  iter32(7);  iter32(6);  iter32(5);  iter32(4);
  iter32(3);  iter32(2);  iter32(1);  iter32(0);

  return res >> 1;
}
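A quick way to sanity-check this off-target is to compare it against the libm result. A minimal sketch of such a test harness (my addition, not part of the firmware; it assumes isqrt32_2() from above is compiled into the same file):

#include <stdint.h>
#include <stdio.h>
#include <math.h>

int main(void)
{
  for (uint32_t i = 0; i < 1000000; i++) {
    uint32_t ref = (uint32_t)sqrt((double)i);     // floor(sqrt(i)) as reference
    if (isqrt32_2(i) != ref) {
      printf("mismatch at %lu\n", (unsigned long)i);
      return 1;
    }
  }
  printf("ok\n");
  return 0;
}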
The fastest reciprocal square root for ARM with FPU is:
uint16_t int_inv_sqrt(uint32_t a) {
  float result;

  // Convert to float and take the square root on the VFP unit.
  __ASM volatile ("VCVT.F32.U32 %0, %1;"
                  "VSQRT.F32 %0, %0;"
                  : "=t" (result)
                  : "t" (a));

  // Return the scaled reciprocal, 4095 / sqrt(a).
  return (uint16_t)(4095 / result);
}
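For testing on a PC, a plain-C version with the same scaling could look like this (my sketch, not from the firmware; like the asm version it assumes a > 0):

#include <stdint.h>
#include <math.h>

uint16_t int_inv_sqrt_ref(uint32_t a)
{
  /* Same scaling as the VSQRT version: 4095 / sqrt(a). */
  return (uint16_t)(4095.0f / sqrtf((float)a));
}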
For very high step rates we should also extend the reciprocal to 13-bit precision and replace the 'odd' >> 13 with >> 14. Otherwise the resolution is too low and you can hear very harsh speed steps.
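A sketch of what that 13-bit variant could look like, the way I read the suggestion (int_inv_sqrt13 is a made-up name, untested):

uint16_t int_inv_sqrt13(uint32_t a)
{
  float result;

  __ASM volatile ("VCVT.F32.U32 %0, %1;"
                  "VSQRT.F32 %0, %0;"
                  : "=t" (result)
                  : "t" (a));

  /* 8191 instead of 4095, so the caller shifts by 14 instead of 13. */
  return (uint16_t)(8191.0f / result);
}

with the caller changed accordingly, e.g. dda->c = (pgm_read_dword(&c0_P[dda->fast_axis]) * int_inv_sqrt13(dda->n)) >> 14;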
But for AVR I think the 'best' solution should be the int_sqrt()
instead of the reciprocal one.
Forget the reciprocal. Just replace all
dda->c = (pgm_read_dword(&c0_P[dda->fast_axis]) *
int_inv_sqrt(dda->n)) >> 13;
with
dda->c = (pgm_read_dword(&c0_P[dda->fast_axis]) /
(2 * int_sqrt(dda->n)));
and everything is super smooth.
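For reference (my own arithmetic, not from the thread): the reciprocal form computes c0 * (4095 / sqrt(n)) / 8192, and since 4095 / 8192 is almost exactly 1/2, that approximates c0 / (2 * sqrt(n)), which is what the int_sqrt() form computes directly. For example, with c0 = 1000000 and n = 100 the reciprocal form gives (1000000 * 409) >> 13 = 49926, while the division form gives 1000000 / (2 * 10) = 50000.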
Also, I made/renamed some functions:
#if (__FPU_PRESENT == 1) && (__FPU_USED == 1)
#define int_sqrt_FPU int_sqrt
#else
#define int_sqrt_MPU int_sqrt
#endif
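As I read those macros, both implementations keep their own names in the source and the selected one is mapped onto the generic name. A sketch of how that might look (prototypes only, my interpretation):

/* dda_maths.h (sketch): with the #define above, whichever of these matches
   the target is compiled under the name int_sqrt().                        */
uint16_t int_sqrt_FPU(uint32_t a);  /* VSQRT-based version                  */
uint16_t int_sqrt_MPU(uint32_t a);  /* bit-by-bit Wilco Dijkstra version    */

/* Callers simply write int_sqrt(dda->n) and get the right one. */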
On the Atmel the shift operations are very slow. It can only shift one bit at a time. I think your first version could be reordered to allow it to be optimized better, but maybe the compiler already noticed this and takes care of it for you. I haven't tried it.
In particular the constant shifts are fast (compile-time), so the only potentially slow operation is "temp<<n", I think.
The Wilco algorithm is very fast. Any test or loop produces more overhead and slows it down. I searched again but I can't find a comparison. But if I remember correctly this algorithm needs 3 cycles per bit with an overhead of 3. 51 cycles for the complete square root is not that much :)
> Any test or loop produces more overhead and slows it down.
Not a problem as long as you profile each algorithm with the same loop.
A loop itself, like for (i = 15; i >= 0; i--) iter32(i);, is slower than the 16 macro calls in a row.
Here is some text about it: https://gist.github.com/Wurstnase/a428edd40768428b0ad1
I can't find the comparison page.
Added the idea for different CPUs. https://github.com/Traumflug/Teacup_Firmware/commit/bdecc8bbdf2bd02d0d0d495bca395470b0dc835c
After thinking about dda_math.c, I think it would be best to create dda_math-cortexm4.c and similar files. With that we can get the best performance for each architecture.
Some tests of int_inv_sqrt() on the Nucleo:
Binary search: ~253 cycles
0x1000 / asm_sqrt(): ~92 cycles
Quake: ~84 cycles
Edit: an else without braces outwitted me...
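For reference, cycle counts like the ones above can be taken with the Cortex-M DWT cycle counter. A minimal sketch, assuming a CMSIS device header for an STM32F4-based Nucleo (not necessarily how these numbers were measured):

#include "stm32f4xx.h"   /* assumption: STM32F4 Nucleo, pulls in core_cm4.h */

static void cyccnt_init(void)
{
  CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the DWT unit     */
  DWT->CYCCNT = 0;
  DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             /* start the cycle counter */
}

static uint32_t cycles_for_inv_sqrt(uint32_t n)
{
  uint32_t start = DWT->CYCCNT;
  volatile uint16_t r = int_inv_sqrt(n);           /* keep the call alive     */
  (void)r;
  return DWT->CYCCNT - start;                      /* includes call overhead  */
}

Call cyccnt_init() once at startup, then cycles_for_inv_sqrt(0x1000) returns the measured cycle count.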
I'm a bit confused.
This takes ~82 cycles?
dda->c = (c0_P[dda->fast_axis] * int_inv_sqrt_n);
This takes ~197 cycles?
dda->c = (c0_P[dda->fast_axis] * int_inv_sqrt_n) >> 13;
This looks like too much. Maybe SimulAVR is wrong here?
There are several possible reasons why seemingly small changes can cause large changes in runtime and/or binary size.
If you don't trust SimulAVR, the only possible verification method is to upload that code to a real controller and hook up a scope. With a 1 GHz scope one can easily measure spikes generated like this:
WRITE(DEBUG_LED, 1);
WRITE(DEBUG_LED, 0);
This generates a spike of 2 clock cycles; any additional width of the spike is due to code placed in between. Most likely this works with a scope sample rate of 500 or even 50 MHz, too.
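To time a particular routine that way, the code under test goes between the two writes; subtracting the 2-cycle baseline gives its runtime. Sketch (result is just a local variable):

WRITE(DEBUG_LED, 1);
result = int_sqrt(dda->n);   /* code under test; spike widens by its runtime */
WRITE(DEBUG_LED, 0);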
I will check it with my oscilloscope.
Could it be faster to use a struct? Like:
Forget the idea...
typedef union {
  struct {
    uint8_t first;
    uint8_t second;
  };
  uint16_t all;
} my16_t;

my16_t value;
value.all = 0x12ff;
uint8_t new_value = value.first;
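Note (my addition): on AVR and on the Cortex-M parts discussed here, which are little-endian, first is the low byte, so after value.all = 0x12ff you get value.first == 0xff and value.second == 0x12.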
Closing this issue and cross-referencing it to https://github.com/Traumflug/Teacup_Firmware/issues/69.
Just remember that casting an int32_t to (uint16_t) in int_inv_sqrt() can overflow.
I made an attempt to avoid the overflow of int_inv_sqrt(). https://github.com/Traumflug/Teacup_Firmware/commit/7dc339332f3b3ed3eebef820f1d48c5cedd4d5c4
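One simple way to guard the cast in the int_inv_sqrt() shown earlier, for comparison (my sketch; not necessarily what the linked commit does, and the function name is made up):

uint16_t int_inv_sqrt_safe(uint32_t a)
{
  float result;

  if (a == 0)
    return 65535;               /* avoid the division by zero below */

  __ASM volatile ("VCVT.F32.U32 %0, %1;"
                  "VSQRT.F32 %0, %0;"
                  : "=t" (result)
                  : "t" (a));

  float inv = 4095.0f / result;
  return (inv > 65535.0f) ? 65535 : (uint16_t)inv;  /* clamp before casting */
}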