Traumflug / Teacup_Firmware

Firmware for RepRap and other 3D printers
http://forums.reprap.org/read.php?147
GNU General Public License v2.0
312 stars · 199 forks

Overflow of int_inv_sqrt() #191

Closed Wurstnase closed 8 years ago

Wurstnase commented 9 years ago

I made an attempt to avoid the overflow of int_inv_sqrt(). https://github.com/Traumflug/Teacup_Firmware/commit/7dc339332f3b3ed3eebef820f1d48c5cedd4d5c4

Wurstnase commented 8 years ago

After some testing, the fastest square root for ARM without FPU is

uint32_t isqrt32_2(uint32_t num32)
{
    // 32-bit square root - thanks to Wilco Dijkstra for this efficient ARM algorithm
    uint32_t res = 0;

    #define iter32(N)           \
    {                   \
      uint32_t temp = res | (1 << N);   \
      if (num32 >= temp << N)       \
      {                 \
        num32 -= temp << N;       \
        res |= 2 << N;          \
      }                 \
    }

    // We need to do 16 iterations
    iter32(15); iter32(14); iter32(13); iter32(12);
    iter32(11); iter32(10); iter32(9); iter32(8);
    iter32(7); iter32(6); iter32(5); iter32(4);
    iter32(3); iter32(2); iter32(1); iter32(0);

    return res >> 1;
}
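
A quick host-side sanity check for this routine (just a sketch; the test scaffolding is mine and assumes the isqrt32_2() definition above is in scope):

#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>

uint32_t isqrt32_2(uint32_t num32);   // the routine above

int main(void)
{
    // Compare against floor(sqrt(x)) for a spread of 32-bit inputs.
    for (uint64_t x = 0; x <= UINT32_MAX; x += 65537) {
        uint32_t expected = (uint32_t)sqrt((double)x);
        assert(isqrt32_2((uint32_t)x) == expected);
    }
    puts("isqrt32_2 OK");
    return 0;
}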

The fastest reciprocal square root for ARM with FPU is

uint16_t int_inv_sqrt(uint32_t a) {
  float result;

  // Convert a to float and take the square root in hardware (VFP).
  __ASM volatile ("VCVT.F32.U32 %0, %1;"
                  "VSQRT.F32 %0, %0;"
                  : "=t" (result)
                  : "t" (a));

  // Scale to a 12-bit fixed-point reciprocal.
  return (uint16_t)(4095 / result);
}
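
For testing off-target, a plain C equivalent of what the two VFP instructions compute (a sketch with a name of my own; like the asm version it assumes a > 0):

#include <math.h>
#include <stdint.h>

// Reference version: same contract as the asm routine above,
// i.e. roughly 4095 / sqrt(a) as a 12-bit fixed-point reciprocal.
uint16_t int_inv_sqrt_ref(uint32_t a)
{
  return (uint16_t)(4095.0f / sqrtf((float)a));
}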

For very high step rates we should also extend the reciprocal to 13-bit precision and replace the 'odd' >> 13 with >> 14. Otherwise the resolution is too low and you can hear very hard speed steps.
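
Rough arithmetic behind that, assuming int_inv_sqrt(n) is scaled like the FPU version above, i.e. about 4096/sqrt(n):

$$
\frac{c_0\,\lfloor 4096/\sqrt{n}\rfloor}{2^{13}}
\;\approx\;
\frac{c_0\,\lfloor 8192/\sqrt{n}\rfloor}{2^{14}}
\;\approx\;
\frac{c_0}{2\sqrt{n}}
$$

Both scalings give the same nominal c, but the 13-bit reciprocal keeps one extra bit, so its quantization error (and the resulting jump between successive step delays at large n) is roughly halved.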

But for AVR I think the 'best' solution is int_sqrt() instead of the reciprocal one.

Wurstnase commented 8 years ago

Forget the reciprocal. Just replace all

dda->c = (pgm_read_dword(&c0_P[dda->fast_axis]) *
                    int_inv_sqrt(dda->n)) >> 13;

with

dda->c = (pgm_read_dword(&c0_P[dda->fast_axis]) /
                    (2 * int_sqrt(dda->n)));

and everything is super smooth.
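
For reference, why this is a drop-in replacement (same 4096/sqrt(n) scaling assumption as above):

$$
\frac{c_0 \cdot \frac{4096}{\sqrt{n}}}{2^{13}}
= \frac{c_0}{2\sqrt{n}}
= \frac{c_0}{2\,\mathrm{int\_sqrt}(n)}
$$

The division form never builds the 16-bit reciprocal at all, so the overflow this issue is about simply cannot happen; the price is a 32-bit division, which is a software routine on AVR.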

I also made/renamed some functions:

#if (__FPU_PRESENT == 1) && (__FPU_USED == 1)
  #define int_sqrt_FPU int_sqrt
#else
  #define int_sqrt_MPU int_sqrt
#endif
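
A sketch of how that rename plays out (asm_sqrt() is the VSQRT-backed helper mentioned later in this thread, its signature here is my assumption; isqrt32_2() is the Wilco Dijkstra routine above; the file layout is also an assumption):

#include <stdint.h>

#if (__FPU_PRESENT == 1) && (__FPU_USED == 1)
  #define int_sqrt_FPU int_sqrt

uint16_t asm_sqrt(uint32_t a);        // VSQRT-backed helper, assumed signature

// Preprocesses to `uint16_t int_sqrt(uint32_t a)`, so callers keep
// writing int_sqrt(dda->n) and transparently get the FPU version.
uint16_t int_sqrt_FPU(uint32_t a) {
  return asm_sqrt(a);
}

#else
  #define int_sqrt_MPU int_sqrt

uint32_t isqrt32_2(uint32_t num32);   // Wilco Dijkstra routine above

// Same trick: on cores without an FPU this one ends up named int_sqrt.
uint16_t int_sqrt_MPU(uint32_t a) {
  return (uint16_t)isqrt32_2(a);
}

#endif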

phord commented 8 years ago

On the Atmel the shift operations are very slow. It can only shift one bit at a time. I think your first version could be reordered to allow it to be optimized better, but maybe the compiler already noticed this and takes care of it for you. I haven't tried it.

In particular, the shifts by compile-time constants are fast, so the only potentially slow operation is temp << N, I think.
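
A minimal illustration of that point (helper names are mine; the exact code avr-gcc emits depends on version and optimization level):

#include <stdint.h>

// Shift by a compile-time constant: the compiler can emit a fixed,
// short sequence (byte moves plus a few single-bit shifts).
uint32_t shift_const(uint32_t x) {
  return x << 13;
}

// Shift by a run-time value: the AVR has no barrel shifter, so this
// generally becomes a loop shifting one bit per iteration.
uint32_t shift_var(uint32_t x, uint8_t n) {
  return x << n;
}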

Wurstnase commented 8 years ago

The Wilco algorithm is very fast. Any test or loop produces more overhead and will slow it down. I searched again but I can't find a comparison. If I remember correctly, this algorithm needs 3 cycles per bit plus an overhead of 3, and 51 cycles for the complete square root is not that much :)

Traumflug commented 8 years ago

"Any test or loop produces more overhead and will slow it down."

Not a problem as long as you profile each algorithm with the same loop.

Wurstnase commented 8 years ago

A loop itself, like for (i = 15; i >= 0; i--) iter32(i);, is slower than the 16 macro invocations in a row.
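
For comparison, the rolled-up variant would look roughly like this (a sketch); the work per bit is the same, but the loop counter, compare and branch add overhead on every iteration:

uint32_t isqrt32_loop(uint32_t num32)
{
  uint32_t res = 0;
  int8_t N;

  // Same bit-by-bit refinement as iter32(), just in a loop.
  for (N = 15; N >= 0; N--) {
    uint32_t temp = res | (1UL << N);
    if (num32 >= temp << N) {
      num32 -= temp << N;
      res |= 2UL << N;
    }
  }

  return res >> 1;
}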

Wurstnase commented 8 years ago

Here is some text about it: https://gist.github.com/Wurstnase/a428edd40768428b0ad1

I can't find the comparison page.

Wurstnase commented 8 years ago

Added the idea for different CPUs. https://github.com/Traumflug/Teacup_Firmware/commit/bdecc8bbdf2bd02d0d0d495bca395470b0dc835c

Wurstnase commented 8 years ago

After thinking about dda_math.c, I think the best approach is to create dda_math-cortexm4.c and similar files. With that we can get the best performance for each architecture.
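
A sketch of what such a file could look like (file name taken from the comment; the architecture guard is my assumption so the file compiles to nothing on other targets, and __ASM is the CMSIS macro as in the code posted above):

/* dda_math-cortexm4.c -- Cortex-M4 specific math helpers (sketch). */

#ifdef __ARM_ARCH_7EM__   /* GCC defines this when targeting ARMv7E-M */

#include <stdint.h>

/* FPU-backed reciprocal square root, as posted earlier in this thread. */
uint16_t int_inv_sqrt(uint32_t a) {
  float result;

  __ASM volatile ("VCVT.F32.U32 %0, %1;"
                  "VSQRT.F32 %0, %0;"
                  : "=t" (result)
                  : "t" (a));

  return (uint16_t)(4095 / result);
}

#endif /* __ARM_ARCH_7EM__ */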

Wurstnase commented 8 years ago

Some test results for int_inv_sqrt().

On the Nucleo:

Binary search ~ 253 cycles
0x1000 / asm_sqrt() ~ 92 cycles
Quake ~ 84 cycles
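
For context, "Quake" presumably means the classic Quake III fast inverse square root, roughly this (sketched with memcpy instead of the original pointer cast to stay clear of strict aliasing):

#include <stdint.h>
#include <string.h>

float quake_rsqrt(float number)
{
  float x2 = number * 0.5f;
  float y  = number;
  int32_t i;

  memcpy(&i, &y, sizeof i);          // reinterpret the float bits
  i = 0x5f3759df - (i >> 1);         // magic initial guess
  memcpy(&y, &i, sizeof y);
  y = y * (1.5f - x2 * y * y);       // one Newton-Raphson step

  return y;
}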

Edit: an else without braces outwitted me...

Wurstnase commented 8 years ago

I"m a bit confused.

This takes ~ 82 cycles? dda->c = (c0_P[dda->fast_axis] * int_inv_sqrt_n);

This takes ~ 197 cycles? dda->c = (c0_P[dda->fast_axis] * int_inv_sqrt_n) >> 13;

This looks like too much. Maybe SimulAVR is wrong here?

Traumflug commented 8 years ago

There are several possible reasons why seemingly small changes can cause large changes in runtime and/or binary size.

If you don't trust SimulAVR, the only possible verification method is to upload that code to a real controller and hook up a scope. With a 1 GHz scope one can easily measure spikes generated like this:

WRITE(DEBUG_LED, 1);
WRITE(DEBUG_LED, 0);

This generates a spike of 2 clock cycles; any additional width of the spike is due to the code in between. Most likely this works with a scope sample rate of 500 or even 50 MHz, too.
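
For example, to time the statement from the question above, the spike would bracket just that line (a sketch; DEBUG_LED has to be configured as in Teacup's debug setup):

WRITE(DEBUG_LED, 1);
dda->c = (c0_P[dda->fast_axis] * int_inv_sqrt_n) >> 13;   // code under test
WRITE(DEBUG_LED, 0);

Subtract the two cycles of an empty pair and the remaining width is the cost of the statement in between.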

Wurstnase commented 8 years ago

I will check it with my oscilloscope.

Could it be faster to use a struct, like the one below? Forget the idea...

typedef struct {
  union {
    struct {
      uint8_t first;    // low byte on the little-endian targets used here
      uint8_t second;   // high byte
    };
    uint16_t all;
  };
} my16_t;

my16_t value;
value.all = 0x12ff;
uint8_t new_value = value.first;    // reads one byte without shifting

Wurstnase commented 8 years ago

Closing this issue and linking it to https://github.com/Traumflug/Teacup_Firmware/issues/69.

Just remember that the (uint16_t) cast of an int32_t intermediate in int_inv_sqrt() can overflow.
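
To illustrate that closing remark, a minimal standalone example of what the cast does once the 32-bit intermediate exceeds 65535 (the value here is made up):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
  int32_t intermediate = 74565;                 // 0x12345, > UINT16_MAX
  uint16_t result = (uint16_t)intermediate;     // silently truncated modulo 65536

  printf("%ld -> %u\n", (long)intermediate, (unsigned)result);   // 74565 -> 9029
  return 0;
}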