Traumflug / Teacup_Firmware

Firmware for RepRap and other 3D printers
http://forums.reprap.org/read.php?147
GNU General Public License v2.0
312 stars · 199 forks

Overflow of int_inv_sqrt() #191

Closed Wurstnase closed 8 years ago

Wurstnase commented 9 years ago

I made an attempt to avoid the overflow of int_inv_sqrt(). https://github.com/Traumflug/Teacup_Firmware/commit/7dc339332f3b3ed3eebef820f1d48c5cedd4d5c4

Wurstnase commented 8 years ago

After some testing, the fastest square root for ARM without FPU is

uint32_t isqrt32_2(uint32_t num32)
{
    // 32-bit square root - thanks to Wilco Dijkstra for this efficient ARM algorithm
    uint32_t res = 0;

    #define iter32(N)           \
    {                   \
      uint32_t temp = res | (1 << N);   \
      if (num32 >= temp << N)       \
      {                 \
        num32 -= temp << N;       \
        res |= 2 << N;          \
      }                 \
    }

    // We need to do 16 iterations
    iter32(15); iter32(14); iter32(13); iter32(12);
    iter32(11); iter32(10); iter32(9); iter32(8);
    iter32(7); iter32(6); iter32(5); iter32(4);
    iter32(3); iter32(2); iter32(1); iter32(0);

    return res >> 1;
}
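
A quick host-side sanity check for this routine (just a sketch; the test scaffolding is mine and assumes the isqrt32_2() definition above is in scope):

#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>

uint32_t isqrt32_2(uint32_t num32);   // the routine above

int main(void)
{
    // Compare against floor(sqrt(x)) for a spread of 32-bit inputs.
    for (uint64_t x = 0; x <= UINT32_MAX; x += 65537) {
        uint32_t expected = (uint32_t)sqrt((double)x);
        assert(isqrt32_2((uint32_t)x) == expected);
    }
    puts("isqrt32_2 OK");
    return 0;
}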

The fastest reciprocal square root for ARM with FPU is

uint16_t int_inv_sqrt(uint32_t a) {
  float result;

  // Convert a to float and take the square root in hardware (VFP).
  __ASM volatile ("VCVT.F32.U32 %0, %1;"
                  "VSQRT.F32 %0, %0;"
                  : "=t" (result)
                  : "t" (a));

  // Scale to a 12-bit fixed-point reciprocal.
  return (uint16_t)(4095 / result);
}
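
For testing off-target, a plain C equivalent of what the two VFP instructions compute (a sketch with a name of my own; like the asm version it assumes a > 0):

#include <math.h>
#include <stdint.h>

// Reference version: same contract as the asm routine above,
// i.e. roughly 4095 / sqrt(a) as a 12-bit fixed-point reciprocal.
uint16_t int_inv_sqrt_ref(uint32_t a)
{
  return (uint16_t)(4095.0f / sqrtf((float)a));
}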

For very high step rates we should also extend the reciprocal to 13-bit precision and replace the 'odd' >> 13 with >> 14. Otherwise the resolution is too low and you can hear very hard speed steps.
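
Rough arithmetic behind that, assuming int_inv_sqrt(n) is scaled like the FPU version above, i.e. about 4096/sqrt(n):

$$
\frac{c_0\,\lfloor 4096/\sqrt{n}\rfloor}{2^{13}}
\;\approx\;
\frac{c_0\,\lfloor 8192/\sqrt{n}\rfloor}{2^{14}}
\;\approx\;
\frac{c_0}{2\sqrt{n}}
$$

Both scalings give the same nominal c, but the 13-bit reciprocal keeps one extra bit, so its quantization error (and the resulting jump between successive step delays at large n) is roughly halved.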

But for AVR I think the 'best' solution is int_sqrt() instead of the reciprocal one.

Wurstnase commented 8 years ago

Forget the reciprocal. Just replace all

dda->c = (pgm_read_dword(&c0_P[dda->fast_axis]) *
                    int_inv_sqrt(dda->n)) >> 13;

with

dda->c = (pgm_read_dword(&c0_P[dda->fast_axis]) /
                    (2 * int_sqrt(dda->n)));

and everything is super smooth.
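
For reference, why this is a drop-in replacement (same 4096/sqrt(n) scaling assumption as above):

$$
\frac{c_0 \cdot \frac{4096}{\sqrt{n}}}{2^{13}}
= \frac{c_0}{2\sqrt{n}}
= \frac{c_0}{2\,\mathrm{int\_sqrt}(n)}
$$

The division form never builds the 16-bit reciprocal at all, so the overflow this issue is about simply cannot happen; the price is a 32-bit division, which is a software routine on AVR.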

I also made/renamed some functions:

#if (__FPU_PRESENT == 1) && (__FPU_USED == 1)
  #define int_sqrt_FPU int_sqrt
#else
  #define int_sqrt_MPU int_sqrt
#endif
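
A sketch of how that rename plays out (asm_sqrt() is the VSQRT-backed helper mentioned later in this thread, its signature here is my assumption; isqrt32_2() is the Wilco Dijkstra routine above; the file layout is also an assumption):

#include <stdint.h>

#if (__FPU_PRESENT == 1) && (__FPU_USED == 1)
  #define int_sqrt_FPU int_sqrt

uint16_t asm_sqrt(uint32_t a);        // VSQRT-backed helper, assumed signature

// Preprocesses to `uint16_t int_sqrt(uint32_t a)`, so callers keep
// writing int_sqrt(dda->n) and transparently get the FPU version.
uint16_t int_sqrt_FPU(uint32_t a) {
  return asm_sqrt(a);
}

#else
  #define int_sqrt_MPU int_sqrt

uint32_t isqrt32_2(uint32_t num32);   // Wilco Dijkstra routine above

// Same trick: on cores without an FPU this one ends up named int_sqrt.
uint16_t int_sqrt_MPU(uint32_t a) {
  return (uint16_t)isqrt32_2(a);
}

#endif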

phord commented 8 years ago

On the Atmel the shift operations are very slow. It can only shift one bit at a time. I think your first version could be reordered to allow it to be optimized better, but maybe the compiler already noticed this and takes care of it for you. I haven't tried it.

In particular, the shifts by compile-time constants are fast, so the only potentially slow operation is temp << N, I think.
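
A minimal illustration of that point (helper names are mine; the exact code avr-gcc emits depends on version and optimization level):

#include <stdint.h>

// Shift by a compile-time constant: the compiler can emit a fixed,
// short sequence (byte moves plus a few single-bit shifts).
uint32_t shift_const(uint32_t x) {
  return x << 13;
}

// Shift by a run-time value: the AVR has no barrel shifter, so this
// generally becomes a loop shifting one bit per iteration.
uint32_t shift_var(uint32_t x, uint8_t n) {
  return x << n;
}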

Wurstnase commented 8 years ago

The Wilco algorithm is very fast. Any test or loop produces more overhead and will slow it down. I searched again but I can't find a comparison. If I remember correctly, this algorithm needs 3 cycles per bit plus an overhead of 3, and 51 cycles for the complete square root is not that much :)

Traumflug commented 8 years ago

"Any test or loop produces more overhead and will slow it down."

Not a problem as long as you profile each algorithm with the same loop.

Wurstnase commented 8 years ago

A loop itself, like for (i = 15; i >= 0; i--) iter32(i);, is slower than the 16 macro invocations in a row.
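
For comparison, the rolled-up variant would look roughly like this (a sketch); the work per bit is the same, but the loop counter, compare and branch add overhead on every iteration:

uint32_t isqrt32_loop(uint32_t num32)
{
  uint32_t res = 0;
  int8_t N;

  // Same bit-by-bit refinement as iter32(), just in a loop.
  for (N = 15; N >= 0; N--) {
    uint32_t temp = res | (1UL << N);
    if (num32 >= temp << N) {
      num32 -= temp << N;
      res |= 2UL << N;
    }
  }

  return res >> 1;
}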

Wurstnase commented 8 years ago

Here is some text about it: https://gist.github.com/Wurstnase/a428edd40768428b0ad1

I can't find the comparison page.

Wurstnase commented 8 years ago

Added the idea for different CPUs. https://github.com/Traumflug/Teacup_Firmware/commit/bdecc8bbdf2bd02d0d0d495bca395470b0dc835c

Wurstnase commented 8 years ago

After thinking about dda_math.c, I think the best approach is to create dda_math-cortexm4.c and similar files. With that we can get the best performance for each architecture.
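
A sketch of what such a file could look like (file name taken from the comment; the architecture guard is my assumption so the file compiles to nothing on other targets, and __ASM is the CMSIS macro as in the code posted above):

/* dda_math-cortexm4.c -- Cortex-M4 specific math helpers (sketch). */

#ifdef __ARM_ARCH_7EM__   /* GCC defines this when targeting ARMv7E-M */

#include <stdint.h>

/* FPU-backed reciprocal square root, as posted earlier in this thread. */
uint16_t int_inv_sqrt(uint32_t a) {
  float result;

  __ASM volatile ("VCVT.F32.U32 %0, %1;"
                  "VSQRT.F32 %0, %0;"
                  : "=t" (result)
                  : "t" (a));

  return (uint16_t)(4095 / result);
}

#endif /* __ARM_ARCH_7EM__ */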

Wurstnase commented 8 years ago

Some test results for int_inv_sqrt().

On the Nucleo:

Binary search ~ 253 cycles
0x1000 / asm_sqrt() ~ 92 cycles
Quake ~ 84 cycles
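
For context, "Quake" presumably means the classic Quake III fast inverse square root, roughly this (sketched with memcpy instead of the original pointer cast to stay clear of strict aliasing):

#include <stdint.h>
#include <string.h>

float quake_rsqrt(float number)
{
  float x2 = number * 0.5f;
  float y  = number;
  int32_t i;

  memcpy(&i, &y, sizeof i);          // reinterpret the float bits
  i = 0x5f3759df - (i >> 1);         // magic initial guess
  memcpy(&y, &i, sizeof y);
  y = y * (1.5f - x2 * y * y);       // one Newton-Raphson step

  return y;
}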

Edit: an else without braces outwitted me...

Wurstnase commented 8 years ago

I"m a bit confused.

This takes ~ 82 cycles? dda->c = (c0_P[dda->fast_axis] * int_inv_sqrt_n);

This takes ~ 197 cycles? dda->c = (c0_P[dda->fast_axis] * int_inv_sqrt_n) >> 13;

This looks like too much. Maybe SimulAVR is wrong here?

Traumflug commented 8 years ago

There are several possible reasons why seemingly small changes can cause large changes in runtime and/or binary size.

If you don't trust SimulAVR, the only possible verification method is to upload that code to a real controller and hook up a scope. With a 1 GHz scope one can easily measure spikes generated like this:

WRITE(DEBUG_LED, 1);
WRITE(DEBUG_LED, 0);

This generates a spike of 2 clock cycles; any additional width of the spike is due to the code in between. Most likely this works with a scope sample rate of 500 or even 50 MHz, too.
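
For example, to time the statement from the question above, the spike would bracket just that line (a sketch; DEBUG_LED has to be configured as in Teacup's debug setup):

WRITE(DEBUG_LED, 1);
dda->c = (c0_P[dda->fast_axis] * int_inv_sqrt_n) >> 13;   // code under test
WRITE(DEBUG_LED, 0);

Subtract the two cycles of an empty pair and the remaining width is the cost of the statement in between.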

Wurstnase commented 8 years ago

I will check it with my oscilloscope.

Could it be faster to use a struct, like the one below? Forget the idea...

typedef struct {
  union {
    struct {
      uint8_t first;    // low byte on the little-endian targets used here
      uint8_t second;   // high byte
    };
    uint16_t all;
  };
} my16_t;

my16_t value;
value.all = 0x12ff;
uint8_t new_value = value.first;    // reads one byte without shifting

Wurstnase commented 8 years ago

Closing this issue and linking it to https://github.com/Traumflug/Teacup_Firmware/issues/69.

Just remember that the (uint16_t) cast of an int32_t intermediate in int_inv_sqrt() can overflow.
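
To illustrate that closing remark, a minimal standalone example of what the cast does once the 32-bit intermediate exceeds 65535 (the value here is made up):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
  int32_t intermediate = 74565;                 // 0x12345, > UINT16_MAX
  uint16_t result = (uint16_t)intermediate;     // silently truncated modulo 65536

  printf("%ld -> %u\n", (long)intermediate, (unsigned)result);   // 74565 -> 9029
  return 0;
}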