adafruit / Adafruit_MP3

mp3 decoding on arduino

added optimization for 64-bit shift and clip to 16-bits, nets 5-6% improvement #9

Closed bitbank2 closed 4 years ago

bitbank2 commented 4 years ago

Most of the decode time is spent in the PolyphaseStereo() and PolyphaseMono() functions doing 64-bit integer math. The SIMD instructions of the Cortex-M4 take care of most of that, but the 64-bit shift right followed by a clip to 16 bits had room for improvement. I added an inline asm function to shave off a few cycles.
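For readers unfamiliar with the operation being optimized, here is a minimal portable C sketch of "shift a 64-bit accumulator right, then clip to 16 bits". The name shift_clip16 and its parameters are hypothetical and are not taken from the PR; the inline asm in the PR itself is not shown in this thread, so this only illustrates the arithmetic, not the optimized form.

```c
#include <stdint.h>

/* Hypothetical reference helper (not from the PR): shift a 64-bit
 * accumulator right by `shift` bits and saturate the result to the
 * int16_t range. Assumes >> on a negative int64_t is an arithmetic
 * shift, which holds for gcc and clang. */
static inline int16_t shift_clip16(int64_t acc, unsigned shift)
{
    int64_t v = acc >> shift;

    if (v > INT16_MAX)
        return INT16_MAX;   /* clip positive overflow */
    if (v < INT16_MIN)
        return INT16_MIN;   /* clip negative overflow */
    return (int16_t)v;
}
```

On a Cortex-M4, the two comparisons are the part a saturating instruction could plausibly replace, which is presumably where the few saved cycles come from.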

ladyada commented 4 years ago

@bitbank2 hi please rebase to get the travis CI fixes :)

jepler commented 4 years ago

In general, I think that the state of the art of compilers has advanced a lot since src/assembly.h was written, and it doesn't hurt to check whether these fancy wrappers are still needed. It feels like assuming gcc or a compiler with optimization parity with gcc is not that outlandish.

MULSHIFT32 and MADD64 get sensible results when just coded in C, __builtin_clz uses the ARM clz instruction directly, but __builtin_abs creates a branching form.
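As a rough illustration of what "just coded in C" could look like, the sketch below assumes the usual meaning of these helpers: MULSHIFT32 as the high 32 bits of a signed 32x32 product, MADD64 as a 64-bit multiply-accumulate, and CLZ as count-leading-zeros. The function names with the _c suffix are hypothetical, and the zero guard in clz_c is a choice made here because __builtin_clz(0) is undefined.

```c
#include <stdint.h>

/* Assumed semantics: top 32 bits of the signed 64-bit product x * y. */
static inline int32_t mulshift32_c(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 32);
}

/* Assumed semantics: 64-bit accumulate of a signed 32x32 product. */
static inline int64_t madd64_c(int64_t sum, int32_t x, int32_t y)
{
    return sum + (int64_t)x * y;
}

/* Count leading zeros via the gcc/clang builtin. The builtin is
 * undefined for 0, so that case is handled explicitly here. */
static inline int clz_c(uint32_t x)
{
    return x ? __builtin_clz(x) : 32;
}
```

Whether a given compiler turns these into single multiply or clz instructions is best confirmed by inspecting the generated code for the actual target and optimization level.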

Using a Hacker's Delight C implementation for FASTABS gives just 2 instructions, but they're both 32-bit encodings in Thumb mode:

int FASTABS1(int x) {
    int y = (x >> 31);
    return (x ^ y) - y;
}

gives

ea80 73e0   eor.w   r3, r0, r0, asr #31
eba3 70e0   sub.w   r0, r3, r0, asr #31