Speed up the VADPCM decoder and (when it exists) encoder. See #2.
Currently this is committed to the sse2 branch, which is a temporary branch that will be squashed or rebased out of existence. It won't be merged until it is more production-ready.
Needs compile-time conditionals so it's not built on systems without SSE2.
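A minimal sketch of what such a compile-time guard could look like, assuming the usual predefined macros (`__SSE2__` on GCC/Clang, `_M_X64` and `_M_IX86_FP` on MSVC). `VADPCM_USE_SSE2` and `add_sat8` are hypothetical names for illustration, not taken from the sse2 branch:

```c
#include <stdint.h>

/* VADPCM_USE_SSE2 is a hypothetical macro name, not one from the project. */
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define VADPCM_USE_SSE2 1
#include <emmintrin.h>
#else
#define VADPCM_USE_SSE2 0
#endif

/* Add two arrays of eight 16-bit samples with saturation.
   add_sat8 is a stand-in for a real decoder kernel. */
static void add_sat8(const int16_t *a, const int16_t *b, int16_t *out) {
#if VADPCM_USE_SSE2
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epi16(va, vb));
#else
    /* Scalar fallback, built when SSE2 is not enabled for the target. */
    for (int i = 0; i < 8; i++) {
        int32_t s = (int32_t)a[i] + b[i];
        if (s > INT16_MAX) s = INT16_MAX;
        if (s < INT16_MIN) s = INT16_MIN;
        out[i] = (int16_t)s;
    }
#endif
}
```

Both paths produce the same results, so callers do not need to know which one was compiled in.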
May need a run-time check for SSE2 support, although it might make more sense to assume that SSE2 is available on any x86 platform (it is guaranteed on x86-64, but not on older 32-bit processors).
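If a run-time check turns out to be necessary, it could look roughly like this. This is only a sketch: `have_sse2` is a hypothetical helper name, and the 32-bit path relies on the GCC/Clang builtin `__builtin_cpu_supports`, so other compilers would need their own CPUID query:

```c
/* have_sse2 is a hypothetical helper name, not part of the project. */
static int have_sse2(void) {
#if defined(__x86_64__) || defined(_M_X64)
    /* SSE2 is part of the x86-64 baseline, so no check is needed. */
    return 1;
#elif defined(__i386__) && defined(__GNUC__)
    /* GCC and Clang builtins; these query CPUID on 32-bit x86. */
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    /* Non-x86 target, or a compiler this sketch does not cover
       (for example 32-bit MSVC). */
    return 0;
#endif
}
```

The result could be cached once at startup and used to pick between the SSE2 and scalar decoder functions.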
Needs test suite to check for correctness.
Needs benchmarking to show that it's actually faster.
An alternative using _mm_madd_epi16() may actually be faster: it multiplies and adds in one step, processing two samples at a time.
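For reference, here is what `_mm_madd_epi16()` computes, as a small standalone sketch. It is illustrative only and is not the code on the sse2 branch. `madd_pairs` and `SKETCH_HAVE_SSE2` are hypothetical names, and how the samples and predictor coefficients would be interleaved into `a` and `b` is left open:

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* out[i] = a[2*i] * b[2*i] + a[2*i+1] * b[2*i+1], for i = 0..3.
   madd_pairs is a hypothetical name used only for illustration. */
static void madd_pairs(const int16_t a[8], const int16_t b[8],
                       int32_t out[4]) {
#if SKETCH_HAVE_SSE2
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* Eight 16x16 multiplies and four 32-bit adds in one instruction. */
    _mm_storeu_si128((__m128i *)out, _mm_madd_epi16(va, vb));
#else
    /* Scalar equivalent for targets without SSE2. */
    for (int i = 0; i < 4; i++) {
        uint32_t lo = (uint32_t)((int32_t)a[2 * i] * b[2 * i]);
        uint32_t hi = (uint32_t)((int32_t)a[2 * i + 1] * b[2 * i + 1]);
        out[i] = (int32_t)(lo + hi);
    }
#endif
}
```

Because the sums are 32 bits wide, intermediate products do not need to be clamped to 16 bits before being accumulated.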
Note that frame scale values > 12 will overflow this implementation. This could be fixed, but it may also be worth considering whether scale values > 12 are even valid, or whether we could just treat them as ill-formed data.
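The limit of 12 can be checked with a little arithmetic, assuming the residuals are 4-bit signed values (-8 to 7) that are shifted left by the frame scale and held in 16-bit lanes. The helper below is a hypothetical sketch of such a range check, not code from the branch:

```c
#include <stdint.h>

/* Returns 1 if every 4-bit signed residual (-8..7), scaled by
   2^scale, still fits in a signed 16-bit lane.
   scale_fits_int16 is a hypothetical helper name. */
static int scale_fits_int16(int scale) {
    int32_t lo = -8 * ((int32_t)1 << scale); /* most negative value */
    int32_t hi = 7 * ((int32_t)1 << scale);  /* most positive value */
    return lo >= INT16_MIN && hi <= INT16_MAX;
}
```

With a scale of 12 the extremes are -32768 and 28672, which just fit. With a scale of 13 they become -65536 and 57344, which do not.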