Closed risicle closed 3 months ago
This is a very cool SIMD-based implementation. That said, Dexed needs to build on as many platforms as possible, so this PR cannot be merged because of the Highway dependency.
PS: I added your fork to the Dexed README.md; thanks for sharing!
This is an offshoot of some experimentation I was doing with dexed. I wasn't really planning on developing it further, but in case it's useful to anyone I'll present it here.
This uses Google's Highway library to add SIMD versions of the most expensive parts of the synthesis. My crude testing suggests modest speed improvements of 10-20% for targets from SSE2 through AVX2, but on an AVX-512 machine this easily doubles speed for me. An ARM NEON system showed an embarrassing 4% acceleration.
Dexed doesn't have a test suite, but comparing the output against the existing scalar implementation showed a maximum relative error of ~0.003 between the two, which is presumably attributable to a different order of operations in some places.
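For anyone wanting to repeat that comparison, here is a minimal sketch of one way to compute a maximum relative error between two rendered buffers. It is not the code used for the figure above: the helper name `max_relative_error`, the `floor` parameter, and the choice to divide by the scalar (reference) sample are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Dexed or of this PR.
// Returns the largest |reference[i] - candidate[i]| / max(|reference[i]|, floor).
// `floor` (an assumed guard value) avoids dividing by near-zero samples,
// which would otherwise dominate the result around zero crossings.
double max_relative_error(const std::vector<float>& reference,
                          const std::vector<float>& candidate,
                          double floor = 1e-6) {
    assert(reference.size() == candidate.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double ref = reference[i];
        const double diff = std::fabs(ref - static_cast<double>(candidate[i]));
        const double scale = std::max(std::fabs(ref), floor);
        worst = std::max(worst, diff / scale);
    }
    return worst;
}
```

Rendering the same patch and note sequence through the scalar and vectorized builds and passing both buffers to a helper like this gives a single number to track across changes.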
I don't know whether you'd ever actually want to make dexed depend on libhwy, and this would need a bit more polish before it could be merged: I've tested it on only a limited variety of machines and architectures, haven't included an option to disable vectorization support, and have only configured libhwy for single dispatch. There is no single binary that detects CPU extensions at runtime, though I don't imagine that would be too hard to set up.
The feedback-based operator loops are way too hard to vectorize, so they are left alone.
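To illustrate why feedback gets in the way, here is a deliberately simplified floating-point sketch. It is not Dexed's operator code, and the function names and parameters (`render_plain`, `render_feedback`, `phase_inc`, `feedback`) are hypothetical. Without feedback, every output sample can be computed independently, so several fit into SIMD lanes at once. With feedback, each sample needs the previous output, which creates a loop-carried dependency.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// No feedback: sample i depends only on i, so iterations are independent
// and a batch of them can be evaluated in parallel SIMD lanes.
std::vector<float> render_plain(std::size_t n, float phase_inc) {
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::sin(phase_inc * static_cast<float>(i));
    }
    return out;
}

// With feedback: sample i reads the output of sample i - 1 (`prev`),
// so iteration i cannot start until iteration i - 1 has finished.
std::vector<float> render_feedback(std::size_t n, float phase_inc, float feedback) {
    std::vector<float> out(n);
    float prev = 0.0f;  // previous output sample, fed back into the phase
    for (std::size_t i = 0; i < n; ++i) {
        prev = std::sin(phase_inc * static_cast<float>(i) + feedback * prev);
        out[i] = prev;
    }
    return out;
}
```

With `feedback` set to zero the two functions produce the same buffer; any non-zero value makes each sample depend on the one before it.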