bigger optimizations

the main problem, and one that sampling profilers will not catch, is not that particular routines are hot. it is that the entire algorithm runs from top to bottom on every sample, which is all but guaranteed to clobber the whole instruction cache, among other costs.

i have ideas about how to fix this, but unfortunately they are not super easy, and they lead to more complex and error-prone code (which is why the per-sample version is easier to design and debug.)

for example: the most expensive single calculation is the hermite interpolation of written values when upsampling (since it must run multiple times per input sample.) it is possible to "hoist" this calculation to the block level, where it can run far more efficiently by unrolling and using SIMD. to do that, all of its inputs need to be available for a whole audio block at once. in this case that means the input ringbuffer must be extended, and there must be a vector of fractional phase coefficients and a vector of output buffer indices. (NB: the size of each of these vectors is not the number of frames in a block (call this N) but N*R, where R is the maximum upsampling ratio.)
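a rough sketch of what the hoisted kernel could look like, to make the data layout concrete. everything named here is hypothetical (`WritePlan`, `interpWriteBlock`, the `src` vector of tap offsets), and the hermite formula is the common 4-point, 3rd-order form, assumed rather than copied from the existing code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 4-point, 3rd-order hermite (catmull-rom form). x is the fractional phase
// between y1 and y2. assumed form, not taken from the existing source.
inline float hermite(float x, float y0, float y1, float y2, float y3) {
    const float c0 = y1;
    const float c1 = 0.5f * (y2 - y0);
    const float c2 = y0 - 2.5f * y1 + 2.f * y2 - 0.5f * y3;
    const float c3 = 0.5f * (y3 - y0) + 1.5f * (y1 - y2);
    return ((c3 * x + c2) * x + c1) * x + c0;
}

// hypothetical per-block "write plan": one entry per interpolated write.
// each vector is allocated once with capacity N*R.
struct WritePlan {
    std::vector<float> frac;       // fractional phase in [0, 1)
    std::vector<std::size_t> src;  // offset of the first of 4 taps in the input history
    std::vector<std::size_t> dst;  // index into the output buffer
    std::size_t count = 0;         // entries actually used this block
};

// block-level kernel: no state updates and no control flow beyond the loop
// itself, so it is a candidate for unrolling and SIMD.
void interpWriteBlock(const float* in, std::size_t inLen,
                      const WritePlan& plan,
                      float* out, std::size_t outLen) {
    (void)inLen;   // only used by the asserts below
    (void)outLen;
    assert(plan.count <= plan.frac.size());
    assert(plan.count <= plan.src.size());
    assert(plan.count <= plan.dst.size());
    for (std::size_t i = 0; i < plan.count; ++i) {
        assert(plan.src[i] + 3 < inLen);  // 4 taps must fit in the history
        assert(plan.dst[i] < outLen);
        const float* y = in + plan.src[i];
        out[plan.dst[i]] = hermite(plan.frac[i], y[0], y[1], y[2], y[3]);
    }
}
```

the scattered store through `dst` is the part most likely to resist vectorization. one option (untested) is to interpolate into a contiguous scratch buffer first and do the scatter as a separate pass.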

a similar approach can be taken with other heavy subroutines (extending the necessary state to a whole buffer.) a minimal amount of logic (looping, basically) still needs to happen per sample, and it probably involves branches.
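to illustrate the kind of per-sample logic that is left over, here is a sketch of a control pass that only advances the phase and fills the index/coefficient vectors, leaving the arithmetic for a block-level pass. the names (`WritePlan`, `planWrites`) and the simple phase-accumulator model (a fixed `ratio` of output frames per input frame) are assumptions for illustration, not the existing implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// hypothetical per-block plan; each vector is preallocated to N*R entries.
struct WritePlan {
    std::vector<float> frac;       // fractional phase in [0, 1) per write
    std::vector<std::size_t> src;  // input frame the write interpolates from
    std::vector<std::size_t> dst;  // wrapped index into the output buffer
    std::size_t count = 0;         // entries filled this block
};

// per-sample control pass. this is where the branches live: a variable
// number of writes per input frame, plus the buffer wraparound.
// `phase` and `writeIdx` persist across blocks; `ratio` must not exceed R.
std::size_t planWrites(std::size_t numFrames, float ratio,
                       float& phase, std::size_t& writeIdx,
                       std::size_t outLen, WritePlan& plan) {
    assert(ratio > 0.f);
    const float inc = 1.f / ratio;  // input frames advanced per write
    std::size_t k = 0;
    for (std::size_t n = 0; n < numFrames; ++n) {
        while (phase < 1.f) {
            assert(k < plan.frac.size());  // capacity N*R must hold
            plan.frac[k] = phase;
            plan.src[k] = n;
            plan.dst[k] = writeIdx;
            ++k;
            if (++writeIdx == outLen) { writeIdx = 0; }  // wrap
            phase += inc;
        }
        phase -= 1.f;  // carry the remainder into the next input frame
    }
    plan.count = k;
    return k;
}
```

the point of the split is that this loop touches very little code and data, while the expensive math runs afterwards over `plan.count` contiguous entries.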

catfact / softcut

bigger optimizations #9