squarewave closed 1 year ago
This is very clever. I think I buy it.
I think the thing to do now is to fix the build errors, apply it to the rest of the routines and run the benchmark suite.
(Sorry for the radio silence on this. I'm intending to fix this patch and implement it for the rest of the routines, just haven't had the spare time yet.)
Closing due to inactivity.
If someone wants to pick this back up, I think I'd be open to it. I'd like to see these code paths split out into their own functions, since they're pretty beefy. Ideally, I'd also want to make sure our existing test coverage is good enough before we push on these.
This should be extended to other contexts if others can reproduce the gains I observed locally. See the comment changes for an explanation of what's going on here, but basically we can avoid some looping if we pay for one extra branch up front that checks whether the length is greater than the loop size. A similar optimization can be applied to the AVX2 case, and to memchr2 and friends.
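For anyone picking this up, here is a rough scalar sketch of the idea as I understand it from the description. It is not the SIMD code from the patch: `find_byte` and the `LOOP_SIZE` value are hypothetical names chosen for illustration, and the chunked loop only stands in for the vectorized one. The point is just the shape: one length check up front sends short inputs straight to the simple path, so they never enter the main loop.

```rust
/// Hypothetical loop width for this sketch; the real routines use a
/// width derived from the SIMD vector size.
const LOOP_SIZE: usize = 32;

/// Scalar stand-in for a vectorized search (hypothetical helper).
/// Returns the index of the first occurrence of `needle` in `haystack`.
fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    // The extra initial branch: inputs shorter than one loop iteration
    // skip the chunked loop entirely.
    if haystack.len() < LOOP_SIZE {
        return haystack.iter().position(|&b| b == needle);
    }

    let mut offset = 0;
    // Main loop: one "any match in this chunk?" test per LOOP_SIZE bytes.
    for chunk in haystack.chunks_exact(LOOP_SIZE) {
        if chunk.iter().any(|&b| b == needle) {
            // Only locate the exact position once a chunk is known to match.
            return chunk
                .iter()
                .position(|&b| b == needle)
                .map(|i| offset + i);
        }
        offset += LOOP_SIZE;
    }

    // Tail: fewer than LOOP_SIZE bytes remain after the main loop.
    haystack[offset..]
        .iter()
        .position(|&b| b == needle)
        .map(|i| offset + i)
}
```

Whether the extra branch pays off depends on how input lengths are distributed, which is why it would need to go through the benchmark suite before being applied to the other routines.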