Closed by jackmott 6 years ago
That looks about right. I think rounding isn't available on SSE2, though, so faster will probably be pretty slow on that benchmark. You're also using the chunk of the API which doesn't make any alignment/length assumptions, so that may introduce some additional overhead as well.
Once I crank it up to AVX2 it's exactly on par with mine; it's just ceil slowing it down on SSE2, so great work! It's neat to see all that iterator magic compile away to nothing.
Closing because it looks like I have it sorted, thanks!
I am doing some benchmarking of my own simd lib against faster and want to be sure I'm doing it correctly. I'm using criterion, replicating the "lots of 3s" example, as shown in this gist:
https://gist.github.com/jackmott/a0b8ca811d2cf2ecb97a35f0aee0a5c6
I'm using the default compilation settings which should be targeting SSE2 instructions for Faster, and I'm using the SSE2 settings in my library. Does this look like a fair comparison? Am I missing anything?
Also, how is ceil implemented for SSE2? I think it is slower than it needs to be, but I can't figure out where it happens in the faster source.
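For reference, here is a minimal sketch of one common way to get ceil on plain SSE2, where the SSE4.1 rounding instruction is unavailable. It is not taken from the faster source and may not match what faster does. The function name `ceil_sse2` and the array-in, array-out signature are hypothetical and chosen only for illustration. The idea is to truncate through an integer conversion, add 1.0 to the lanes where truncation rounded down, and keep the original value in lanes that are too large to have a fractional part.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Hypothetical helper: ceil of four f32 lanes using only SSE/SSE2 intrinsics.
#[cfg(target_arch = "x86_64")]
pub fn ceil_sse2(input: [f32; 4]) -> [f32; 4] {
    // SSE2 is part of the x86_64 baseline, so no runtime feature check is needed.
    unsafe {
        let x = _mm_loadu_ps(input.as_ptr());
        // Truncate toward zero: f32 -> i32 -> f32.
        let t = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));
        // Lanes where truncation rounded down (t < x) need +1.0.
        let needs_bump = _mm_cmplt_ps(t, x);
        let bumped = _mm_add_ps(t, _mm_and_ps(needs_bump, _mm_set1_ps(1.0)));
        // Copy the sign bit of x so inputs in (-1, 0] give -0.0 like f32::ceil.
        let bumped = _mm_or_ps(bumped, _mm_and_ps(x, _mm_set1_ps(-0.0)));
        // Lanes with |x| >= 2^23 are already integral (or inf/NaN): keep x there.
        let abs_x = _mm_and_ps(x, _mm_castsi128_ps(_mm_set1_epi32(0x7fff_ffff)));
        let small = _mm_cmplt_ps(abs_x, _mm_set1_ps(8_388_608.0));
        let r = _mm_or_ps(_mm_and_ps(small, bumped), _mm_andnot_ps(small, x));
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), r);
        out
    }
}

/// Scalar fallback so the sketch still compiles on non-x86_64 targets.
#[cfg(not(target_arch = "x86_64"))]
pub fn ceil_sse2(input: [f32; 4]) -> [f32; 4] {
    input.map(f32::ceil)
}
```

This costs several instructions per vector instead of the single rounding instruction available from SSE4.1 onward, which is consistent with ceil being the slow step when targeting SSE2.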