mystise closed 1 year ago
Thanks for doing this analysis! Note though that I think it's important not to look at unpack64x256 in isolation, but at where it is actually inlined and used.
In any case, things definitely could have changed since I wrote that comment. I probably have a bias toward sticking to the intrinsics because I feel like that gives us a more solid guarantee, but if a transmute is empirically and measurably faster, then I'd be happy to consider that.
https://github.com/BurntSushi/aho-corasick/blob/4e7fa3b85dd3a3ce882896f1d4ee22b1f271f0b4/src/packed/vector.rs#L91-L106
Minimal reproduction:
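A minimal sketch of the kind of comparison being discussed, and not necessarily the exact code behind the godbolt link. The intrinsic version is modeled on the linked `unpack64x256`; the names `unpack_intrinsics`, `unpack_transmute`, and `unpack_both` are hypothetical and only there for illustration. It assumes an x86_64 target and checks for AVX2 at runtime before calling either function.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Intrinsic-based unpacking, modeled on the linked `unpack64x256`.
/// (Hypothetical name; the body is an approximation, not a verbatim copy.)
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
pub unsafe fn unpack_intrinsics(a: __m256i) -> [u64; 4] {
    // Split the 256-bit vector into its low and high 128-bit halves.
    let lo = _mm256_extracti128_si256::<0>(a);
    let hi = _mm256_extracti128_si256::<1>(a);
    [
        // Low 64 bits of each half, then the upper 64 bits after
        // shifting the half right by 8 bytes.
        _mm_cvtsi128_si64(lo) as u64,
        _mm_cvtsi128_si64(_mm_srli_si128::<8>(lo)) as u64,
        _mm_cvtsi128_si64(hi) as u64,
        _mm_cvtsi128_si64(_mm_srli_si128::<8>(hi)) as u64,
    ]
}

/// Transmute-based unpacking: reinterpret the 32 bytes as four u64 lanes.
/// Note the `avx2` target feature is enabled here as well.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
pub unsafe fn unpack_transmute(a: __m256i) -> [u64; 4] {
    unsafe { core::mem::transmute::<__m256i, [u64; 4]>(a) }
}

/// Runs both versions on the same input.
/// Returns `None` when the target is not x86_64 or AVX2 is unavailable.
pub fn unpack_both(vals: [u64; 4]) -> Option<([u64; 4], [u64; 4])> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above,
            // and `vals` is exactly 32 readable bytes for the unaligned load.
            unsafe {
                let v = _mm256_loadu_si256(vals.as_ptr() as *const __m256i);
                return Some((unpack_intrinsics(v), unpack_transmute(v)));
            }
        }
    }
    None
}
```

Both functions should return the lanes in the same order, so `unpack_both` is only a sanity check that they agree. To compare codegen, build with optimizations and look at the assembly for the two `unpack_*` functions, then try removing `#[target_feature(enable = "avx2")]` from the transmute version.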
Resulting assembly (godbolt):
Link: https://rust.godbolt.org/z/jnr81hGnq
It would appear that, as of at least Rust 1.60.0 (the current MSRV), transmute optimizes to exactly the same asm as the intrinsics, but only when the avx2 feature is specifically enabled on the function. (With optimization disabled, the transmute versions are only a few lines longer, while the intrinsic version is over 70 lines of asm with multiple function calls, because the intrinsics are not inlined.)
Switching godbolt back to Rust 1.27.1 (the earliest stable build with SIMD intrinsics), the asm did use to differ between the two:
Interestingly, the modern optimized output is a very minor tweak of the version that was commented as being slower.
Not a bug or a necessary change, just something I found interesting.