Closed MinusKelvin closed 3 years ago
Interesting. That's actually very significant. I'm going to do a little poking around, since it does already compile to a version of the code that used bitcount rather than a direct iteration on -O3
with my current code.
Here's a compiler explorer of the two iteration strategies: https://godbolt.org/z/EjsnjW9EY
I should also mention that I ran the benchmarks without any LTO. Actually, without LTO, there are function calls to single-instruction functions like <u64 as EnumSetTypeRepr>::has_bit
. These functions aren't inlined without LTO since they're calls across compilation units, so they should probably have the #[inline(always)]
attribute to force it. The default for release builds is no LTO.
Here are the benchmark runtime changes I saw with full LTO: Occupancy \ Size | Small | Large | Huge |
---|---|---|---|
Full | +46% | +36% | +97% |
Half | -15% | -20% | +18% |
log2 | -15% | -67% | -84% |
Single | -12% | -92% | -97% |
Empty | -61% | -98% | -99% |
It might be desirable to switch iterator implementations based on the size of the set.
The only use case I can think of where one would expect full (or very full) enumsets in normal use is for x in EnumSet::<T>::all()
or something, so, I think the regression on full sets isn't that bad.
Even on platforms without native trailing_zeros, it doesn't seem to be the most hard operation to emulate, so older ARM chips (etc) shouldn't be an issue: https://godbolt.org/z/cP358edT6
I'll look into this more tomorrow, and see if the hybrid approach works with LTO on.
The primary benefit of this change is that it makes the overhead of iterating over the contents of an enum set proportional to its size, rather than the bit width of the set. This is a significant improvement for low-occupancy sets.
The performance improvements are slightly greater when compiled with the
bmi1
target feature, which enables the native trailing zero count instruction. I don't have a way to benchmark on non-x86_64 platforms, so I don't know how this affects performance on e.g. ARM.You can find my benchmarks here.