Use trailing zeros count for iteration

MinusKelvin commented 3 years ago

The primary benefit of this change is that it makes the overhead of iterating over the contents of an enum set proportional to its size, rather than the bit width of the set. This is a significant improvement for low-occupancy sets.

For full enum sets, I saw a change in runtime of +10% to +25% depending on the bit width of the set.
For half-full enum sets, I saw a change in runtime of -25% to -40% depending on the bit width of the set.
For single element and empty sets, I saw a change in runtime of -50% to -99% depending on the bit width of the set. Particularly in the case of sets with large bit widths, the time dropped from hundreds of nanoseconds to single-digit nanoseconds.

The performance improvements are slightly greater when compiled with the bmi1 target feature, which enables the native trailing zero count instruction. I don't have a way to benchmark on non-x86_64 platforms, so I don't know how this affects performance on e.g. ARM.

You can find my benchmarks here.

Lymia commented 3 years ago

Interesting. That's actually very significant. I'm going to do a little poking around, since it does already compile to a version of the code that used bitcount rather than a direct iteration on -O3 with my current code.

MinusKelvin commented 3 years ago

Here's a compiler explorer of the two iteration strategies: https://godbolt.org/z/EjsnjW9EY

I should also mention that I ran the benchmarks without any LTO. Actually, without LTO, there are function calls to single-instruction functions like <u64 as EnumSetTypeRepr>::has_bit. These functions aren't inlined without LTO since they're calls across compilation units, so they should probably have the #[inline(always)] attribute to force it. The default for release builds is no LTO.

Here are the benchmark runtime changes I saw with full LTO: Occupancy \ Size	Small	Large	Huge
Full	+46%	+36%	+97%
Half	-15%	-20%	+18%
log2	-15%	-67%	-84%
Single	-12%	-92%	-97%
Empty	-61%	-98%	-99%

It might be desirable to switch iterator implementations based on the size of the set.

Lymia commented 3 years ago

The only use case I can think of where one would expect full (or very full) enumsets in normal use is for x in EnumSet::<T>::all() or something, so, I think the regression on full sets isn't that bad.

Even on platforms without native trailing_zeros, it doesn't seem to be the most hard operation to emulate, so older ARM chips (etc) shouldn't be an issue: https://godbolt.org/z/cP358edT6

Lymia commented 3 years ago

I'll look into this more tomorrow, and see if the hybrid approach works with LTO on.

Lymia / enumset

Use trailing zeros count for iteration #26