Lymia / enumset

A library for compact bit sets containing enums.
Apache License 2.0
91 stars 35 forks source link

Use trailing zeros count for iteration #26

Closed MinusKelvin closed 3 years ago

MinusKelvin commented 3 years ago

The primary benefit of this change is that it makes the overhead of iterating over the contents of an enum set proportional to its size, rather than the bit width of the set. This is a significant improvement for low-occupancy sets.

The performance improvements are slightly greater when compiled with the bmi1 target feature, which enables the native trailing zero count instruction. I don't have a way to benchmark on non-x86_64 platforms, so I don't know how this affects performance on e.g. ARM.

You can find my benchmarks here.

Lymia commented 3 years ago

Interesting. That's actually very significant. I'm going to do a little poking around, since it does already compile to a version of the code that used bitcount rather than a direct iteration on -O3 with my current code.

MinusKelvin commented 3 years ago

Here's a compiler explorer of the two iteration strategies: https://godbolt.org/z/EjsnjW9EY

I should also mention that I ran the benchmarks without any LTO. Actually, without LTO, there are function calls to single-instruction functions like <u64 as EnumSetTypeRepr>::has_bit. These functions aren't inlined without LTO since they're calls across compilation units, so they should probably have the #[inline(always)] attribute to force it. The default for release builds is no LTO.

Here are the benchmark runtime changes I saw with full LTO: Occupancy \ Size Small Large Huge
Full +46% +36% +97%
Half -15% -20% +18%
log2 -15% -67% -84%
Single -12% -92% -97%
Empty -61% -98% -99%

It might be desirable to switch iterator implementations based on the size of the set.

Lymia commented 3 years ago

The only use case I can think of where one would expect full (or very full) enumsets in normal use is for x in EnumSet::<T>::all() or something, so, I think the regression on full sets isn't that bad.

Even on platforms without native trailing_zeros, it doesn't seem to be the most hard operation to emulate, so older ARM chips (etc) shouldn't be an issue: https://godbolt.org/z/cP358edT6

Lymia commented 3 years ago

I'll look into this more tomorrow, and see if the hybrid approach works with LTO on.