JacksonAllan opened this issue 2 months ago
Thank you so much! This makes a lot of sense. According to the commit history, I haven't touched khash.h for six years. I have been mostly using khashl in my recent projects, though most users are still on the older khash.
I will incorporate your improvements at some point, along with a fix to the missing malloc checks you mentioned in #5. Thanks again!
BTW, you can now change the load factor of khashl at compile time by defining `kh_max_count` before inclusion, like:

```c
#define kh_max_count(cap) ((cap) * 0.85)
#include "khashl.h"
```
Nonetheless, plain hash tables like khashl may not work well at high load factor anyway.
Hash tables that store dense metadata in a separate array can usually harness compiler intrinsics that correspond to bit-scan instructions to scan for the next occupied bucket many buckets at a time. Specifically, we can scan a group of `64 / bits_of_metadata_per_bucket` buckets at once in the ideal case that the starting position is the beginning of a bucket group. This technique is used, for example, by absl, Martinus' robin_hood, and Verstable/CC (which can sadly only scan four buckets at a time). Because khashl uses only one bit of metadata per bucket, it seems like the perfect candidate for this optimization.

I've created a proof of concept for GCC and Clang on little-endian architectures here. The repository is private, but I've invited you to it. Here are the results for a `uint32_t`-key, `uint32_t`-value table, displayed in graph form courtesy of Google Sheets. (The horizontal axis also corresponds to the number of keys in the table.)
Evidently, the gains are pretty massive. When the table is almost empty, the intrinsics-accelerated version is about 10 times faster. At a load factor of 0.5, it’s about twice as fast. The gains taper off at about 0.75, which is coincidentally khashl’s default maximum load factor.
Implementing this change involves a few steps:

- Switching the `used` array from 32-bit to 64-bit words. This change isn't strictly necessary, but it allows us to scan up to 64 buckets, rather than 32 buckets, at a time. In practice, the performance difference would probably only be apparent in the case that the current load factor is extremely low and many 32-bucket groups are entirely vacant.
- Reworking iteration to use the scanning, stopping it at `kh_end`. (A less performant option would be to manually check whether we've reached the end of the metadata array and, if so, return an end iterator.)

If you're interested, I could probably do these things and make a pull request. Testing on a big-endian machine is pretty much impossible in this age, though.
khash (as opposed to khashl) could, I think, also benefit from this optimization, although I'm not sure whether you’re even still developing it (in parallel with khashl).