Closed: kg closed this issue 1 day ago
Tagging subscribers to this area: @dotnet/area-system-collections See info in area-owners.md if you want to be subscribed.
If we go with this, we should validate the perf on real-world workload (e.g. Bing) by measuring total memory and inclusive time taken by Dictionary before/after.
@eiriktsarpalis / @kg -- I suggest the two of you connect to chat about this and capture notes for feasibility and potential value, optionally including @tannergooding in the conversation as well.
Many thanks to @eiriktsarpalis and @tannergooding for their feedback offline - it appears to be impossible to vectorize SCG.Dictionary at this time while maintaining all of its invariants. Specifically, enumeration over an SCG Dictionary is fully ordered as long as you never remove or replace any items, and maintaining this invariant in a vectorized dictionary is so expensive that the vectorization stops being worthwhile.
Going to close this in favor of a future proposal for a new 'unordered dictionary' container that is vectorized and drops that invariant. If anyone disagrees feel free to reopen :)
Proposal: Vectorized System.Collections.Generic.Dictionary<K, V>
See https://github.com/dotnet/runtime/issues/107830 for some context/early measurements, https://github.com/kg/SimdDictionary/ for a prototype, and https://engineering.fb.com/2019/04/25/developer-tools/f14/?r=1 for a blog post describing a similar hashmap design.
Background and motivation
As part of work to improve startup time for the Mono runtime, I introduced a new vectorized pure-C hashtable implementation called dn_simdhash. We migrated many uses of Mono's old GHashTable to this new container, which delivered sizable reductions in both CPU usage and memory usage. During and after this work, I've had multiple people ask about the feasibility of doing similar work to vectorize the CoreCLR native C++ hash containers or vectorize SCG.Dictionary. I believe that we can do the latter and get improvements to throughput and memory usage, which may also translate to reductions in startup time.
API Proposal / Usage
No public API changes, unless we decide to expose new functionality as a part of this work. I can't think of anything I'd expose offhand.
Enhancements to InlineArray could enable higher performance for this container, but I think that is best left for a separate proposal.
Risks
BDN measurements for 4096-element <Int64, Int64> (on Ryzen unless noted): [benchmark table omitted]

High-level design and notes
This dictionary is vectorized by splitting items into 14-entry buckets, where each bucket contains a Vector128 of 'hash suffixes' (8-bit slices of the item's hashcode) alongside key-value pairs in an InlineArray. Once a bucket is selected based on the modulus of the hash and the bucket count, we do a vectorized scan of the 14 suffixes to identify the most likely matching item in three instructions: vpcmpeqb, vpmovmskb, tzcnt. With an optimal hash, the odds of a suffix collision (more than one match in a bucket) are approximately 8% and the odds of a false positive are approximately 1%. This means that for a non-degraded table, most lookups will exit after a single vectorized suffix check or after calling Equals on 1-2 items. The 2 remaining bytes in the suffix table are used to store the item count and the 'cascade count', respectively. Cascade counts are explained further below.

Unlike the current SCG.Dictionary, all data lives in the single buckets array, which delivers better cache locality for scans. We can locate the appropriate bucket with a single imul once the hash has been modulus'd (which is a single bitwise and for power-of-two bucket counts, and uses the existing FastMod for prime counts). Scanning buckets after this uses nothing other than add/inc operations and address comparisons.

Small numbers of hash collisions have virtually no negative impact, as each bucket can contain up to 14 items with the same hash. Once a bucket fills, it 'cascades' into the next bucket, which is efficiently tracked (for up to 254 cascaded items) with virtually no degradation. Once a single bucket cascades 255 or more times, it will remain degraded until the table is rehashed. (It's possible to easily detect this and grow the table in response.) In this degraded state, failed lookups have to search neighboring buckets, since the 'cascaded' flag will remain set until a bucket is cleared.
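The per-bucket layout and suffix scan described above can be modeled in Python (a scalar sketch with hypothetical names; the actual prototype is C# and replaces the inner loop with vpcmpeqb/vpmovmskb/tzcnt over a Vector128):

```python
BUCKET_CAPACITY = 14  # 14 suffix bytes + count byte + cascade byte = 16 bytes

class Bucket:
    def __init__(self):
        self.suffixes = [0] * BUCKET_CAPACITY  # models the Vector128 of suffixes
        self.pairs = [None] * BUCKET_CAPACITY  # models the InlineArray of pairs
        self.count = 0                         # one of the 2 remaining suffix bytes
        self.cascade_count = 0                 # the other: overflows past this bucket

def suffix_of(hashcode):
    # An 8-bit slice of the item's 32-bit hashcode (which slice is an
    # implementation detail; the high byte is just one plausible choice).
    return (hashcode >> 24) & 0xFF

def find_in_bucket(bucket, suffix, key):
    # Scalar stand-in for the vectorized scan: compare all 14 suffixes at once
    # (vpcmpeqb), extract a match mask (vpmovmskb), then walk the set bits
    # (tzcnt), calling the full Equals only on candidate slots.
    for i in range(bucket.count):
        if bucket.suffixes[i] == suffix:
            k, v = bucket.pairs[i]
            if k == key:                # the full Equals check
                return v
    return None

def insert_into_bucket(bucket, suffix, key, value):
    if bucket.count >= BUCKET_CAPACITY:
        return False                    # caller cascades into the next bucket
    bucket.suffixes[bucket.count] = suffix
    bucket.pairs[bucket.count] = (key, value)
    bucket.count += 1
    return True
```

Because suffixes are only 8 bits, a suffix match is a strong hint rather than a guarantee, which is why the occasional false positive (~1%) still requires the Equals confirmation.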
It is theoretically possible to rehash this container in-place without a temporary array, at the cost of some buckets becoming erroneously degraded.
Null checks and checks for length-0 are optimized out by having empty dictionaries share a common 1-element empty Buckets array.
Find, Insert and Remove operations can be expressed in terms of a common FindKeyInBucket static interface method and a LoopingBucketEnumerator struct, which means e.g. TryInsert is a total of 55 lines of C#. Portions of this search logic are not dependent on the types of K or V, which means they can be shared between instantiations.

Once a matching pair has been found in a bucket (for find/remove operations) or a bucket with space has been found (for inserts), we can complete the relevant operation in one step without additional address calculations (imul/idiv/mod, etc). If we cascaded out of previous buckets during an insert/remove, we scan backwards to update their cascade counters to keep the table in a consistent state, but we don't pay this cost when there are no collisions.

Clear, copy and enumeration operations have to scan every bucket, then check its count to determine whether to touch its contents. This can produce very different performance from SCG depending on the distribution of items and the number of items.
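A sketch of how the looping probe described above might work, in Python with hypothetical names (the real FindKey/LoopingBucketEnumerator lives in the C# prototype): the key point is that a failed match in a bucket whose cascade counter is zero terminates the search immediately.

```python
from dataclasses import dataclass, field

@dataclass
class Bucket:
    suffixes: list = field(default_factory=list)  # parallel to pairs
    pairs: list = field(default_factory=list)     # (key, value) tuples
    cascade_count: int = 0                        # items that overflowed past here

def find_key(buckets, key, hashcode):
    suffix = (hashcode >> 24) & 0xFF
    index = hashcode % len(buckets)               # the 'home' bucket
    for _ in range(len(buckets)):                 # LoopingBucketEnumerator analogue
        bucket = buckets[index]
        for s, (k, v) in zip(bucket.suffixes, bucket.pairs):
            if s == suffix and k == key:          # suffix filter, then Equals
                return v
        if bucket.cascade_count == 0:
            return None                           # nothing ever overflowed past
                                                  # this bucket: key cannot exist
        index = (index + 1) % len(buckets)        # wrap to the next bucket
    return None
```

This is also why a bucket that cascaded 255+ times stays degraded: once the counter saturates, failed lookups can no longer prove the key is absent without probing neighbors.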
Because of the 14-wide buckets and vectorized handling of hash collisions, we can omit caching each item's HashCode in the buckets (which makes each item smaller), and we can allow the load factor to approach 100% without meaningfully degrading lookup performance. This results in reduced memory usage.
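As a back-of-envelope illustration of the memory claim (my arithmetic, not numbers from the proposal; assuming a 64-bit runtime, <Int64, Int64>, and both tables at 100% load):

```python
# SCG.Dictionary stores, per entry: int hashCode + int next + key + value,
# plus one int slot per entry in the parallel buckets (index) array.
key_size = value_size = 8                          # Int64, Int64
scg_per_item = 4 + 4 + key_size + value_size + 4   # = 28 bytes

# The vectorized design stores a 16-byte suffix/count/cascade header per
# 14-item bucket plus the pairs themselves; hashcodes are not cached.
bucket_bytes = 16 + 14 * (key_size + value_size)   # = 240 bytes
simd_per_item = bucket_bytes / 14                  # ~17.1 bytes

print(scg_per_item, round(simd_per_item, 1))
```

Under these assumptions the vectorized layout needs roughly 17 bytes per item versus roughly 28, before accounting for any padding or for SCG's tendency to carry unused capacity between resizes.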
Because the entire container's state is a Count field, a GrowAtCount field, a Buckets array, and potentially a FastMod divisor, concurrent modifications are less hazardous than in SCG.Dictionary. With power-of-two bucket counts instead of prime bucket counts, there is no concurrent-modification hazard at all for lookup operations. (Prime bucket counts expose the hazard of FastMod producing an out-of-bounds bucket index; SCG.Dictionary doesn't currently handle this, so the array access would throw.)

This container does not use any sort of freelist when performing removes or inserts; values are stored in the appropriate bucket instead of sequentially (starting from 0) in a separate entries array. This makes certain enumeration operations potentially slower, e.g. Clear or CopyTo on a mostly-empty table.
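The FastMod referenced above is .NET's division-free modulus for prime bucket counts, based on Lemire's fastmod technique. A Python sketch of that technique (my transcription, using the exact 128-bit form) alongside the power-of-two alternative:

```python
MASK64 = (1 << 64) - 1

def get_fast_mod_multiplier(divisor):
    # Precomputed once per (prime) bucket count: ceil(2**64 / divisor).
    return (MASK64 // divisor) + 1

def fast_mod(value, divisor, multiplier):
    # Lemire's fastmod: two multiplies and a shift instead of a hardware div.
    # Correct for value < 2**32 and divisor < 2**32.
    lowbits = (multiplier * value) & MASK64   # 64-bit wrapping multiply
    return (lowbits * divisor) >> 64          # high 64 bits of 128-bit product

def pow2_mod(value, bucket_count):
    # With a power-of-two bucket count the modulus is a single bitwise and,
    # and the mask can be derived from the array's own length, so the index
    # is always in bounds. FastMod's cached multiplier is a separate field
    # that can get out of sync with the array under concurrent modification,
    # which is the out-of-bounds hazard mentioned above.
    return value & (bucket_count - 1)
```

This illustrates why power-of-two counts make racing lookups benign: the index derivation has no second piece of state that can tear relative to the array.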
Potential runtime/type system improvements
An improvement to InlineArray would allow intelligently sizing buckets so their size is always a power of two. This would turn some imuls in this container into bitshifts, and improve cache line alignment for buckets. The most obvious way to do this would be an enhanced version of InlineArray where you request 'as many items as can fit into X bytes' or, even better, '1 <= N <= 14 items to produce the best-aligned structure'. It's possible to manually align buckets for the most common key/value size (8 bytes, for reference-typed keys/values or int64s) but it's not possible to generically specialize the presence/size of padding, so that optimization doesn't generalize in the current type system. This optimization is possible in dn_simdhash and F14 at present.