dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[Proposal] Vectorized System.Collections.Generic.Dictionary<K, V> #108098

Closed - kg closed this issue 1 day ago

kg commented 1 month ago

Proposal: Vectorized System.Collections.Generic.Dictionary<K, V>

See https://github.com/dotnet/runtime/issues/107830 for some context/early measurements, https://github.com/kg/SimdDictionary/ for a prototype, and https://engineering.fb.com/2019/04/25/developer-tools/f14/?r=1 for a blog post describing a similar hashmap design.

Background and motivation

As part of work to improve startup time for the Mono runtime, I introduced a new vectorized pure-C hashtable implementation called dn_simdhash. We migrated many uses of Mono's old GHashTable to this new container, which delivered sizable reductions in both CPU usage and memory usage. During and after this work, I've had multiple people ask about the feasibility of doing similar work to vectorize the CoreCLR native C++ hash containers or vectorize SCG.Dictionary. I believe that we can do the latter and get improvements to throughput and memory usage, which may also translate to reductions in startup time.

API Proposal / Usage

No public API changes, unless we decide to expose new functionality as a part of this work. I can't think of anything I'd expose offhand.

Enhancements to InlineArray could enable higher performance for this container, but I think that is best left for a separate proposal.

Risks

BDN measurements for 4096-element <Int64, Int64> (on Ryzen unless noted)

High-level design and notes

This dictionary is vectorized by splitting items into 14-entry buckets, where each bucket contains a Vector128 of 'hash suffixes' (8-bit slices of each item's hash code) alongside the key-value pairs in an InlineArray. Once a bucket is selected based on the hash modulo the bucket count, we do a vectorized scan of the 14 suffixes to identify the most likely matching item in three instructions: vpcmpeqb, vpmovmskb, tzcnt. With an optimal hash, the odds of a suffix collision (more than one match in a bucket) are approximately 8% and the odds of a false positive are approximately 1%. This means that for a non-degraded table, most lookups will exit after a single vectorized suffix check or after calling Equals on 1-2 items. The 2 remaining bytes in the suffix vector store the bucket's item count and 'cascade count', respectively; cascade counts are explained further below. For example:

internal struct Pair {
    public K Key;
    public V Value;
}

[InlineArray(14)]
internal struct InlinePairArray {
    public Pair Pair0;
}

internal struct Bucket {
    // Lanes 14 and 15 of the suffix vector hold the bucket's item count and
    // cascade count, respectively; the first 14 lanes are hash suffixes.
    public const nuint CountSlot = 14, CascadeSlot = 15;

    public Vector128<byte> Suffixes;
    public InlinePairArray Pairs;

    public ref byte Count =>
        ref Unsafe.AddByteOffset(ref Unsafe.As<Vector128<byte>, byte>(ref Unsafe.AsRef(in Suffixes)), CountSlot);

    public ref byte CascadeCount =>
        ref Unsafe.AddByteOffset(ref Unsafe.As<Vector128<byte>, byte>(ref Unsafe.AsRef(in Suffixes)), CascadeSlot);
}
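The vectorized suffix scan described above can be sketched as follows. The signature is illustrative (the prototype's real method also takes the bucket count, so matches in the two bookkeeping lanes can be ignored), but the three-instruction shape is exactly as described:

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

internal static class SuffixScan {
    // Returns the index of the first suffix match in the bucket, or 32 if no
    // lane matched. On x86-64 this compiles down to vpcmpeqb + vpmovmskb + tzcnt.
    public static int FindSuffixInBucket(Vector128<byte> suffixes, byte suffix) {
        // Compare all 16 lanes against the search suffix at once (vpcmpeqb).
        Vector128<byte> matches = Vector128.Equals(suffixes, Vector128.Create(suffix));
        // Collapse the per-lane results into a 16-bit mask (vpmovmskb).
        uint mask = Vector128.ExtractMostSignificantBits(matches);
        // Index of the lowest set bit (tzcnt); TrailingZeroCount(0) == 32.
        return BitOperations.TrailingZeroCount(mask);
    }
}
```

Note that callers must treat any result >= the bucket's item count as "no match", since the count and cascade-count lanes could coincidentally equal the search suffix.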

Unlike the current SCG.Dictionary, all data lives in a single buckets array, which delivers better cache locality for scans. We can locate the appropriate bucket with a single imul once the hash has been reduced modulo the bucket count (a single bitwise AND for power-of-two bucket counts; the existing FastMod for prime counts). Scanning buckets after this uses nothing other than add/inc operations and address comparisons.
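For reference, the prime-count path can reuse the same FastMod trick SCG.Dictionary's HashHelpers uses today, reproduced here as a sketch:

```csharp
internal static class FastModHelper {
    // Precomputed once per divisor, exactly as in SCG's HashHelpers.
    public static ulong GetFastModMultiplier(uint divisor) =>
        ulong.MaxValue / divisor + 1;

    // Computes value % divisor without a div instruction; valid for divisors
    // that fit in 31 bits (which all of SCG's prime table sizes do).
    public static uint FastMod(uint value, uint divisor, ulong multiplier) =>
        (uint)(((((multiplier * value) >> 32) + 1) * divisor) >> 32);
}
```

The power-of-two path needs no multiplier at all: `bucketIndex = hash & (bucketCount - 1)`.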

Small numbers of hash collisions have virtually no negative impact, as each bucket can contain up to 14 items with the same hash. Once a bucket fills, it 'cascades' into the next bucket, which is efficiently tracked (for up to 254 cascaded items) with virtually no degradation. Once a single bucket cascades >= 255 times, it will permanently degrade until the table is rehashed. (It's possible to easily detect this and grow the table in response.) This degraded state causes failed lookups to have to search neighboring buckets, since the 'cascaded' flag will remain set until a bucket is cleared.

It is theoretically possible to rehash this container in-place without a temporary array, at the cost of some buckets becoming erroneously degraded.

Null checks and checks for length-0 are optimized out by having empty dictionaries share a common 1-element empty Buckets array.
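A minimal illustration of that trick (field names are illustrative, and the Bucket struct is reduced to its two bookkeeping bytes): every empty dictionary points at the same one-bucket array, so the lookup path scans a single always-empty bucket, finds no match, sees CascadeCount == 0, and exits, with no null or zero-length branch anywhere:

```csharp
using System;

internal sealed class EmptyBucketsDemo {
    // Simplified bucket: just the two bookkeeping bytes.
    internal struct Bucket { public byte Count; public byte CascadeCount; }

    // Shared by every empty dictionary; its single bucket never holds items.
    private static readonly Bucket[] s_emptyBuckets = new Bucket[1];

    private Bucket[] _buckets = s_emptyBuckets;

    public bool ContainsAnything() {
        // No '_buckets == null' or '_buckets.Length == 0' check needed.
        foreach (ref readonly var bucket in _buckets.AsSpan())
            if (bucket.Count != 0)
                return true;
        return false;
    }
}
```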

Find, Insert and Remove operations can be expressed in terms of a common FindKeyInBucket static interface method and a LoopingBucketEnumerator struct, which means that, for example, TryInsert is a total of 55 lines of C#. Portions of this search logic are not dependent on the types of K or V, which means they can be shared between instantiations. Example FindKey implementation:

internal ref Pair FindKey<TKeySearcher> (K key, IEqualityComparer<K>? comparer)
    where TKeySearcher : struct, IKeySearcher
{
    var hashCode = TKeySearcher.GetHashCode(comparer, key);
    // It's important to construct the enumerator before computing the suffix, to avoid stalls
    var enumerator = new LoopingBucketEnumerator(this, hashCode);
    // Pre-creating the search vector here using Vector128.Create is possible, but it gets stored into xmm6 and causes
    //  an unconditional stack spill at entry, which outweighs any performance benefit from creating it early.
    // vpbroadcastb is 1op/1c latency on most x86-64 chips, so doing it early isn't very useful either.
    // Unfortunately in some cases Vector128.Create(suffix) gets LICM'd into xmm6 too... https://github.com/dotnet/runtime/issues/108092
    var suffix = GetHashSuffix(hashCode);
    do {
        // Eagerly load the bucket count early for pipelining purposes, so we don't stall when using it later.
        int bucketCount = enumerator.bucket.Count,
            startIndex = FindSuffixInBucket(ref enumerator.bucket, suffix, bucketCount);
        ref var pair = ref TKeySearcher.FindKeyInBucket(ref enumerator.bucket, startIndex, bucketCount, comparer, key, out _);
        if (Unsafe.IsNullRef(ref pair)) {
            if (enumerator.bucket.CascadeCount == 0)
                return ref Unsafe.NullRef<Pair>();
        } else
            return ref pair;
    } while (enumerator.Advance());

    return ref Unsafe.NullRef<Pair>();
}
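An index-based simplification of the LoopingBucketEnumerator used above may make the traversal clearer. This is a sketch, not the prototype's implementation (which holds a ref to the current bucket rather than an index), and it assumes a power-of-two bucket count so the modulus is a mask:

```csharp
internal struct LoopingBucketEnumerator {
    private readonly int _initialIndex;
    private readonly int _bucketCount;
    private int _index;

    public int Index => _index;

    public LoopingBucketEnumerator(int bucketCount, uint hashCode) {
        _bucketCount = bucketCount;
        // Power-of-two bucket counts reduce the modulus to a single bitwise AND.
        _initialIndex = _index = (int)(hashCode & (uint)(bucketCount - 1));
    }

    // Steps to the next bucket, wrapping at the end of the array; returns false
    // once every bucket has been visited, so even a full table can't loop forever.
    public bool Advance() {
        _index = _index + 1 < _bucketCount ? _index + 1 : 0;
        return _index != _initialIndex;
    }
}
```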

Once a matching pair has been found in a bucket (for find/remove operations) or a bucket with space has been found (for inserts) we can complete the relevant operation in one step without additional address calculations (imul/idiv/mod, etc). If we cascaded out of previous buckets during insert/remove, we scan backwards to update their cascade counters to keep the table in a consistent state, but we don't pay this cost when there are no collisions.
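A hedged sketch of that backward fix-up, over a plain byte array of cascade counters. The sticky-255 saturation follows the degradation behavior described earlier; the exact walk is inferred from this description, not verified against the prototype's source:

```csharp
internal static class CascadeFixup {
    // After an item lands in finalIndex, every bucket we cascaded out of
    // (initialIndex up to, but not including, finalIndex) gets its counter
    // adjusted. 255 is sticky: once a bucket degrades, it stays degraded
    // until the table is rehashed.
    public static void AdjustCascadeCounts(byte[] cascadeCounts, int initialIndex, int finalIndex, bool increase) {
        for (int i = initialIndex; i != finalIndex; i = (i + 1) % cascadeCounts.Length) {
            ref byte count = ref cascadeCounts[i];
            if (increase) {
                if (count < 255)
                    count++;
            } else if (count > 0 && count < 255) {
                count--;
            }
        }
    }
}
```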

Clear, copy and enumeration operations have to scan every bucket, then check its count to determine whether to touch its contents. This can produce very different performance from SCG depending on the distribution of items and number of items.

Because of the 14-wide buckets and the vectorized handling of hash collisions, we can omit caching each item's hash code in the buckets (which makes each item smaller), and we can allow the load factor to approach 100% without meaningfully degrading lookup performance. This results in reduced memory usage.
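A back-of-envelope illustration for <long, long>, assuming SCG.Dictionary's current Entry layout (hashCode, next, key, value, plus the parallel int buckets array) and the Bucket layout shown earlier; these are rough per-entry figures at full load, not measurements:

```csharp
internal static class MemoryComparison {
    // SCG entry: int hashCode + int next + long key + long value, plus roughly
    // one int per entry in the parallel buckets array at a 1.0 load factor.
    public const int ScgBytesPerEntry = 4 + 4 + 8 + 8 + 4; // 28

    // Vectorized bucket: one 16-byte suffix vector amortized over up to
    // fourteen 16-byte pairs; no per-item hash code is cached.
    public const int BucketBytes = 16 + 14 * (8 + 8); // 240

    public static double VectorizedBytesPerEntry => BucketBytes / 14.0; // ~17.1 when buckets are full
}
```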

Because the entire container's state is a Count field, a GrowAtCount field, a Buckets array, and potentially a FastMod divisor, concurrent modifications are less hazardous than in SCG.Dictionary. With power-of-two bucket counts instead of prime bucket counts, there is no concurrent modification hazard at all for lookup operations. (Prime bucket counts expose the hazard of FastMod producing an out-of-bounds bucket index; SCG.Dictionary doesn't currently handle this so the array access would throw.)

This container does not use any sort of freelist when performing removes or inserts; values are stored in the appropriate bucket rather than sequentially, starting from 0, in a separate entries array. This makes certain enumeration operations potentially slower, e.g. Clear or CopyTo when the table is mostly empty.

Potential runtime/type system improvements

An improvement to InlineArray would allow intelligently sizing buckets so their size is always a power of two. This would turn some imuls in this container into bitshifts, and improve cache line alignment for buckets. The most obvious way to do this would be an enhanced version of InlineArray where you request 'as many items as can fit into X bytes' or, even better, '1 <= N <= 14 items to produce the best-aligned structure'. It's possible to manually align buckets for the most common key/value size (8 bytes, for reference-typed keys/values or int64s), but it's not possible to generically specialize the presence/size of padding, so that optimization doesn't generalize in the current type system. This optimization is possible in dn_simdhash and F14 at present.

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @dotnet/area-system-collections. See info in area-owners.md if you want to be subscribed.

jkotas commented 1 month ago

If we go with this, we should validate the perf on real-world workload (e.g. Bing) by measuring total memory and inclusive time taken by Dictionary before/after.

jeffhandley commented 3 weeks ago

@eiriktsarpalis / @kg -- I suggest the two of you connect to chat about this and capture notes for feasibility and potential value, optionally including @tannergooding in the conversation as well.

kg commented 1 day ago

Many thanks to @eiriktsarpalis and @tannergooding for their feedback offline - it appears to be impossible to vectorize SCG.Dictionary at this time while maintaining all of its invariants. Specifically, enumeration over an SCG Dictionary is fully ordered as long as you never remove or replace any items, and maintaining this invariant in a vectorized dictionary is so expensive that the vectorization stops being worthwhile.

Going to close this in favor of a future proposal for a new 'unordered dictionary' container that is vectorized and drops that invariant. If anyone disagrees feel free to reopen :)