dotnet / runtime


What are the acceptance/success criteria for an improved S.C.G.Dictionary? #107830

Open kg opened 2 months ago

kg commented 2 months ago

I'm looking into vectorizing S.C.G.Dictionary using Vector128. I'm at the point of figuring out how much work it would take to prototype it in-tree, as I have a basic standalone prototype (based on dn_simdhash) with good microbenchmark numbers. Some questions that come to mind are quoted in the replies below.

EDIT: Adding a link to the F14 blog post, which describes a similar 14-wide vectorized dictionary: https://engineering.fb.com/2019/04/25/developer-tools/f14/?r=1 It explains a lot of the concepts at work here, plus the tradeoffs, quite well. A lot of what they do isn't possible in the current .NET type system.


kg commented 2 months ago
Current x86-64 measurements from the standalone prototype, for reference:

| Type | Method | Mean | Gen0 | Allocated |
|------|--------|-----:|-----:|----------:|
| BCLCollisions | AddSameRepeatedlyThenClear | 8.048 us | - | - |
| SimdCollisions | AddSameRepeatedlyThenClear | 10.306 us | - | - |
| BCLCollisions | FillThenClear | 2,100.095 us | - | 2 B |
| SimdCollisions | FillThenClear | 1,834.852 us | - | 1 B |
| BCLCollisions | FindExistingWithCollisions | 3,126.392 us | - | 2 B |
| SimdCollisions | FindExistingWithCollisions | 2,655.567 us | - | 2 B |
| BCLCollisions | FindMissingWithCollisions | 4,120.161 us | - | 3 B |
| SimdCollisions | FindMissingWithCollisions | 3,518.738 us | - | 2 B |
| BCLInsertion | ClearThenRefill | 55.339 us | - | - |
| SimdInsertion | ClearThenRefill | 92.654 us | - | - |
| BCLInsertion | InsertExisting | 32.139 us | - | - |
| SimdInsertion | InsertExisting | 24.198 us | - | - |
| BCLIterate | EnumerateKeys | 13.041 us | 3.8910 | 65632 B |
| SimdIterate | EnumerateKeys | 21.884 us | 3.9063 | 65696 B |
| BCLIterate | EnumeratePairs | 84.610 us | 41.6260 | 131198 B |
| SimdIterate | EnumeratePairs | 76.850 us | 41.6260 | 131222 B |
| BCLIterate | EnumerateValues | 13.124 us | 3.8910 | 65632 B |
| SimdIterate | EnumerateValues | 20.892 us | 3.9063 | 65696 B |
| BCLLookup | FindExisting | 37.693 us | NA | NA |
| SimdLookup | FindExisting | 30.653 us | NA | NA |
| BCLLookup | FindMissing | 30.034 us | NA | NA |
| SimdLookup | FindMissing | 21.917 us | NA | NA |
| BCLRemoval | RemoveItemsThenRefill | 93.145 us | - | - |
| SimdRemoval | RemoveItemsThenRefill | 100.580 us | - | - |
| BCLRemoval | RemoveMissing | 24.898 us | - | - |
| SimdRemoval | RemoveMissing | 29.243 us | - | - |
| BCLResize | CreateDefaultSizeAndFill | 294.347 us | 222.1680 | 942129 B |
| SimdResize | CreateDefaultSizeAndFill | 276.685 us | 213.8672 | 983853 B |

It roughly matches or beats the BCL on most of the measurements I came up with. I'm not sure why removal is slower; I specifically haven't micro-optimized that method in the prototype. To me, the 20-30% improvement on TryGetValue without collisions and the 15% improvement on TryGetValue with collisions feel like they could justify the time investment.

kg commented 2 months ago

cc @stephentoub, since he has provided a lot of useful guidance so far.

EgorBo commented 2 months ago

Interesting! Curious to see the outcomes; my five cents on the topic:

  1. I quite often see Dictionary's methods on hot paths (in first-party services), e.g. lookups where Key is string and the comparer is either the default or OrdinalIgnoreCase. grep.app suggests that more than 75% of all dictionaries have string as the Key. That doesn't mean we shouldn't optimize it for other types, though.
  2. It seems that Dictionary<> may benefit a lot from generic specialization for reference types; e.g. when we do dictionary["key"] we end up calling a shared FindValue(__Canon) method, and we currently never try to devirtualize EqualityComparer.Default.X for it.
  3. The string hash code could potentially be replaced by XxHash3 (https://github.com/dotnet/runtime/issues/85206); maybe we can even remove the "NonRandomizedHashCode until it collides" hack.
  4. Perhaps some tree-like data structure for collisions instead of O(N)?

Still, the SIMD generic approach looks promising, not sure I understand it fully yet πŸ™‚

> but this made the code generated by ryujit much worse compared to manually duplicating logic for each operation.

I think it would definitely be nice to study and improve this in the JIT if possible/reasonable. Is there a minimal repro for the JIT issue?

> Is it reasonable to assume basic Vector128 support on most targets, and accept that a scalar implementation might be slightly slower than the current one?

For CoreCLR/NativeAOT we have *-arm32, linux-x86, loongarch64 and risc-v64 which don't accelerate it (plus, the community NativeAOT-LLVM target). I presume some Mono targets don't accelerate it either.

jkotas commented 2 months ago

> i.e. changing which comparer is used for strings once buckets become unbalanced; detecting concurrent modification; the Key and Values collections being classes instead of structs.

Yes, it is important for security reasons - defense in depth against DoS attacks.

kg commented 2 months ago

> Interesting! Curious to see the outcomes; my five cents on the topic:
>
> 1. I quite often see Dictionary's methods on hot paths (in first-party services), e.g. lookups where Key is string and the comparer is either the default or OrdinalIgnoreCase. grep.app suggests that more than 75% of all dictionaries have string as the Key. That doesn't mean we shouldn't optimize it for other types, though.

It would be interesting to consider a special-cased implementation of everything specifically for string, if string is such a high bang-for-buck compared to other types; i.e. we keep the __Canon-based implementation for every reference type except string.

> 2. It seems that Dictionary<> may benefit a lot from generic specialization for reference types; e.g. when we do dictionary["key"] we end up calling a shared FindValue(__Canon) method, and we currently never try to devirtualize EqualityComparer.Default.X for it.

I'm guessing this is partially on the RyuJIT side? I saw in the comments for S.C.G.Dictionary that EqualityComparer.Default calls do devirtualize for value types if you guard them with IsValueType. It makes sense that generic specialization for reference types could help, indeed, since there are so many dictionary instantiations that the shared __Canon method won't devirtualize the comparer's Equals/GetHashCode calls.
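For context, a minimal sketch of the IsValueType guard pattern mentioned above (the method shape and names are mine, not the actual S.C.G.Dictionary code): value-type instantiations get their own JIT-specialized bodies, so the guarded call devirtualizes and inlines, while reference types share one __Canon body where the call stays virtual.

```csharp
using System.Collections.Generic;

static class DevirtSketch
{
    // Hypothetical helper: for value-type TKey the JIT compiles a separate
    // specialized body per instantiation, so EqualityComparer<TKey>.Default
    // has a statically known type inside the guard and Equals devirtualizes.
    // For reference-type TKey all instantiations share one __Canon body and
    // the call remains an interface dispatch.
    public static bool KeysEqual<TKey>(TKey a, TKey b, IEqualityComparer<TKey>? comparer)
    {
        if (typeof(TKey).IsValueType && comparer == null)
            return EqualityComparer<TKey>.Default.Equals(a, b); // devirtualized + inlined
        return (comparer ?? EqualityComparer<TKey>.Default).Equals(a, b); // virtual call
    }
}
```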

> 3. The string hash code could potentially be replaced by XxHash3 (Is it time to replace Marvin? #85206); maybe we can even remove the "NonRandomizedHashCode until it collides" hack.

It sounds like this might be worth investigating separately, since it would improve string perf even for the existing Dictionary.

> 4. Perhaps some tree-like data structure for collisions instead of O(N)?

The way collisions work in this new data structure is that the scan is still O(N), but it's much faster due to cache locality. The collision measurement I posted above is one where every item has a hash code of 0, for example, so it should help a lot in that scenario.

The data structure is also more collision-resistant: instead of just using the low bits of the hash to pick a bucket, it uses both the high and low bits, so it endures low-quality hashes better as well.
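To make the scan concrete, here is a minimal sketch (my own simplification, not the prototype's actual code) of the F14/dn_simdhash-style suffix match: the low bits of the hash select a bucket, one byte derived from the high bits is stored per occupied slot, and a single Vector128 comparison finds candidate slots whose keys are then compared for real.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static class SuffixScanSketch
{
    // Returns the index of the first slot whose stored suffix matches, or -1.
    // Real implementations iterate every set bit in the mask and reserve some
    // lanes for bucket metadata; this sketch uses all 16 lanes for simplicity.
    public static int FindCandidateSlot(Vector128<byte> storedSuffixes, uint hash)
    {
        byte suffix = (byte)(hash >> 24); // high bits, independent of the bucket index
        Vector128<byte> matches = Vector128.Equals(storedSuffixes, Vector128.Create(suffix));
        uint mask = Vector128.ExtractMostSignificantBits(matches); // one bit per matching lane
        return mask == 0 ? -1 : BitOperations.TrailingZeroCount(mask);
    }
}
```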

> Still, the SIMD generic approach looks promising, not sure I understand it fully yet πŸ™‚

For anyone who wants a detailed explanation of some of the principles behind this kind of dictionary, https://engineering.fb.com/2019/04/25/developer-tools/f14/?r=1 goes into depth (though this prototype and dn_simdhash are not identical to F14 in design).

> but this made the code generated by ryujit much worse compared to manually duplicating logic for each operation.
>
> I think it would definitely be nice to study and improve this in the JIT if possible/reasonable. Is there a minimal repro for the JIT issue?

I can probably roll back to a previous commit and try to build a condensed repro. Most of the problems I've seen so far are from things not staying in registers (it's hard to get 'real' disassembly reliably; Disasmo and BDN's DisassemblyDiagnoser both break randomly), and encapsulation via structs seems to consistently spill values to the stack. The latter makes sense unless the JIT can decompose a struct into many scalars (promotion).

Encapsulating via methods seems to work fine as long as they get inlined properly, in terms of exploiting registers.

I also tried encapsulating by having outer control functions invoke statics on generic interfaces, but the statics never seemed to get inlined, which made that a non-starter; maybe I was doing something wrong. I.e. a generic EnumerateBuckets<TCallback> method which got invoked with a generic type for each operation: FindValueCallback, InsertCallback, etc. It worked, but the lack of inlining made it meaningfully slower.
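A hypothetical reconstruction of that experiment (all names mine), using static abstract interface members; per the note above, the shape works functionally but the static interface calls reportedly didn't inline:

```csharp
interface IScanCallback
{
    static abstract bool OnSlot(int slot); // return true to stop scanning
}

struct FindValueCallback : IScanCallback
{
    public static bool OnSlot(int slot) => slot == 3; // stand-in for a key comparison
}

static class Scanner
{
    // One shared loop; because TCallback is a struct, each operation gets its
    // own specialized instantiation of this method.
    public static int EnumerateBuckets<TCallback>(int slotCount)
        where TCallback : struct, IScanCallback
    {
        for (int i = 0; i < slotCount; i++)
            if (TCallback.OnSlot(i))
                return i;
        return -1;
    }
}
```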

> Is it reasonable to assume basic Vector128 support on most targets, and accept that a scalar implementation might be slightly slower than the current one?
>
> For CoreCLR/NativeAOT we have *-arm32, linux-x86, loongarch64 and risc-v64 which don't accelerate it (plus, the community NativeAOT-LLVM target). I presume some Mono targets don't accelerate it either.

Mono does have Vector128 support, but it's good to know that we have so many arches without it. Is that still true if you use the arch-specific methods (i.e. the SSE2 methods on linux-x86) instead of the generic Vector128 ones? I noticed the arch-specific APIs are missing some intrinsics, with the seeming intent that you use Vector128 instead.

> • Majority of dictionaries are very small. Memory consumption of Dictionaries that contain just a few elements is important.

This makes sense. One downside to this approach is that, since we're going wide, the minimum capacity of one of these dictionaries is around 14 items; however, the cost per item is lower than S.C.G.Dictionary's, so that might work out. It might be possible to shrink the keys/values storage even if the buckets are larger, but that would come at the cost of more allocations/rehash operations when growing.

> • Apps tend to have a lot of Dictionary instantiations. The code size of Dictionary instantiation is important.

Encapsulation could help a lot here then, if we could figure out a way to get it to generate good jitcode. There are parts of the core search algorithm that are fully type-agnostic, so if we can find a way to efficiently invoke them instead of inlining them, it would reduce code size a lot.

From looking at the disassembly there are also a lot of stack loads/stores that could be pruned if we figured out a way to keep all the critical stuff in registers.

> • It is very important that the Dictionary stays type and memory safe even when user code has a bug and calls the Dictionary method in non-thread safe way. It is ok for the Dictionary itself to get corrupted in this case, but it should not lead to buffer overruns, etc.

It sounds then like it's not acceptable to make liberal use of ref and Unsafe.Add, and we'd need to stick to Span unless we can be certain the ref manipulation is safe. That makes sense.
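The contrast, condensed into a sketch (assuming a plain int[] entries array; this is not the prototype's code): Span indexing keeps a bounds check, so a stale index produced by a racing writer throws instead of corrupting memory, while Unsafe.Add checks nothing.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class SafetySketch
{
    public static int ReadSafe(int[] entries, int count, int i)
        => entries.AsSpan(0, count)[i]; // throws on a bad count or index

    public static int ReadUnchecked(int[] entries, int i)
    {
        ref int start = ref MemoryMarshal.GetArrayDataReference(entries);
        return Unsafe.Add(ref start, i); // no check: a bad i is a buffer overrun
    }
}
```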

My understanding is that interior pointers to arrays with 'ref' will keep the GC from collecting or moving the arrays, is that right? So even if a grow+rehash occurs while another thread is iterating the old buffers, we won't crash, we'll just see inconsistent internal state.

> i.e. changing which comparer is used for strings once buckets become unbalanced; detecting concurrent modification; the Key and Values collections being classes instead of structs.
>
> Yes, it is important for security reasons - defense in depth against DoS attacks.

I see, so we'd need a way to detect collision-based attacks and do a rehash with a DoS-resistant comparer. I think it might be possible to detect this at insertion time to keep lookups fast, but I'll have to think about it.

jkotas commented 2 months ago

> because instead of just using the low bits of the hash, it uses both the high and low bits. So it endures low-quality hashes better as well.

The prime-modulo operation used by the current Dictionary uses all bits of the hash.
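A tiny worked contrast of the two indexing schemes (the table sizes are arbitrary): a power-of-two table masks away the high bits, while a prime modulo folds every bit of the hash into the bucket index.

```csharp
uint hash = 0xABCD_1234;
uint pow2Index  = hash & (16384 - 1); // keeps only the low 14 bits
uint primeIndex = hash % 16381;       // 16381 is prime; every bit affects the result
System.Console.WriteLine($"{pow2Index} vs {primeIndex}");
```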

> My understanding is that interior pointers to arrays with 'ref' will keep the GC from collecting or moving the arrays, is that right?

Yes, that's right.

Suchiman commented 2 months ago

> My understanding is that interior pointers to arrays with 'ref' will keep the GC from collecting or moving the arrays, is that right?

To be precise, managed interior pointers (T& in IL / ref in C#) do keep the GC from collecting the array, but not from moving it. The GC will update the interior pointer when moving the object.
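A minimal illustration of those semantics (plain C#, nothing dictionary-specific):

```csharp
int[] entries = new int[16];
ref int slot = ref entries[3];   // managed interior pointer into the array
// A compacting GC may relocate `entries`, but it updates `slot` to follow;
// the array also cannot be collected while `slot` is live.
slot = 42;
System.Console.WriteLine(entries[3]); // 42
```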

kg commented 1 month ago

I did some experiments to see how much I can optimize for string keys with the existing intelligence in the JIT. It doesn't seem like I can get EqualityComparer.Default to devirtualize, and using it for strings is a lot slower than explicitly using a StringComparer. So just adding a special case for (typeof(K) == typeof(string)) && (comparer == null) that deliberately selects a StringComparer is a big performance win. Interestingly, most of the time for string keys is spent calculating hash codes (in Marvin), not comparing keys.
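A sketch of that special case (the shape and names are mine, not the actual patch): when TKey is string and no comparer was supplied, deliberately pick a concrete StringComparer instead of EqualityComparer<string>.Default.

```csharp
using System;
using System.Collections.Generic;

static class ComparerSelectionSketch
{
    public static IEqualityComparer<TKey> Resolve<TKey>(IEqualityComparer<TKey>? comparer)
    {
        if (comparer is not null)
            return comparer;
        // Folded at JIT time for value-type TKey; a cheap runtime check under
        // __Canon sharing for reference types.
        if (typeof(TKey) == typeof(string))
            return (IEqualityComparer<TKey>)(object)StringComparer.Ordinal;
        return EqualityComparer<TKey>.Default;
    }
}
```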

It looks like the special string equality comparers used by S.C.G.Dictionary aren't public in the BCL right now, which is weird. Is that only in .NET 9 previews?

I also compared the initial representation (3 buckets/keys/values arrays) with one more like SCG (buckets/entries) and the latter is a bit faster, though not significantly so. The main advantage there is that the success path in TryGetValue becomes a lot faster, so it's probably worth it even if the cache locality for key scans is worse. (In the average case, we won't have to scan more than one key anyway.)
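The two representations under comparison, condensed (field names mine):

```csharp
using System.Runtime.Intrinsics;

// A: parallel arrays; best locality when scanning many keys in one bucket.
sealed class LayoutA<TKey, TValue>
{
    public Vector128<byte>[] Buckets = [];
    public TKey[] Keys = [];
    public TValue[] Values = [];
}

// B: SCG-style entries; key and value sit side by side, which shortens the
// TryGetValue success path at some cost to key-scan locality.
struct Entry<TKey, TValue>
{
    public TKey Key;
    public TValue Value;
}

sealed class LayoutB<TKey, TValue>
{
    public Vector128<byte>[] Buckets = [];
    public Entry<TKey, TValue>[] Entries = [];
}
```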

The ideal representation (if the .NET type system supported it) would be buckets/values, where each bucket looks like this:

```csharp
struct Bucket {
  Vector128<byte> Suffixes;        // one-byte hash suffixes, compared with a single SIMD op
  AutoSizedInlineArray<Key> Keys;  // hypothetical: inline array auto-sized to fill the cache line
}
```

and where AutoSizedInlineArray<Key> is an InlineArray that is automatically sized to a multiple of a cache line size, or an explicit threshold. i.e. you'd want the total size of Bucket to be 64/128/192/256 bytes, which would set an arbitrary limit on how many elements the InlineArray can have. Right now the InlineArray size has to be a compile-time constant regardless of the key size, which makes this impossible. (We'd also need a solution for very big keys.)
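What is expressible today, for contrast: C# 12's [InlineArray] requires a compile-time constant length, so the element count can't adapt to sizeof(Key). E.g. under the 64-byte budget described above, a 16-byte suffix vector leaves 48 bytes, i.e. room for six 8-byte keys, and that 6 must be hard-coded:

```csharp
using System.Runtime.CompilerServices;

[InlineArray(6)] // fixed at compile time, even though the ideal count depends on sizeof(TKey)
struct SixKeys<TKey>
{
    private TKey _element0;
}
```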

Without a type system/runtime solution for that, we have to use an entries array and index it separately, which generates more complex code than if the buckets also contained the keys. Keys-in-buckets is what dn_simdhash uses.

While the current version has great performance for Int64 keys, string keys perform much worse for some reason I can't identify. I thought it might be due to string.Equals invocations, but according to profiler data it's not. I haven't been able to dig into it further because Disasmo and DisassemblyDiagnoser both randomly broke again.

P.S. While refactoring this, I was surprised to discover that while Unsafe.ByteOffset exists, Unsafe.Offset doesn't. Might be good to try to add that in .NET 10...
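For reference, what exists today versus what's being proposed: Unsafe.ByteOffset returns a byte distance, so callers divide by the element size themselves; a hypothetical Unsafe.Offset would return the element count directly.

```csharp
using System.Runtime.CompilerServices;

int[] arr = new int[8];
ref int a = ref arr[1];
ref int b = ref arr[5];
nint bytes = Unsafe.ByteOffset(ref a, ref b); // 16: four 4-byte elements apart
nint elems = bytes / Unsafe.SizeOf<int>();    // 4: what a hypothetical Unsafe.Offset would return
System.Console.WriteLine((bytes, elems));
```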

julealgon commented 1 month ago

Shouldn't this be moved to discussions?

EgorBo commented 1 month ago

> It looks like the special string equality comparers used by S.C.G.Dictionary aren't public in the BCL right now, which is weird. Is that only in .NET 9 previews?

Do you mean the NonRandomized* ones? Those are an internal implementation detail/optimization inside Dictionary: we use these fast comparers (with a much faster hash code than Marvin) until we have too many collisions, then we switch to the standard comparers (and rehash everything) based on the normal (Marvin) hash code, to protect against the attack Jan mentioned above.
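A condensed, self-contained sketch of that mechanism (names and threshold are mine; the real comparers and switching logic are internal to Dictionary): hash with a cheap non-randomized function until a collision chain crosses a threshold, then flip to the randomized hash and have the owning dictionary rehash every entry.

```csharp
using System.Collections.Generic;

sealed class AdaptiveStringComparer : IEqualityComparer<string>
{
    private const int CollisionThreshold = 100; // assumption; the real value is an implementation detail
    private bool _randomized;

    public bool Equals(string? x, string? y) => string.Equals(x, y);

    public int GetHashCode(string s)
    {
        if (_randomized)
            return s.GetHashCode(); // randomized, Marvin-based

        // Stand-in for the internal non-randomized fast path: plain FNV-1a.
        uint h = 2166136261;
        foreach (char c in s)
            h = (h ^ c) * 16777619;
        return (int)h;
    }

    // The owning dictionary calls this when it observes a long collision
    // chain, and must rehash every entry if it returns true.
    public bool OnLongCollisionChain(int chainLength)
    {
        if (_randomized || chainLength <= CollisionThreshold)
            return false;
        _randomized = true;
        return true;
    }
}
```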

> I did some experiments to see how much I can optimize for string keys with the existing intelligence in the JIT. It doesn't seem like I can get EqualityComparer.Default to devirtualize, and using it for strings is a lot slower than explicitly using a StringComparer.

This is a known limitation we need to address in the JIT: basically, we can't devirtualize EqualityComparer<__Canon>.Default.Equals at all, even with PGO. There was an issue for it somewhere; I'll attach a small repro to this comment later.

UPD: Btw, here I tried to specialize comparers for string keys in Dictionary by hand πŸ™‚ https://github.com/EgorBo/runtime-1/commit/1e67807dbbaecf46e5e426d90ec2f07446f432fd And here is a minimal repro for the JIT issue: https://github.com/EgorBot/runtime-utils/issues/91#issuecomment-2353598574

jkotas commented 1 month ago

> It looks like the special string equality comparers used by S.C.G.Dictionary aren't public in the BCL right now, which is weird. Is that only in .NET 9 previews?

It has been like that for the last 10+ years; we have not productized a public API for this optimization. Also, the way the optimization is done has changed a few times to make it ever faster while preserving the defense against DoS.

kg commented 1 month ago

Is the BCL team interested in seeing a draft PR that reworks S.C.G.Dictionary based on my current research? I don't want to waste time during my hack week if it's unwanted, but I've made enough progress that it might make sense to stop experimenting outside the BCL and do it inside. The catch is that migrating it in will take a good amount of time.

The TL;DR is that the data is mixed, but the best-case improvement at the moment for lookups (with optimal key hashes) is a 25% reduction, and insertions also get ~5-10% faster. I believe we could get better numbers if people with more optimization experience took a crack at it: I managed to get the lookup hot path entirely into registers, but it could be much better, and in past iterations of the prototype I saw 30%. I'm testing with Int64 keys since it's not possible to do a like-for-like comparison with string keys. EDIT 2: I should note that the performance improvement on my Intel system is truly massive (50-75%), but I think the Ryzen numbers are more realistic.

Performance on my work machine (an intel laptop) is better than the numbers I'm citing below, but it's hard to run measurements there because it only has 2 P-cores.

Summary of current data on my 7950X for SimdDictionary<Int64, Int64> with 4096 items:

EDIT 2: After some refactoring I managed to get decent performance with prime bucket counts, though it's still not as good as power-of-two.