kg opened this issue 2 months ago
Current x86-64 measurements from the standalone prototype, for reference:

| Type | Method | Mean | Gen0 | Allocated |
|---|---|---|---|---|
| BCLCollisions | AddSameRepeatedlyThenClear | 8.048 us | - | - |
| SimdCollisions | AddSameRepeatedlyThenClear | 10.306 us | - | - |
| BCLCollisions | FillThenClear | 2,100.095 us | - | 2 B |
| SimdCollisions | FillThenClear | 1,834.852 us | - | 1 B |
| BCLCollisions | FindExistingWithCollisions | 3,126.392 us | - | 2 B |
| SimdCollisions | FindExistingWithCollisions | 2,655.567 us | - | 2 B |
| BCLCollisions | FindMissingWithCollisions | 4,120.161 us | - | 3 B |
| SimdCollisions | FindMissingWithCollisions | 3,518.738 us | - | 2 B |
| BCLInsertion | ClearThenRefill | 55.339 us | - | - |
| SimdInsertion | ClearThenRefill | 92.654 us | - | - |
| BCLInsertion | InsertExisting | 32.139 us | - | - |
| SimdInsertion | InsertExisting | 24.198 us | - | - |
| BCLIterate | EnumerateKeys | 13.041 us | 3.8910 | 65632 B |
| SimdIterate | EnumerateKeys | 21.884 us | 3.9063 | 65696 B |
| BCLIterate | EnumeratePairs | 84.610 us | 41.6260 | 131198 B |
| SimdIterate | EnumeratePairs | 76.850 us | 41.6260 | 131222 B |
| BCLIterate | EnumerateValues | 13.124 us | 3.8910 | 65632 B |
| SimdIterate | EnumerateValues | 20.892 us | 3.9063 | 65696 B |
| BCLLookup | FindExisting | 37.693 us | NA | NA |
| SimdLookup | FindExisting | 30.653 us | NA | NA |
| BCLLookup | FindMissing | 30.034 us | NA | NA |
| SimdLookup | FindMissing | 21.917 us | NA | NA |
| BCLRemoval | RemoveItemsThenRefill | 93.145 us | - | - |
| SimdRemoval | RemoveItemsThenRefill | 100.580 us | - | - |
| BCLRemoval | RemoveMissing | 24.898 us | - | - |
| SimdRemoval | RemoveMissing | 29.243 us | - | - |
| BCLResize | CreateDefaultSizeAndFill | 294.347 us | 222.1680 | 942129 B |
| SimdResize | CreateDefaultSizeAndFill | 276.685 us | 213.8672 | 983853 B |
It's roughly a match or superior on most of the measurements I came up with. I'm not sure why remove is slower, but I specifically haven't micro-optimized that method in the prototype. To me, the 20-30% improvement on TryGetValue w/o collisions and the 15% improvement on TryGetValue w/collisions feel like they could justify the time investment.
cc @stephentoub since he provided a lot of useful guidance so far.
Interesting! Curious to see the outcomes, my 5 cents to the topic:
> - I quite often see `Dictionary`'s methods on hot paths (in first-party services), e.g. lookups where the Key is `string` and the comparer is either default or `OrdinalIgnoreCase`. E.g. grep.app says that more than 75% of all dictionaries have string as the Key (see grep.app). Although, this doesn't mean that we should not optimize it for other types, etc.
It would be interesting to consider a special-cased implementation of everything specifically for string, if it would be such a high bang-for-buck compared to other types; i.e. we keep the `_Canon`-based implementation for every reference type except string.
> - It seems that `Dictionary<>` may benefit a lot from generic specialization for reference types, e.g. when we do `dictionary["key"]` we end up calling a shared `FindValue(_Canon)` method + we currently never try to devirtualize `EqualityComparer.Default.X` for it.
I'm guessing this is partially on the ryujit side? I saw in the comments for the SCG dictionary that value-type `EqualityComparer<T>.Default` calls do devirtualize if you guard them with `IsValueType`. It would make sense that generic specialization for reference types could help, indeed, since there are so many dictionary instantiations that the `_Canon` method won't devirtualize the comparer's Equals/GetHashCode methods.
> - The string hashcode can potentially be replaced by XxHash3 (Is it time to replace Marvin? #85206); maybe we can even remove the hack with "NonRandomizedHashCode until it collides"
It sounds like this might be worth investigating separately, since it would improve string perf even for the existing Dictionary.
> - Perhaps, some tree-like data structure for collisions instead of O(N)?
The way collisions work in this new data structure is that it's still O(N), but the scan is much faster due to cache locality. The measurement for collisions I posted above is where every item has a hashcode of 0, for example. So it should help for that scenario a lot.
The data structure is also more collision-resistant, because instead of just using the low bits of the hash, it uses both the high and low bits. So it endures low-quality hashes better as well.
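For anyone unfamiliar with the technique, the suffix-based bucket scan described above can be sketched with the portable `Vector128` API. This is a minimal illustrative sketch (method and variable names are mine, not the prototype's):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static class SuffixScan
{
    // Each bucket stores one "suffix" byte per slot, taken from hash bits that
    // were not used to select the bucket. A lookup compares the search key's
    // suffix against all slots at once, then only does a full key comparison
    // on slots whose suffix matched.
    public static int FindFirstMatch(Vector128<byte> suffixes, byte searchSuffix)
    {
        // Compare all 16 lanes against the target suffix in one operation.
        Vector128<byte> matches = Vector128.Equals(suffixes, Vector128.Create(searchSuffix));
        // One bit per matching lane; the lowest set bit is the first candidate slot.
        uint mask = Vector128.ExtractMostSignificantBits(matches);
        return mask != 0 ? BitOperations.TrailingZeroCount(mask) : -1;
    }
}
```

Because even items that collide into the same bucket usually still differ in suffix, most negative probes finish without touching a single key, which is where the cache-locality win comes from.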
> Still, the SIMD generic approach looks promising, not sure I understand it fully yet 🙂
For anyone who wants a detailed explanation of some of the principles behind this kind of dictionary, https://engineering.fb.com/2019/04/25/developer-tools/f14/?r=1 goes into depth (though this/dn-simdhash are not identical in design)
> but this made the code generated by ryujit much worse compared to manually duplicating logic for each operation.

> I think it would definitely be nice to study and improve in the JIT if possible/reasonable; is there a minimal repro for the JIT issue?
I can probably roll back to a previous commit and try to build a condensed repro. Most of the problems I've seen so far (it's hard to get 'real' disassembly reliably; disasmo and bdn DisassemblyDiagnoser both break randomly) are from things not staying in registers, and encapsulation via structs seems to consistently spill stuff to the stack. The latter makes sense unless we have structure->many scalars decomposition.
Encapsulating via methods seems to work fine as long as they get inlined properly, in terms of exploiting registers.
I also tried encapsulating by having outer control functions which invoked statics on generic interfaces, but the statics never seemed to get inlined, which made that a non-starter. Maybe I was doing something wrong. i.e. a generic `EnumerateBuckets<TCallback>` method which got invoked with a generic type for each operation: `FindValueCallback`, `InsertCallback`, etc. It worked, but the lack of inlining made it meaningfully slower.
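For context, the pattern described above looks roughly like this (a hedged sketch; `IBucketCallback` and `BucketScanner` are illustrative names, not the prototype's):

```csharp
// A per-operation callback expressed as a static abstract interface member, so a
// single generic scan loop can be shared by find/insert/remove. In principle the
// JIT could specialize and inline TCallback.OnKey per instantiation; the report
// above is that this inlining did not happen reliably.
interface IBucketCallback<TKey>
{
    // Returns true to stop the scan (e.g. the key was found).
    static abstract bool OnKey(ref TKey key, int slot);
}

static class BucketScanner
{
    public static void EnumerateBuckets<TKey, TCallback>(TKey[] keys)
        where TCallback : IBucketCallback<TKey>
    {
        for (int i = 0; i < keys.Length; i++)
        {
            if (TCallback.OnKey(ref keys[i], i))
                return;
        }
    }
}
```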
> Is it reasonable to assume basic Vector128 support on most targets, and accept that a scalar implementation might be slightly slower than the current one?

> For CoreCLR/NativeAOT we have `*-arm32`, linux-x86, loongarch64 and risc-v64, which don't accelerate it (plus the community NativeAOT-LLVM target). I presume some Mono targets don't accelerate it either.
Mono does have V128 support, but it's good to know that we have so many arches without it. Is that still true if you use the arch-specific methods - i.e. the SSE2 methods on linux-x86 - instead of the generic V128 ones? I noticed the arch-specific APIs are missing some intrinsics with the seeming intent that you use V128 instead.
> - Majority of dictionaries are very small. Memory consumption of Dictionaries that contain just a few elements is important.
This makes sense. One downside to this approach is that since we're going wide, the minimum capacity of one of these dictionaries is around 14 items; however, the cost-per-item is lower than SCG.Dictionary, so that might work out. It might be possible to shrink the size of the keys/values storage even if the buckets are larger, but that would come at the cost of more allocations/rehash operations when growing.
> - Apps tend to have a lot of Dictionary instantiations. The code size of Dictionary instantiation is important.
Encapsulation could help a lot here then, if we could figure out a way to get it to generate good jitcode. There are parts of the core search algorithm that are fully type-agnostic, so if we can find a way to efficiently invoke them instead of inlining them, it would reduce code size a lot.
From looking at the disassembly there are also a lot of stack loads/stores that could be pruned if we figured out a way to keep all the critical stuff in registers.
> - It is very important that the Dictionary stays type and memory safe even when user code has a bug and calls the Dictionary method in non-thread safe way. It is ok for the Dictionary itself to get corrupted in this case, but it should not lead to buffer overruns, etc.
It sounds then like it's not acceptable to make liberal use of `ref` and `Unsafe.Add`, and we'd need to stick to `Span` unless we can be certain the ref manipulation is safe. That makes sense.
My understanding is that interior pointers to arrays with 'ref' will keep the GC from collecting or moving the arrays, is that right? So even if a grow+rehash occurs while another thread is iterating the old buffers, we won't crash, we'll just see inconsistent internal state.
> i.e. changing which comparer is used for strings once buckets become unbalanced; detecting concurrent modification; the Key and Values collections being classes instead of structs.
Yes, it is important for security reasons - defense in depth against DoS attacks.
I see, so we'd need a way to detect collision-based attacks and do a rehash with a DoS-resistant comparer. I think it might be possible to detect this at insertion time to keep lookups fast, but I'll have to think about it.
> because instead of just using the low bits of the hash, it uses both the high and low bits. So it endures low-quality hashes better as well.
The prime-modulo operation used by the current Dictionary uses all bits of the hash.
> My understanding is that interior pointers to arrays with 'ref' will keep the GC from collecting or moving the arrays, is that right?
Yes, that's right.
> My understanding is that interior pointers to arrays with 'ref' will keep the GC from collecting or moving the arrays, is that right?

To be precise, managed interior pointers (`T&` in IL / `ref` in C#) do keep the GC from collecting the array, but not from moving it. The GC will update the interior pointer when moving the object.
I did some experiments to see how much I can optimize for string keys with the existing intelligence in the JIT. It doesn't seem like I can get `EqualityComparer.Default` to devirt, and using it for strings is a lot slower than explicitly using a StringComparer. So just adding a special case for `(typeof(K) == typeof(string)) && (comparer == null)` that selects a StringComparer on purpose is a big performance win. Most of the time for string keys is spent calculating hashcodes (in Marvin), not comparing string keys, which is interesting.
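That special case might look something like the following sketch (hedged: `SelectComparer` is an illustrative helper, not actual Dictionary code):

```csharp
using System;
using System.Collections.Generic;

static class ComparerSelection
{
    // When K is string and no comparer was supplied, substitute a concrete
    // StringComparer so its Equals/GetHashCode become direct (devirtualizable)
    // calls instead of interface calls through a shared _Canon method.
    public static IEqualityComparer<K> SelectComparer<K>(IEqualityComparer<K>? comparer)
    {
        if (comparer is null && typeof(K) == typeof(string))
            return (IEqualityComparer<K>)(object)StringComparer.Ordinal;
        return comparer ?? EqualityComparer<K>.Default;
    }
}
```

The check runs once at construction time, so it adds nothing to the per-lookup cost.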
It looks like the special string equality comparers used by S.C.G.Dictionary aren't public in the BCL right now, which is weird. Is that only in .NET 9 previews?
I also compared the initial representation (3 buckets/keys/values arrays) with one more like SCG (buckets/entries) and the latter is a bit faster, though not significantly so. The main advantage there is that the success path in TryGetValue becomes a lot faster, so it's probably worth it even if the cache locality for key scans is worse. (In the average case, we won't have to scan more than one key anyway.)
The ideal representation (if the .NET type system supported it) would be buckets/values, where each bucket looks like this:
```csharp
struct Bucket {
    Vector128<byte> Suffixes;
    AutoSizedInlineArray<Key> Keys;
}
```
and where `AutoSizedInlineArray<Key>` is an InlineArray that is automatically sized to a multiple of the cache line size, or an explicit threshold; i.e. you'd want the total size of Bucket to be 64/128/192/256 bytes, which would set an arbitrary limit on how many elements the InlineArray can have. Right now the InlineArray size has to be a compile-time constant regardless of the key size, which makes this impossible. (We'd also need a solution for very big keys.)
Without a type system/runtime solution for that, we have to use an entries array and index it separately, which generates more complex code than if the buckets also contained the keys. Keys-in-buckets is what dn_simdhash uses.
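For contrast, here is roughly what is expressible today, with the slot count fixed at compile time (a hedged sketch; the 14-slot figure matches the minimum capacity mentioned elsewhere in the thread, but is otherwise illustrative):

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

// The InlineArray length must be a literal constant, regardless of sizeof(TKey),
// so the bucket cannot auto-size itself to a cache-line multiple.
[InlineArray(14)]
struct KeySlots<TKey>
{
    private TKey _element0;
}

struct Bucket<TKey>
{
    public Vector128<byte> Suffixes; // one hash-suffix byte per slot
    public KeySlots<TKey> Keys;      // always 14 slots, whatever the key size
}
```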
While the current version has great performance for Int64 keys, string keys perform way worse for some reason I can't identify. I thought it might be due to string.Equals invocations, but according to profiler data, it's not. I haven't been able to dig into that because disasmo and disassemblydiagnoser both randomly broke again.
P.S. While refactoring this, I was surprised to discover that while `Unsafe.ByteOffset` exists, `Unsafe.Offset` doesn't. Might be good to try and add that in .NET 10...
Shouldn't this be moved to discussions?
> It looks like the special string equality comparers used by S.C.G.Dictionary aren't public in the BCL right now, which is weird. Is that only in .NET 9 previews?
Do you mean the NonRandomized* ones? Those are an internal implementation detail/an optimization inside Dictionary - we use these fast comparers (with a much faster hashcode than Marvin) until we have too many collisions, then we switch to the standard comparers (and rehash everything) based on the normal hashcode (Marvin) to protect from the attack Jan mentioned above.
> I did some experiments to see how much I can optimize for string keys with the existing intelligence in the JIT. It doesn't seem like I can get EqualityComparer.Default to devirt, and using it for strings is a lot slower than explicitly using a StringComparer.
This is a known limitation we need to address in the JIT: basically, we can't devirtualize `EqualityComparer.Default<_Canon>.Equals` at all, even with PGO; there was an issue for it somewhere, I'll attach a small repro to this comment later.
UPD: Btw, here I tried to specialize comparers for String keys in Dictionary by hand 🙂: https://github.com/EgorBo/runtime-1/commit/1e67807dbbaecf46e5e426d90ec2f07446f432fd And here is a minimal repro for the JIT issue: https://github.com/EgorBot/runtime-utils/issues/91#issuecomment-2353598574
> It looks like the special string equality comparers used by S.C.G.Dictionary aren't public in the BCL right now, which is weird. Is that only in .NET 9 previews?
It has been like that for the last 10+ years. We have not productized a public API for this optimization. Also, the way this optimization is done has changed a few times to make it ever faster, while preserving the defense against DoS.
Is the BCL team interested in seeing a draft PR that reworks SCG Dictionary based on my current research? I don't want to waste time during my hack week if it's unwanted, but I've made enough progress that it might make sense to stop experimenting outside of the BCL and do it inside. It'll take a good amount of time to migrate it in, is the thing.
The TL;DR is that the data is mixed, but the best-case improvement at the moment for lookups (with optimal key hashes) is a 25% reduction in lookup time. Insertions also get ~5-10% faster. I believe we could get better numbers if people with more optimization experience take a crack at it - I managed to get the lookup hot path entirely into registers, but it could be much better. In past iterations of the prototype I saw 30%. I'm testing with Int64 keys since it's not possible to do a like-for-like comparison with String keys. EDIT 2: I should note that the performance improvement on my Intel system is truly massive (50-75%), but I think the Ryzen numbers are more realistic.
Performance on my work machine (an intel laptop) is better than the numbers I'm citing below, but it's hard to run measurements there because it only has 2 P-cores.
`SimdDictionary<Int64, Int64>` with 4096 items:

EDIT 2: After some refactoring I managed to get decent performance with prime bucket counts, though it's still not as good as power-of-two.
I'm looking into vectorizing S.C.G.Dictionary using Vector128. I'm at the point of figuring out how much work it would be to prototype it in-tree, as I have a basic standalone prototype (based on dn_simdhash) with good microbenchmark numbers. Some questions that come to mind:
EDIT: Adding a link to the F14 blog post, which is a similar 14-wide vectorized dictionary: https://engineering.fb.com/2019/04/25/developer-tools/f14/?r=1 It explains a lot of the concepts at work here + the tradeoffs quite well. A lot of what they do isn't possible in the current .NET type system.