dotnet / runtime


What are the acceptance/success criteria for an improved S.C.G.Dictionary? #107830

Open kg opened 2 months ago

kg commented 2 months ago

I'm looking into vectorizing S.C.G.Dictionary using Vector128. I'm at the point of figuring out how much work it would take to prototype it in-tree, as I have a basic standalone prototype (based on dn_simdhash) with good microbenchmark numbers. Some questions that come to mind are quoted in the replies below.

EDIT: Adding a link to the F14 blog post, which describes a similar 14-wide vectorized dictionary: https://engineering.fb.com/2019/04/25/developer-tools/f14/?r=1 It explains a lot of the concepts at work here, plus the tradeoffs, quite well. A lot of what they do isn't possible in the current .NET type system.


kg commented 2 months ago
Current x86-64 measurements from the standalone prototype, for reference:

| Type | Method | Mean | Gen0 | Allocated |
|------|--------|-----:|-----:|----------:|
| BCLCollisions | AddSameRepeatedlyThenClear | 8.048 us | - | - |
| SimdCollisions | AddSameRepeatedlyThenClear | 10.306 us | - | - |
| BCLCollisions | FillThenClear | 2,100.095 us | - | 2 B |
| SimdCollisions | FillThenClear | 1,834.852 us | - | 1 B |
| BCLCollisions | FindExistingWithCollisions | 3,126.392 us | - | 2 B |
| SimdCollisions | FindExistingWithCollisions | 2,655.567 us | - | 2 B |
| BCLCollisions | FindMissingWithCollisions | 4,120.161 us | - | 3 B |
| SimdCollisions | FindMissingWithCollisions | 3,518.738 us | - | 2 B |
| BCLInsertion | ClearThenRefill | 55.339 us | - | - |
| SimdInsertion | ClearThenRefill | 92.654 us | - | - |
| BCLInsertion | InsertExisting | 32.139 us | - | - |
| SimdInsertion | InsertExisting | 24.198 us | - | - |
| BCLIterate | EnumerateKeys | 13.041 us | 3.8910 | 65632 B |
| SimdIterate | EnumerateKeys | 21.884 us | 3.9063 | 65696 B |
| BCLIterate | EnumeratePairs | 84.610 us | 41.6260 | 131198 B |
| SimdIterate | EnumeratePairs | 76.850 us | 41.6260 | 131222 B |
| BCLIterate | EnumerateValues | 13.124 us | 3.8910 | 65632 B |
| SimdIterate | EnumerateValues | 20.892 us | 3.9063 | 65696 B |
| BCLLookup | FindExisting | 37.693 us | NA | NA |
| SimdLookup | FindExisting | 30.653 us | NA | NA |
| BCLLookup | FindMissing | 30.034 us | NA | NA |
| SimdLookup | FindMissing | 21.917 us | NA | NA |
| BCLRemoval | RemoveItemsThenRefill | 93.145 us | - | - |
| SimdRemoval | RemoveItemsThenRefill | 100.580 us | - | - |
| BCLRemoval | RemoveMissing | 24.898 us | - | - |
| SimdRemoval | RemoveMissing | 29.243 us | - | - |
| BCLResize | CreateDefaultSizeAndFill | 294.347 us | 222.1680 | 942129 B |
| SimdResize | CreateDefaultSizeAndFill | 276.685 us | 213.8672 | 983853 B |

It roughly matches or beats the BCL on most of the measurements I came up with. I'm not sure why removal is slower; I specifically haven't micro-optimized that method in the prototype. To me, the 20-30% improvement on TryGetValue without collisions and the 15% improvement on TryGetValue with collisions feel like they could justify the time investment.

kg commented 2 months ago

cc @stephentoub, since he has provided a lot of useful guidance so far.

EgorBo commented 2 months ago

Interesting! Curious to see the outcomes; my five cents on the topic:

  1. I quite often see Dictionary's methods on hot paths (in first-party services), e.g. lookups where Key is string and the comparer is either the default or OrdinalIgnoreCase. grep.app suggests that more than 75% of all dictionaries have string as the Key. That doesn't mean we shouldn't optimize it for other types, though.
  2. It seems that Dictionary<> may benefit a lot from generic specialization for reference types; e.g. when we do dictionary["key"] we end up calling a shared FindValue(__Canon) method, and we currently never try to devirtualize EqualityComparer.Default.X for it.
  3. The string hash code could potentially be replaced by XxHash3 (https://github.com/dotnet/runtime/issues/85206); maybe we can even remove the "NonRandomizedHashCode until it collides" hack.
  4. Perhaps some tree-like data structure for collisions instead of O(N)?

Still, the SIMD generic approach looks promising, not sure I understand it fully yet πŸ™‚

> but this made the code generated by ryujit much worse compared to manually duplicating logic for each operation.

I think it would definitely be nice to study and improve this in the JIT if possible/reasonable. Is there a minimal repro for the JIT issue?

> Is it reasonable to assume basic Vector128 support on most targets, and accept that a scalar implementation might be slightly slower than the current one?

For CoreCLR/NativeAOT we have *-arm32, linux-x86, loongarch64 and risc-v64 which don't accelerate it (plus, the community NativeAOT-LLVM target). I presume some Mono targets don't accelerate it either.

jkotas commented 2 months ago

> i.e. changing which comparer is used for strings once buckets become unbalanced; detecting concurrent modification; the Key and Values collections being classes instead of structs.

Yes, it is important for security reasons - defense in depth against DoS attacks.

kg commented 2 months ago

> Interesting! Curious to see the outcomes; my five cents on the topic:
>
> 1. I quite often see Dictionary's methods on hot paths (in first-party services), e.g. lookups where Key is string and the comparer is either the default or OrdinalIgnoreCase. grep.app suggests that more than 75% of all dictionaries have string as the Key. That doesn't mean we shouldn't optimize it for other types, though.

It would be interesting to consider a special-cased implementation of everything specifically for string, if string is such a high bang-for-buck compared to other types; i.e. we keep the __Canon-based implementation for every reference type except string.

> 2. It seems that Dictionary<> may benefit a lot from generic specialization for reference types; e.g. when we do dictionary["key"] we end up calling a shared FindValue(__Canon) method, and we currently never try to devirtualize EqualityComparer.Default.X for it.

I'm guessing this is partially on the RyuJIT side? I saw in the comments for S.C.G.Dictionary that EqualityComparer.Default calls do devirtualize for value types if you guard them with IsValueType. It makes sense that generic specialization for reference types could help, indeed, since there are so many dictionary instantiations that the shared __Canon method won't devirtualize the comparer's Equals/GetHashCode calls.
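For context, a minimal sketch of the IsValueType guard pattern mentioned above (the method shape and names are mine, not the actual S.C.G.Dictionary code): value-type instantiations get their own JIT-specialized bodies, so the guarded call devirtualizes and inlines, while reference types share one __Canon body where the call stays virtual.

```csharp
using System.Collections.Generic;

static class DevirtSketch
{
    // Hypothetical helper: for value-type TKey the JIT compiles a separate
    // specialized body per instantiation, so EqualityComparer<TKey>.Default
    // has a statically known type inside the guard and Equals devirtualizes.
    // For reference-type TKey all instantiations share one __Canon body and
    // the call remains an interface dispatch.
    public static bool KeysEqual<TKey>(TKey a, TKey b, IEqualityComparer<TKey>? comparer)
    {
        if (typeof(TKey).IsValueType && comparer == null)
            return EqualityComparer<TKey>.Default.Equals(a, b); // devirtualized + inlined
        return (comparer ?? EqualityComparer<TKey>.Default).Equals(a, b); // virtual call
    }
}
```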

> 3. The string hash code could potentially be replaced by XxHash3 (Is it time to replace Marvin? #85206); maybe we can even remove the "NonRandomizedHashCode until it collides" hack.

It sounds like this might be worth investigating separately, since it would improve string perf even for the existing Dictionary.

> 4. Perhaps some tree-like data structure for collisions instead of O(N)?

The way collisions work in this new data structure is that the scan is still O(N), but it's much faster due to cache locality. The collision measurement I posted above is one where every item has a hash code of 0, for example, so it should help a lot in that scenario.

The data structure is also more collision-resistant: instead of just using the low bits of the hash to pick a bucket, it uses both the high and low bits, so it endures low-quality hashes better as well.
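To make the scan concrete, here is a minimal sketch (my own simplification, not the prototype's actual code) of the F14/dn_simdhash-style suffix match: the low bits of the hash select a bucket, one byte derived from the high bits is stored per occupied slot, and a single Vector128 comparison finds candidate slots whose keys are then compared for real.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static class SuffixScanSketch
{
    // Returns the index of the first slot whose stored suffix matches, or -1.
    // Real implementations iterate every set bit in the mask and reserve some
    // lanes for bucket metadata; this sketch uses all 16 lanes for simplicity.
    public static int FindCandidateSlot(Vector128<byte> storedSuffixes, uint hash)
    {
        byte suffix = (byte)(hash >> 24); // high bits, independent of the bucket index
        Vector128<byte> matches = Vector128.Equals(storedSuffixes, Vector128.Create(suffix));
        uint mask = Vector128.ExtractMostSignificantBits(matches); // one bit per matching lane
        return mask == 0 ? -1 : BitOperations.TrailingZeroCount(mask);
    }
}
```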

> Still, the SIMD generic approach looks promising, not sure I understand it fully yet πŸ™‚

For anyone who wants a detailed explanation of some of the principles behind this kind of dictionary, https://engineering.fb.com/2019/04/25/developer-tools/f14/?r=1 goes into depth (though this prototype and dn_simdhash are not identical to F14 in design).

> but this made the code generated by ryujit much worse compared to manually duplicating logic for each operation.
>
> I think it would definitely be nice to study and improve this in the JIT if possible/reasonable. Is there a minimal repro for the JIT issue?

I can probably roll back to a previous commit and try to build a condensed repro. Most of the problems I've seen so far are from things not staying in registers (it's hard to get 'real' disassembly reliably; Disasmo and BDN's DisassemblyDiagnoser both break randomly), and encapsulation via structs seems to consistently spill values to the stack. The latter makes sense unless the JIT can decompose a struct into many scalars (promotion).

Encapsulating via methods seems to work fine as long as they get inlined properly, in terms of exploiting registers.

I also tried encapsulating by having outer control functions invoke statics on generic interfaces, but the statics never seemed to get inlined, which made that a non-starter; maybe I was doing something wrong. I.e. a generic EnumerateBuckets<TCallback> method which got invoked with a generic type for each operation: FindValueCallback, InsertCallback, etc. It worked, but the lack of inlining made it meaningfully slower.
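A hypothetical reconstruction of that experiment (all names mine), using static abstract interface members; per the note above, the shape works functionally but the static interface calls reportedly didn't inline:

```csharp
interface IScanCallback
{
    static abstract bool OnSlot(int slot); // return true to stop scanning
}

struct FindValueCallback : IScanCallback
{
    public static bool OnSlot(int slot) => slot == 3; // stand-in for a key comparison
}

static class Scanner
{
    // One shared loop; because TCallback is a struct, each operation gets its
    // own specialized instantiation of this method.
    public static int EnumerateBuckets<TCallback>(int slotCount)
        where TCallback : struct, IScanCallback
    {
        for (int i = 0; i < slotCount; i++)
            if (TCallback.OnSlot(i))
                return i;
        return -1;
    }
}
```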

> Is it reasonable to assume basic Vector128 support on most targets, and accept that a scalar implementation might be slightly slower than the current one?
>
> For CoreCLR/NativeAOT we have *-arm32, linux-x86, loongarch64 and risc-v64 which don't accelerate it (plus, the community NativeAOT-LLVM target). I presume some Mono targets don't accelerate it either.

Mono does have Vector128 support, but it's good to know that we have so many arches without it. Is that still true if you use the arch-specific methods (i.e. the SSE2 methods on linux-x86) instead of the generic Vector128 ones? I noticed the arch-specific APIs are missing some intrinsics, with the seeming intent that you use Vector128 instead.

> • Majority of dictionaries are very small. Memory consumption of Dictionaries that contain just a few elements is important.

This makes sense. One downside to this approach is that, since we're going wide, the minimum capacity of one of these dictionaries is around 14 items; however, the cost per item is lower than S.C.G.Dictionary's, so that might work out. It might be possible to shrink the keys/values storage even if the buckets are larger, but that would come at the cost of more allocations/rehash operations when growing.

> • Apps tend to have a lot of Dictionary instantiations. The code size of Dictionary instantiation is important.

Encapsulation could help a lot here then, if we could figure out a way to get it to generate good jitcode. There are parts of the core search algorithm that are fully type-agnostic, so if we can find a way to efficiently invoke them instead of inlining them, it would reduce code size a lot.

From looking at the disassembly there are also a lot of stack loads/stores that could be pruned if we figured out a way to keep all the critical stuff in registers.

> • It is very important that the Dictionary stays type and memory safe even when user code has a bug and calls the Dictionary method in non-thread safe way. It is ok for the Dictionary itself to get corrupted in this case, but it should not lead to buffer overruns, etc.

It sounds then like it's not acceptable to make liberal use of ref and Unsafe.Add, and we'd need to stick to Span unless we can be certain the ref manipulation is safe. That makes sense.
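The contrast, condensed into a sketch (assuming a plain int[] entries array; this is not the prototype's code): Span indexing keeps a bounds check, so a stale index produced by a racing writer throws instead of corrupting memory, while Unsafe.Add checks nothing.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class SafetySketch
{
    public static int ReadSafe(int[] entries, int count, int i)
        => entries.AsSpan(0, count)[i]; // throws on a bad count or index

    public static int ReadUnchecked(int[] entries, int i)
    {
        ref int start = ref MemoryMarshal.GetArrayDataReference(entries);
        return Unsafe.Add(ref start, i); // no check: a bad i is a buffer overrun
    }
}
```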

My understanding is that interior pointers to arrays with 'ref' will keep the GC from collecting or moving the arrays, is that right? So even if a grow+rehash occurs while another thread is iterating the old buffers, we won't crash, we'll just see inconsistent internal state.

> i.e. changing which comparer is used for strings once buckets become unbalanced; detecting concurrent modification; the Key and Values collections being classes instead of structs.
>
> Yes, it is important for security reasons - defense in depth against DoS attacks.

I see, so we'd need a way to detect collision-based attacks and do a rehash with a DoS-resistant comparer. I think it might be possible to detect this at insertion time to keep lookups fast, but I'll have to think about it.

jkotas commented 2 months ago

> because instead of just using the low bits of the hash, it uses both the high and low bits. So it endures low-quality hashes better as well.

The prime-modulo operation used by the current Dictionary uses all bits of the hash.
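A tiny worked contrast of the two indexing schemes (the table sizes are arbitrary): a power-of-two table masks away the high bits, while a prime modulo folds every bit of the hash into the bucket index.

```csharp
uint hash = 0xABCD_1234;
uint pow2Index  = hash & (16384 - 1); // keeps only the low 14 bits
uint primeIndex = hash % 16381;       // 16381 is prime; every bit affects the result
System.Console.WriteLine($"{pow2Index} vs {primeIndex}");
```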

> My understanding is that interior pointers to arrays with 'ref' will keep the GC from collecting or moving the arrays, is that right?

Yes, that's right.

Suchiman commented 2 months ago

> My understanding is that interior pointers to arrays with 'ref' will keep the GC from collecting or moving the arrays, is that right?

To be precise, managed interior pointers (T& in IL / ref in C#) do keep the GC from collecting the array, but not from moving it. The GC will update the interior pointer when moving the object.
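A minimal illustration of those semantics (plain C#, nothing dictionary-specific):

```csharp
int[] entries = new int[16];
ref int slot = ref entries[3];   // managed interior pointer into the array
// A compacting GC may relocate `entries`, but it updates `slot` to follow;
// the array also cannot be collected while `slot` is live.
slot = 42;
System.Console.WriteLine(entries[3]); // 42
```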

kg commented 1 month ago

I did some experiments to see how much I can optimize for string keys with the existing intelligence in the JIT. It doesn't seem like I can get EqualityComparer.Default to devirtualize, and using it for strings is a lot slower than explicitly using a StringComparer. So just adding a special case for (typeof(K) == typeof(string)) && (comparer == null) that deliberately selects a StringComparer is a big performance win. Interestingly, most of the time for string keys is spent calculating hash codes (in Marvin), not comparing keys.
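A sketch of that special case (the shape and names are mine, not the actual patch): when TKey is string and no comparer was supplied, deliberately pick a concrete StringComparer instead of EqualityComparer<string>.Default.

```csharp
using System;
using System.Collections.Generic;

static class ComparerSelectionSketch
{
    public static IEqualityComparer<TKey> Resolve<TKey>(IEqualityComparer<TKey>? comparer)
    {
        if (comparer is not null)
            return comparer;
        // Folded at JIT time for value-type TKey; a cheap runtime check under
        // __Canon sharing for reference types.
        if (typeof(TKey) == typeof(string))
            return (IEqualityComparer<TKey>)(object)StringComparer.Ordinal;
        return EqualityComparer<TKey>.Default;
    }
}
```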

It looks like the special string equality comparers used by S.C.G.Dictionary aren't public in the BCL right now, which is weird. Is that only in .NET 9 previews?

I also compared the initial representation (3 buckets/keys/values arrays) with one more like SCG (buckets/entries) and the latter is a bit faster, though not significantly so. The main advantage there is that the success path in TryGetValue becomes a lot faster, so it's probably worth it even if the cache locality for key scans is worse. (In the average case, we won't have to scan more than one key anyway.)
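The two representations under comparison, condensed (field names mine):

```csharp
using System.Runtime.Intrinsics;

// A: parallel arrays; best locality when scanning many keys in one bucket.
sealed class LayoutA<TKey, TValue>
{
    public Vector128<byte>[] Buckets = [];
    public TKey[] Keys = [];
    public TValue[] Values = [];
}

// B: SCG-style entries; key and value sit side by side, which shortens the
// TryGetValue success path at some cost to key-scan locality.
struct Entry<TKey, TValue>
{
    public TKey Key;
    public TValue Value;
}

sealed class LayoutB<TKey, TValue>
{
    public Vector128<byte>[] Buckets = [];
    public Entry<TKey, TValue>[] Entries = [];
}
```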

The ideal representation (if the .NET type system supported it) would be buckets/values, where each bucket looks like this:

```csharp
struct Bucket {
  Vector128<byte> Suffixes;        // one-byte hash suffixes, compared with a single SIMD op
  AutoSizedInlineArray<Key> Keys;  // hypothetical: inline array auto-sized to fill the cache line
}
```

and where AutoSizedInlineArray<Key> is an InlineArray that is automatically sized to a multiple of a cache line size, or an explicit threshold. i.e. you'd want the total size of Bucket to be 64/128/192/256 bytes, which would set an arbitrary limit on how many elements the InlineArray can have. Right now the InlineArray size has to be a compile-time constant regardless of the key size, which makes this impossible. (We'd also need a solution for very big keys.)
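What is expressible today, for contrast: C# 12's [InlineArray] requires a compile-time constant length, so the element count can't adapt to sizeof(Key). E.g. under the 64-byte budget described above, a 16-byte suffix vector leaves 48 bytes, i.e. room for six 8-byte keys, and that 6 must be hard-coded:

```csharp
using System.Runtime.CompilerServices;

[InlineArray(6)] // fixed at compile time, even though the ideal count depends on sizeof(TKey)
struct SixKeys<TKey>
{
    private TKey _element0;
}
```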

Without a type system/runtime solution for that, we have to use an entries array and index it separately, which generates more complex code than if the buckets also contained the keys. Keys-in-buckets is what dn_simdhash uses.

While the current version has great performance for Int64 keys, string keys perform much worse for some reason I can't identify. I thought it might be due to string.Equals invocations, but according to profiler data it's not. I haven't been able to dig into it further because Disasmo and DisassemblyDiagnoser both randomly broke again.

P.S. While refactoring this, I was surprised to discover that while Unsafe.ByteOffset exists, Unsafe.Offset doesn't. Might be good to try to add that in .NET 10...
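For reference, what exists today versus what's being proposed: Unsafe.ByteOffset returns a byte distance, so callers divide by the element size themselves; a hypothetical Unsafe.Offset would return the element count directly.

```csharp
using System.Runtime.CompilerServices;

int[] arr = new int[8];
ref int a = ref arr[1];
ref int b = ref arr[5];
nint bytes = Unsafe.ByteOffset(ref a, ref b); // 16: four 4-byte elements apart
nint elems = bytes / Unsafe.SizeOf<int>();    // 4: what a hypothetical Unsafe.Offset would return
System.Console.WriteLine((bytes, elems));
```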

julealgon commented 1 month ago

Shouldn't this be moved to discussions?

EgorBo commented 1 month ago

> It looks like the special string equality comparers used by S.C.G.Dictionary aren't public in the BCL right now, which is weird. Is that only in .NET 9 previews?

Do you mean the NonRandomized* ones? Those are an internal implementation detail/optimization inside Dictionary: we use these fast comparers (with a much faster hash code than Marvin) until we have too many collisions, then we switch to the standard comparers (and rehash everything) based on the normal (Marvin) hash code, to protect against the attack Jan mentioned above.
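A condensed, self-contained sketch of that mechanism (names and threshold are mine; the real comparers and switching logic are internal to Dictionary): hash with a cheap non-randomized function until a collision chain crosses a threshold, then flip to the randomized hash and have the owning dictionary rehash every entry.

```csharp
using System.Collections.Generic;

sealed class AdaptiveStringComparer : IEqualityComparer<string>
{
    private const int CollisionThreshold = 100; // assumption; the real value is an implementation detail
    private bool _randomized;

    public bool Equals(string? x, string? y) => string.Equals(x, y);

    public int GetHashCode(string s)
    {
        if (_randomized)
            return s.GetHashCode(); // randomized, Marvin-based

        // Stand-in for the internal non-randomized fast path: plain FNV-1a.
        uint h = 2166136261;
        foreach (char c in s)
            h = (h ^ c) * 16777619;
        return (int)h;
    }

    // The owning dictionary calls this when it observes a long collision
    // chain, and must rehash every entry if it returns true.
    public bool OnLongCollisionChain(int chainLength)
    {
        if (_randomized || chainLength <= CollisionThreshold)
            return false;
        _randomized = true;
        return true;
    }
}
```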

> I did some experiments to see how much I can optimize for string keys with the existing intelligence in the JIT. It doesn't seem like I can get EqualityComparer.Default to devirtualize, and using it for strings is a lot slower than explicitly using a StringComparer.

This is a known limitation we need to address in the JIT: basically, we can't devirtualize EqualityComparer<__Canon>.Default.Equals at all, even with PGO. There was an issue for it somewhere; I'll attach a small repro to this comment later.

UPD: Btw, here I tried to specialize comparers for string keys in Dictionary by hand πŸ™‚ https://github.com/EgorBo/runtime-1/commit/1e67807dbbaecf46e5e426d90ec2f07446f432fd And here is a minimal repro for the JIT issue: https://github.com/EgorBot/runtime-utils/issues/91#issuecomment-2353598574

jkotas commented 1 month ago

> It looks like the special string equality comparers used by S.C.G.Dictionary aren't public in the BCL right now, which is weird. Is that only in .NET 9 previews?

It has been like that for the last 10+ years; we have not productized a public API for this optimization. Also, the way the optimization is done has changed a few times to make it ever faster while preserving the defense against DoS.

kg commented 1 month ago

Is the BCL team interested in seeing a draft PR that reworks S.C.G.Dictionary based on my current research? I don't want to waste time during my hack week if it's unwanted, but I've made enough progress that it might make sense to stop experimenting outside the BCL and do it inside. The catch is that migrating it in will take a good amount of time.

The TL;DR is that the data is mixed, but the best-case improvement at the moment for lookups (with optimal key hashes) is a 25% reduction, and insertions also get ~5-10% faster. I believe we could get better numbers if people with more optimization experience took a crack at it: I managed to get the lookup hot path entirely into registers, but it could be much better, and in past iterations of the prototype I saw 30%. I'm testing with Int64 keys since it's not possible to do a like-for-like comparison with string keys. EDIT 2: I should note that the performance improvement on my Intel system is truly massive (50-75%), but I think the Ryzen numbers are more realistic.

Performance on my work machine (an intel laptop) is better than the numbers I'm citing below, but it's hard to run measurements there because it only has 2 P-cores.

Summary of current data on my 7950X for SimdDictionary<Int64, Int64> with 4096 items:

EDIT 2: After some refactoring I managed to get decent performance with prime bucket counts, though it's still not as good as power-of-two.