Closed arunchndr closed 3 years ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @eiriktsarpalis See info in area-owners.md if you want to be subscribed.
Author: | arkalyanms |
---|---|
Assignees: | - |
Labels: | `area-System.Collections`, `untriaged` |
Milestone: | - |
especially when you look at the GetValue numbers
Dictionary uses a hash map and lookup is O(1). ImmutableDictionary uses a tree and lookup is O(log N).
there is a steady regression going all the way back to version 1.5 of the dictionary
You mean the same operation with the same inputs has been getting slower and slower with each version of the nuget package?
cc: @adamsitnik
Dictionary uses a hash map and lookup is O(1). ImmutableDictionary uses a tree and lookup is O(log N)
I understand, but lets compare the Dictionary and the Tree(ImmutableDictionary) states with 1 value loaded in them, so the tree doesn't have to look beyond the root. It still compared as 6ns vs 48ns respectively.
You mean the same operation with the same inputs has been getting slower and slower with each version of the nuget package?
Yes, I picked version 1.5 to run the same benchmarks and values were quite a bit lower. Below is a perf trace comparison of the 2 versions with similar data:
Version 5.0: Name | Exc % | Exc | Inc % | Inc | Fold | When | First | Last |
---|---|---|---|---|---|---|---|---|
|| | |+ system.collections.immutable!System.Collections.Immutable.ImmutableDictionary`2[System.Int32,System.__Canon].get_Item(!0) | 9.8 | 749 | 71.6 | 5,469 | 0 | _18888888787878888878877877761__ | 260.353 | 7,053.892 |
**|| | ||+ system.collections.immutable!System.Collections.Immutable.ImmutableDictionary`2[System.Int32,System.__Canon].TryGetValue(!0,!1&) | 16.3 | 1,246 | 61.1 | 4,668 | 0 | _17777677677667667676766666651__ | 260.353 | 7,053.892** |
|| | |||+ system.collections.immutable!System.Collections.Immutable.ImmutableDictionary`2[System.Int32,System.__Canon].TryGetValue(!0,value class MutationInput,!1&) | 22.8 | 1,739 | 38.7 | 2,953 | 0 | _24444444444444444444444444431__ | 263.329 | 7,053.892 |
|| | ||||+ system.collections.immutable!System.Collections.Immutable.SortedInt32KeyNode1[System.Collections.Immutable.ImmutableDictionary 2+HashBucket[System.Int32,System.__Canon]].TryGetValue(int32,!0&) |
5.9 | 450 | 7.4 | 566 | 0 | _00011000100000000000000100100__ | 274.230 | 7,045.933 |
|| | |||||+ coreclr!? | 1.5 | 112 | 1.5 | 113 | 0 | _0000000_00oo00_0o0oo00000o00o__ | 325.751 | 7,015.911 |
|| | ||||||+ ntoskrnl!? | 0.0 | 1 | 0.0 | 1 | 0 | _____o__ | 3,192.499 | 3,193.499 |
|| | |||||+ ntoskrnl!? | 0.0 | 2 | 0.0 | 3 | 0 | ____o_oo | 4,883.337 | 6,700.499 |
|| | ||||+ system.collections.immutable!System.Collections.Immutable.ImmutableDictionary`2+HashBucket[System.Int32,System.__Canon].TryGetValue(!0,class Comparers,!1&) | 5.9 | 453 | 7.0 | 531 | 0 | _00000000000000010000000000000__ | 263.329 | 7,050.929 |
Version 1.5
Name | Exc % | Exc | Inc % | Inc | Fold | When | First | Last |
---|---|---|---|---|---|---|---|---|
|| | | + BenchmarkDotNet!Engine.RunIteration | 0.0 | 0 | 65.2 | 7,523 | 0 | _099999899989999999999980___ | 1,337.406 | 9,566.603 |
|| | | + 709f9c6b-ef0d-4179-9d68-e29fe7348f34!Runnable_1.WorkloadActionUnroll | 2.2 | 251 | 65.1 | 7,513 | 0 | _099999899989999998999980___ | 1,337.406 | 9,566.603 |
|| | | |+ system.collections.immutable!System.SR..cctor() | 0.8 | 93 | 58.9 | 6,801 | 0 | _o88888888888888888888870___ | 1,337.406 | 9,566.603 |
|| | | ||+ system.collections.immutable!SecureObjectPool.NewId | 8.4 | 975 | 58.1 | 6,708 | 0 | _088888788888888887878870___ | 1,337.406 | 9,566.603 |
**|| | | |+ immutabledictionary.benchmark!GetValue.ImmutableDictionary | 2.8 | 326 | 2.8 | 326 | 0 | _00000000000000000000000o___ | 1,351.878 | 9,533.597** |
|| | | |+ coreclr!? | 0.8 | 97 | 0.8 | 97 | 0 | _o00_00o0000000000oo000o____ | 1,355.877 | 9,397.425 |
Admission time: Only caught the fact that 1.5 was faster accidentally because benchmarkdotnet defaulted to it and it did not match the CPS numbers. Then switched the version to 5 that matches internal usage and the above 10x differences are based off of that. :)
It's not going to be possible to have a meaningful improvement to the read access performance of ImmutableDictionary<TKey, TValue>
while it's backed by an AVL tree (i.e. there will be a significant performance difference in read-heavy scenarios between Dictionary<TKey, TValue>
and ImmutableDictionary<TKey, TValue>
). We recently implemented ImmutableSegmentedDictionary<TKey, TValue>
with the intent of allowing CPS to migrate away from the AVL-tree based collections. If mutations to the immutable collection are frequent and/or incremental (non-batched), eventually ImmutableTreeDictionary<TKey, TValue>
should provide a viable replacement but the current performance in read-heavy scenarios is only marginally better than ImmutableDictionary<TKey, TValue>
.
Thanks @sharwell It looks like there is some overlap with my FastImmutableDictionary implementation. My goal was to cover (a. Continue using ImmutableDictionary, but try to speed up with bug fixes like this one b. ImmutableDictionary that should not accept writes after it's been built (ImmutableSegmentedDictionary seems to achieve the same goal) c. Replace with a read-only collection instead since immutability is not really leveraged (ImmutableArrayDictionary I believe was your recommendation to target this.). Do you happen to have benchmarks for your collections that I can use to compare against my implementation and choose based on usage?
In the above case, the reason I believe there is still some room for improvement inspite of the AVL tree backing is because the slowness of the immutable.Get seems to come from many layers of method delegations as compared to the Dictionary's FindEntry. In the Immutable version, get_Item calls into TryGetValue, which then delegate to a static version of TryGetValue and the top two layers before the static call have very little logic, but they took more than 25% of the CPU time. Then, the static TryGetValue alone took more than 25% of the CPU time, while the overall time inside the two TryGetValue it calls is less than 15%.
The majority of the overhead is in function calls themselves, perhaps removing extra delagations or asking Compiler/JIT to do more method inlining is a viable alternative and would not involve any changes to the AVL tree itself.
@arkalyanms, I took a look at your benchmark, and noted you're running against netcoreapp2.1... that's old news :smile: Can you try something newer? Here's what I see for GetValue / 1.
Method | Runtime | Size | Mean |
---|---|---|---|
Dictionary | .NET 5.0 | 1 | 4.629 ns |
ImmutableDictionary | .NET 5.0 | 1 | 14.435 ns |
Dictionary | .NET Core 3.1 | 1 | 4.743 ns |
ImmutableDictionary | .NET Core 3.1 | 1 | 17.289 ns |
Dictionary | .NET Core 2.1 | 1 | 5.496 ns |
ImmutableDictionary | .NET Core 2.1 | 1 | 41.314 ns |
Ah nice! So there is a lot of goodness I am missing out on. I’ll update and circle back. Thanks @stephentoub
Alright, here are the net5.0 numbers. The delta is not as wide anymore (about 3-4x for Get and AddRange varies drastically based on data). For the cases that has data subject to change, I am inclined to switch over to a more custom ImmutableDictionary implementation I have or the TreeDictionary and to Dictionary if immutability is not required.
I am not sure if we still want to hold onto this bug to investigate some smaller oddities like the GetValue at size 1 (the increase in times with size shows there is some level of tree traversal cost, but there is also an additional overhead cost in ImmutableDictionary, much lower than in netcore2.1 but it's still there)
GetValue: Method | Size | Mean |
---|---|---|
Dictionary | 1 | 5.065 ns |
ImmutableDictionary | 1 | 15.744 ns |
Dictionary | 10 | 4.333 ns |
ImmutableDictionary | 10 | 18.325 ns |
Dictionary | 100 | 4.413 ns |
ImmutableDictionary | 100 | 20.234 ns |
Dictionary | 1000 | 4.637 ns |
ImmutableDictionary | 1000 | 20.448 ns |
CreateRange (Dictionary with constructor/DictionaryCreateRange and iterate over range and add/DictionaryWithCreateRange)
Method | InitialSize | AddSize | Mean | Allocated |
---|---|---|---|---|
DictionaryCreateRange | 0 | 1 | 65.02 ns | 80 B |
DictionaryWithCreateRange | 0 | 1 | 157.04 ns | 216 B |
**ImmutableDictionary | 0 | 1 | 167.61 ns | 136 B** |
DictionaryCreateRange | 0 | 10 | 38.08 ns | 80 B |
DictionaryWithCreateRange | 0 | 10 | 369.04 ns | 992 B |
**ImmutableDictionary | 0 | 10 | 1,408.63 ns | 712 B** |
DictionaryCreateRange | 0 | 100 | 35.19 ns | 80 B |
DictionaryWithCreateRange | 0 | 100 | 2,987.94 ns | 10192 B |
**ImmutableDictionary | 0 | 100 | 33,447.37 ns | 6472 B** |
DictionaryCreateRange | 0 | 1000 | 43.81 ns | 80 B |
DictionaryWithCreateRange | 0 | 1000 | 27,114.84 ns | 102216 B |
**ImmutableDictionary | 0 | 1000 | 463,872.26 ns | 64072 B** |
DictionaryCreateRange | 1 | 1 | 133.85 ns | 248 B |
DictionaryWithCreateRange | 1 | 1 | 154.85 ns | 248 B |
**ImmutableDictionary | 1 | 1 | 262.30 ns | 200 B** |
DictionaryCreateRange | 1 | 10 | 87.02 ns | 248 B |
DictionaryWithCreateRange | 1 | 10 | 320.02 ns | 1024 B |
**ImmutableDictionary | 1 | 10 | 1,626.24 ns | 776 B** |
DictionaryCreateRange | 1 | 100 | 76.13 ns | 248 B |
DictionaryWithCreateRange | 1 | 100 | 2,473.32 ns | 10224 B |
**ImmutableDictionary | 1 | 100 | 22,671.73 ns | 6536 B** |
DictionaryCreateRange | 1 | 1000 | 75.21 ns | 248 B |
DictionaryWithCreateRange | 1 | 1000 | 24,255.32 ns | 102248 B |
**ImmutableDictionary | 1 | 1000 | 379,588.42 ns | 64136 B** |
DictionaryCreateRange | 10 | 1 | 236.46 ns | 472 B |
DictionaryWithCreateRange | 10 | 1 | 281.06 ns | 472 B |
**ImmutableDictionary | 10 | 1 | 296.67 ns | 328 B** |
DictionaryCreateRange | 10 | 10 | 232.14 ns | 472 B |
DictionaryWithCreateRange | 10 | 10 | 502.75 ns | 1168 B |
**ImmutableDictionary | 10 | 10 | 2,035.24 ns | 904 B** |
DictionaryCreateRange | 10 | 100 | 228.71 ns | 472 B |
DictionaryWithCreateRange | 10 | 100 | 2,796.15 ns | 12328 B |
**ImmutableDictionary | 10 | 100 | 23,789.28 ns | 6664 B** |
DictionaryCreateRange | 10 | 1000 | 228.99 ns | 472 B |
DictionaryWithCreateRange | 10 | 1000 | 18,852.06 ns | 57904 B |
**ImmutableDictionary | 10 | 1000 | 372,247.07 ns | 64264 B** |
DictionaryCreateRange | 100 | 1 | 1,692.04 ns | 3160 B |
DictionaryWithCreateRange | 100 | 1 | 2,014.89 ns | 3160 B |
**ImmutableDictionary | 100 | 1 | 550.51 ns | 584 B** |
DictionaryCreateRange | 100 | 10 | 1,688.81 ns | 3160 B |
DictionaryWithCreateRange | 100 | 10 | 2,816.81 ns | 9904 B |
**ImmutableDictionary | 100 | 10 | 2,819.03 ns | 1160 B** |
DictionaryCreateRange | 100 | 100 | 1,723.77 ns | 3160 B |
DictionaryWithCreateRange | 100 | 100 | 3,806.65 ns | 9904 B |
**ImmutableDictionary | 100 | 100 | 28,910.46 ns | 6920 B** |
DictionaryCreateRange | 100 | 1000 | 1,671.09 ns | 3160 B |
DictionaryWithCreateRange | 100 | 1000 | 19,124.78 ns | 55480 B |
**ImmutableDictionary | 100 | 1000 | 379,376.33 ns | 64520 B** |
DictionaryCreateRange | 1000 | 1 | 15,935.85 ns | 31048 B |
DictionaryWithCreateRange | 1000 | 1 | 19,318.62 ns | 31048 B |
**ImmutableDictionary | 1000 | 1 | 687.31 ns | 776 B** |
DictionaryCreateRange | 1000 | 10 | 15,958.08 ns | 31048 B |
DictionaryWithCreateRange | 1000 | 10 | 18,415.86 ns | 31048 B |
**ImmutableDictionary | 1000 | 10 | 3,686.45 ns | 1352 B** |
DictionaryCreateRange | 1000 | 100 | 15,956.17 ns | 31048 B |
DictionaryWithCreateRange | 1000 | 100 | 19,223.19 ns | 31048 B |
**ImmutableDictionary | 1000 | 100 | 37,251.51 ns | 7112 B** |
DictionaryCreateRange | 1000 | 1000 | 15,991.76 ns | 31048 B |
DictionaryWithCreateRange | 1000 | 1000 | 35,218.55 ns | 96424 B |
**ImmutableDictionary | 1000 | 1000 | 433,827.04 ns | 64712 B** |
Honestly these numbers don't seem particularly surprising to me. There have been proposals like the one in #14477 for replacing the AVL tree implementation with something faster.
I'm going to close this issue, since it is not very actionable. I would recommend continuing the conversation in #14477 on potential perf improvements to the internal implementation of immutable hashtables.
I am on the CPS team in Visual Studio and are benchmarking bottlenecks that affect solution load and a big contributor to the slowdown is Immutable collections. Instead of switching out the collection I started benchmarking across versions of the ImmutableDictionary for example and there is a steady regression going all the way back to version 1.5 of the dictionary
Yes, I understand Immutable collections will be slower, but there are clear indications of a perf bug here, especially when you look at the GetValue numbers.
Some benchmark numbers below:
Benchmark repo here - https://devdiv.visualstudio.com/DefaultCollection/Personal/_git/arkalyan
Adding more thoughts from folks who have observed a similar slowness: The slowness of the immutable.Get actually comes from many layers of method delegations.
In the Dictionary<TKey, TValue>, get_Item is almost a single method (only calls FindEntry), TryGetValue has a similar but separated implementation.
On the other side, in the Immutable version, get_Item calls into TryGetValue, which then delegate to a static version of TryGetValue. The top two layers have very little logic, but they took more than 25% of the CPU time.
Then, the static TryGetValue alone took more than 25% of the CPU time, while the overall time inside two TryGetValue it calls is less than 15%. I think due to the core of this logic is so small, the majority overhead is function calls themselves. Maybe one thing to try is to push more functions inline, but either removing some extra delegating, or ask Compiler/JIT to do more method inlining?
http://index/?leftProject=System.Collections.Immutable&leftSymbol=hghocbfkyomt&file=System%5CCollections%5CImmutable%5CImmutableDictionary_2.cs