dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.17k stars 4.72k forks source link

ImmutableDictionary has slowly been regressing and in version 5.0 is about 10X slower than Dictionary #47812

Closed arunchndr closed 3 years ago

arunchndr commented 3 years ago

I am on the CPS team in Visual Studio and are benchmarking bottlenecks that affect solution load and a big contributor to the slowdown is Immutable collections. Instead of switching out the collection I started benchmarking across versions of the ImmutableDictionary for example and there is a steady regression going all the way back to version 1.5 of the dictionary

Yes, I understand Immutable collections will be slower, but there are clear indications of a perf bug here, especially when you look at the GetValue numbers.

Some benchmark numbers below:

GetValue: Method Size Mean
Dictionary 1 6ns
ImmutableDictionary 1 48ns
Dictionary 10 6ns
ImmutableDictionary 10 50ns
Dictionary 100 6ns
ImmutableDictionary 100 58ns
Dictionary 1000 6ns
ImmutableDictionary 1000 58ns
Dictionary 10000 6ns
ImmutableDictionary 10000 77ns
AddRange (SetRange is pretty similar) Method InitialSize AddSize Mean
       
Dictionary 10 1 9ns
ImmutableDictionary 10 1 570ns
Dictionary 10 10 212ns
ImmutableDictionary 10 10 3559ns
Dictionary 10 100 2412ns
ImmutableDictionary 10 100 41525ns
Dictionary 100 1 9ns
ImmutableDictionary 100 1 1040ns
Dictionary 100 10 920ns
ImmutableDictionary 100 10 5113ns
Dictionary 100 100 1810ns
ImmutableDictionary 100 100 49665ns
Dictionary 1000 1 9ns
ImmutableDictionary 1000 1 1392ns
Dictionary 1000 100 820ns
ImmutableDictionary 1000 100 61750ns

Benchmark repo here - https://devdiv.visualstudio.com/DefaultCollection/Personal/_git/arkalyan

Adding more thoughts from folks who have observed a similar slowness: The slowness of the immutable.Get actually comes from many layers of method delegations.

In the Dictionary<TKey, TValue>, get_Item is almost a single method (only calls FindEntry), TryGetValue has a similar but separated implementation.

On the other side, in the Immutable version, get_Item calls into TryGetValue, which then delegate to a static version of TryGetValue. The top two layers have very little logic, but they took more than 25% of the CPU time.

Then, the static TryGetValue alone took more than 25% of the CPU time, while the overall time inside two TryGetValue it calls is less than 15%. I think due to the core of this logic is so small, the majority overhead is function calls themselves. Maybe one thing to try is to push more functions inline, but either removing some extra delegating, or ask Compiler/JIT to do more method inlining?

http://index/?leftProject=System.Collections.Immutable&leftSymbol=hghocbfkyomt&file=System%5CCollections%5CImmutable%5CImmutableDictionary_2.cs

dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 3 years ago

Tagging subscribers to this area: @eiriktsarpalis See info in area-owners.md if you want to be subscribed.

Issue Details
I am on the CPS team in Visual Studio and are benchmarking bottlenecks that [affect solution load](https://dev.azure.com/devdiv/DevDiv/_wiki/wikis/DevDiv.wiki/19818/Speed-up-netcore-solution-load) and a big contributor to the slowdown is Immutable collections. Instead of switching out the collection I started benchmarking across versions of the ImmutableDictionary for example and there is a steady regression going all the way back to version 1.5 of the dictionary Yes, I understand Immutable collections will be slower, but there are clear indications of a perf bug here, especially when you look at the GetValue numbers. Some benchmark numbers below: **GetValue:** Method | Size | Mean -- | -- | -- Dictionary | 1 | 6ns ImmutableDictionary | 1 | 48ns Dictionary | 10 | 6ns ImmutableDictionary | 10 | 50ns Dictionary | 100 | 6ns ImmutableDictionary | 100 | 58ns Dictionary | 1000 | 6ns ImmutableDictionary | 1000 | 58ns Dictionary | 10000 | 6ns ImmutableDictionary | 10000 | 77ns **AddRange (SetRange is pretty similar)** Method | InitialSize | AddSize | Mean -- | -- | -- | --   |   |   |   Dictionary | 10 | 1 | 9ns ImmutableDictionary | 10 | 1 | 570ns Dictionary | 10 | 10 | 212ns ImmutableDictionary | 10 | 10 | 3559ns Dictionary | 10 | 100 | 2412ns ImmutableDictionary | 10 | 100 | 41525ns Dictionary | 100 | 1 | 9ns ImmutableDictionary | 100 | 1 | 1040ns Dictionary | 100 | 10 | 920ns ImmutableDictionary | 100 | 10 | 5113ns Dictionary | 100 | 100 | 1810ns ImmutableDictionary | 100 | 100 | 49665ns Dictionary | 1000 | 1 | 9ns ImmutableDictionary | 1000 | 1 | 1392ns Dictionary | 1000 | 100 | 820ns ImmutableDictionary | 1000 | 100 | 61750ns Benchmark repo here - https://devdiv.visualstudio.com/DefaultCollection/Personal/_git/arkalyan
Author: arkalyanms
Assignees: -
Labels: `area-System.Collections`, `untriaged`
Milestone: -
stephentoub commented 3 years ago

especially when you look at the GetValue numbers

Dictionary uses a hash map and lookup is O(1). ImmutableDictionary uses a tree and lookup is O(log N).

there is a steady regression going all the way back to version 1.5 of the dictionary

You mean the same operation with the same inputs has been getting slower and slower with each version of the nuget package?

stephentoub commented 3 years ago

cc: @adamsitnik

arunchndr commented 3 years ago

Dictionary uses a hash map and lookup is O(1). ImmutableDictionary uses a tree and lookup is O(log N)

I understand, but lets compare the Dictionary and the Tree(ImmutableDictionary) states with 1 value loaded in them, so the tree doesn't have to look beyond the root. It still compared as 6ns vs 48ns respectively.

You mean the same operation with the same inputs has been getting slower and slower with each version of the nuget package?

Yes, I picked version 1.5 to run the same benchmarks and values were quite a bit lower. Below is a perf trace comparison of the 2 versions with similar data:

Version 5.0: Name Exc % Exc Inc % Inc Fold When First Last
||     |         |+ system.collections.immutable!System.Collections.Immutable.ImmutableDictionary`2[System.Int32,System.__Canon].get_Item(!0) 9.8 749 71.6 5,469 0 _18888888787878888878877877761__ 260.353 7,053.892
**||     |         ||+ system.collections.immutable!System.Collections.Immutable.ImmutableDictionary`2[System.Int32,System.__Canon].TryGetValue(!0,!1&) 16.3 1,246 61.1 4,668 0 _17777677677667667676766666651__ 260.353 7,053.892**
||     |         |||+ system.collections.immutable!System.Collections.Immutable.ImmutableDictionary`2[System.Int32,System.__Canon].TryGetValue(!0,value class MutationInput,!1&) 22.8 1,739 38.7 2,953 0 _24444444444444444444444444431__ 263.329 7,053.892
||     |         ||||+ system.collections.immutable!System.Collections.Immutable.SortedInt32KeyNode1[System.Collections.Immutable.ImmutableDictionary2+HashBucket[System.Int32,System.__Canon]].TryGetValue(int32,!0&) 5.9 450 7.4 566 0 _00011000100000000000000100100__ 274.230 7,045.933
||     |         |||||+ coreclr!? 1.5 112 1.5 113 0 _0000000_00oo00_0o0oo00000o00o__ 325.751 7,015.911
||     |         ||||||+ ntoskrnl!? 0.0 1 0.0 1 0 _____o__ 3,192.499 3,193.499
||     |         |||||+ ntoskrnl!? 0.0 2 0.0 3 0 ____o_oo 4,883.337 6,700.499
||     |         ||||+ system.collections.immutable!System.Collections.Immutable.ImmutableDictionary`2+HashBucket[System.Int32,System.__Canon].TryGetValue(!0,class Comparers,!1&) 5.9 453 7.0 531 0 _00000000000000010000000000000__ 263.329 7,050.929

Version 1.5

Name Exc % Exc Inc % Inc Fold When First Last
||     |     |  + BenchmarkDotNet!Engine.RunIteration 0.0 0 65.2 7,523 0 _099999899989999999999980___ 1,337.406 9,566.603
||     |     |   + 709f9c6b-ef0d-4179-9d68-e29fe7348f34!Runnable_1.WorkloadActionUnroll 2.2 251 65.1 7,513 0 _099999899989999998999980___ 1,337.406 9,566.603
||     |     |   |+ system.collections.immutable!System.SR..cctor() 0.8 93 58.9 6,801 0 _o88888888888888888888870___ 1,337.406 9,566.603
||     |     |   ||+ system.collections.immutable!SecureObjectPool.NewId 8.4 975 58.1 6,708 0 _088888788888888887878870___ 1,337.406 9,566.603
**||     |     |   |+ immutabledictionary.benchmark!GetValue.ImmutableDictionary 2.8 326 2.8 326 0 _00000000000000000000000o___ 1,351.878 9,533.597**
||     |     |   |+ coreclr!? 0.8 97 0.8 97 0 _o00_00o0000000000oo000o____ 1,355.877 9,397.425
arunchndr commented 3 years ago

Admission time: Only caught the fact that 1.5 was faster accidentally because benchmarkdotnet defaulted to it and it did not match the CPS numbers. Then switched the version to 5 that matches internal usage and the above 10x differences are based off of that. :)

sharwell commented 3 years ago

It's not going to be possible to have a meaningful improvement to the read access performance of ImmutableDictionary<TKey, TValue> while it's backed by an AVL tree (i.e. there will be a significant performance difference in read-heavy scenarios between Dictionary<TKey, TValue> and ImmutableDictionary<TKey, TValue>). We recently implemented ImmutableSegmentedDictionary<TKey, TValue> with the intent of allowing CPS to migrate away from the AVL-tree based collections. If mutations to the immutable collection are frequent and/or incremental (non-batched), eventually ImmutableTreeDictionary<TKey, TValue> should provide a viable replacement but the current performance in read-heavy scenarios is only marginally better than ImmutableDictionary<TKey, TValue>.

arunchndr commented 3 years ago

Thanks @sharwell It looks like there is some overlap with my FastImmutableDictionary implementation. My goal was to cover (a. Continue using ImmutableDictionary, but try to speed up with bug fixes like this one b. ImmutableDictionary that should not accept writes after it's been built (ImmutableSegmentedDictionary seems to achieve the same goal) c. Replace with a read-only collection instead since immutability is not really leveraged (ImmutableArrayDictionary I believe was your recommendation to target this.). Do you happen to have benchmarks for your collections that I can use to compare against my implementation and choose based on usage?

In the above case, the reason I believe there is still some room for improvement inspite of the AVL tree backing is because the slowness of the immutable.Get seems to come from many layers of method delegations as compared to the Dictionary's FindEntry. In the Immutable version, get_Item calls into TryGetValue, which then delegate to a static version of TryGetValue and the top two layers before the static call have very little logic, but they took more than 25% of the CPU time. Then, the static TryGetValue alone took more than 25% of the CPU time, while the overall time inside the two TryGetValue it calls is less than 15%.

The majority of the overhead is in function calls themselves, perhaps removing extra delagations or asking Compiler/JIT to do more method inlining is a viable alternative and would not involve any changes to the AVL tree itself.

stephentoub commented 3 years ago

@arkalyanms, I took a look at your benchmark, and noted you're running against netcoreapp2.1... that's old news :smile: Can you try something newer? Here's what I see for GetValue / 1.

Method Runtime Size Mean
Dictionary .NET 5.0 1 4.629 ns
ImmutableDictionary .NET 5.0 1 14.435 ns
Dictionary .NET Core 3.1 1 4.743 ns
ImmutableDictionary .NET Core 3.1 1 17.289 ns
Dictionary .NET Core 2.1 1 5.496 ns
ImmutableDictionary .NET Core 2.1 1 41.314 ns
arunchndr commented 3 years ago

Ah nice! So there is a lot of goodness I am missing out on. I’ll update and circle back. Thanks @stephentoub

arunchndr commented 3 years ago

Alright, here are the net5.0 numbers. The delta is not as wide anymore (about 3-4x for Get and AddRange varies drastically based on data). For the cases that has data subject to change, I am inclined to switch over to a more custom ImmutableDictionary implementation I have or the TreeDictionary and to Dictionary if immutability is not required.

I am not sure if we still want to hold onto this bug to investigate some smaller oddities like the GetValue at size 1 (the increase in times with size shows there is some level of tree traversal cost, but there is also an additional overhead cost in ImmutableDictionary, much lower than in netcore2.1 but it's still there)

GetValue: Method Size Mean
Dictionary 1 5.065 ns
ImmutableDictionary 1 15.744 ns
Dictionary 10 4.333 ns
ImmutableDictionary 10 18.325 ns
Dictionary 100 4.413 ns
ImmutableDictionary 100 20.234 ns
Dictionary 1000 4.637 ns
ImmutableDictionary 1000 20.448 ns

CreateRange (Dictionary with constructor/DictionaryCreateRange and iterate over range and add/DictionaryWithCreateRange)

Method InitialSize AddSize Mean Allocated
DictionaryCreateRange 0 1 65.02 ns 80 B
DictionaryWithCreateRange 0 1 157.04 ns 216 B
**ImmutableDictionary 0 1 167.61 ns 136 B**
DictionaryCreateRange 0 10 38.08 ns 80 B
DictionaryWithCreateRange 0 10 369.04 ns 992 B
**ImmutableDictionary 0 10 1,408.63 ns 712 B**
DictionaryCreateRange 0 100 35.19 ns 80 B
DictionaryWithCreateRange 0 100 2,987.94 ns 10192 B
**ImmutableDictionary 0 100 33,447.37 ns 6472 B**
DictionaryCreateRange 0 1000 43.81 ns 80 B
DictionaryWithCreateRange 0 1000 27,114.84 ns 102216 B
**ImmutableDictionary 0 1000 463,872.26 ns 64072 B**
DictionaryCreateRange 1 1 133.85 ns 248 B
DictionaryWithCreateRange 1 1 154.85 ns 248 B
**ImmutableDictionary 1 1 262.30 ns 200 B**
DictionaryCreateRange 1 10 87.02 ns 248 B
DictionaryWithCreateRange 1 10 320.02 ns 1024 B
**ImmutableDictionary 1 10 1,626.24 ns 776 B**
DictionaryCreateRange 1 100 76.13 ns 248 B
DictionaryWithCreateRange 1 100 2,473.32 ns 10224 B
**ImmutableDictionary 1 100 22,671.73 ns 6536 B**
DictionaryCreateRange 1 1000 75.21 ns 248 B
DictionaryWithCreateRange 1 1000 24,255.32 ns 102248 B
**ImmutableDictionary 1 1000 379,588.42 ns 64136 B**
DictionaryCreateRange 10 1 236.46 ns 472 B
DictionaryWithCreateRange 10 1 281.06 ns 472 B
**ImmutableDictionary 10 1 296.67 ns 328 B**
DictionaryCreateRange 10 10 232.14 ns 472 B
DictionaryWithCreateRange 10 10 502.75 ns 1168 B
**ImmutableDictionary 10 10 2,035.24 ns 904 B**
DictionaryCreateRange 10 100 228.71 ns 472 B
DictionaryWithCreateRange 10 100 2,796.15 ns 12328 B
**ImmutableDictionary 10 100 23,789.28 ns 6664 B**
DictionaryCreateRange 10 1000 228.99 ns 472 B
DictionaryWithCreateRange 10 1000 18,852.06 ns 57904 B
**ImmutableDictionary 10 1000 372,247.07 ns 64264 B**
DictionaryCreateRange 100 1 1,692.04 ns 3160 B
DictionaryWithCreateRange 100 1 2,014.89 ns 3160 B
**ImmutableDictionary 100 1 550.51 ns 584 B**
DictionaryCreateRange 100 10 1,688.81 ns 3160 B
DictionaryWithCreateRange 100 10 2,816.81 ns 9904 B
**ImmutableDictionary 100 10 2,819.03 ns 1160 B**
DictionaryCreateRange 100 100 1,723.77 ns 3160 B
DictionaryWithCreateRange 100 100 3,806.65 ns 9904 B
**ImmutableDictionary 100 100 28,910.46 ns 6920 B**
DictionaryCreateRange 100 1000 1,671.09 ns 3160 B
DictionaryWithCreateRange 100 1000 19,124.78 ns 55480 B
**ImmutableDictionary 100 1000 379,376.33 ns 64520 B**
DictionaryCreateRange 1000 1 15,935.85 ns 31048 B
DictionaryWithCreateRange 1000 1 19,318.62 ns 31048 B
**ImmutableDictionary 1000 1 687.31 ns 776 B**
DictionaryCreateRange 1000 10 15,958.08 ns 31048 B
DictionaryWithCreateRange 1000 10 18,415.86 ns 31048 B
**ImmutableDictionary 1000 10 3,686.45 ns 1352 B**
DictionaryCreateRange 1000 100 15,956.17 ns 31048 B
DictionaryWithCreateRange 1000 100 19,223.19 ns 31048 B
**ImmutableDictionary 1000 100 37,251.51 ns 7112 B**
DictionaryCreateRange 1000 1000 15,991.76 ns 31048 B
DictionaryWithCreateRange 1000 1000 35,218.55 ns 96424 B
**ImmutableDictionary 1000 1000 433,827.04 ns 64712 B**
eiriktsarpalis commented 3 years ago

Honestly these numbers don't seem particularly surprising to me. There have been proposals like the one in #14477 for replacing the AVL tree implementation with something faster.

eiriktsarpalis commented 3 years ago

I'm going to close this issue, since it is not very actionable. I would recommend continuing the conversation in #14477 on potential perf improvements to the internal implementation of immutable hashtables.