ImmutableDictionary has slowly been regressing and in version 5.0 is about 10X slower than Dictionary

arunchndr commented 3 years ago

I am on the CPS team in Visual Studio and are benchmarking bottlenecks that affect solution load and a big contributor to the slowdown is Immutable collections. Instead of switching out the collection I started benchmarking across versions of the ImmutableDictionary for example and there is a steady regression going all the way back to version 1.5 of the dictionary

Yes, I understand Immutable collections will be slower, but there are clear indications of a perf bug here, especially when you look at the GetValue numbers.

Some benchmark numbers below:

GetValue: Method	Size	Mean
Dictionary	1	6ns
ImmutableDictionary	1	48ns
Dictionary	10	6ns
ImmutableDictionary	10	50ns
Dictionary	100	6ns
ImmutableDictionary	100	58ns
Dictionary	1000	6ns
ImmutableDictionary	1000	58ns
Dictionary	10000	6ns
ImmutableDictionary	10000	77ns

AddRange (SetRange is pretty similar) Method	InitialSize	AddSize	Mean

Dictionary	10	1	9ns
ImmutableDictionary	10	1	570ns
Dictionary	10	10	212ns
ImmutableDictionary	10	10	3559ns
Dictionary	10	100	2412ns
ImmutableDictionary	10	100	41525ns
Dictionary	100	1	9ns
ImmutableDictionary	100	1	1040ns
Dictionary	100	10	920ns
ImmutableDictionary	100	10	5113ns
Dictionary	100	100	1810ns
ImmutableDictionary	100	100	49665ns
Dictionary	1000	1	9ns
ImmutableDictionary	1000	1	1392ns
Dictionary	1000	100	820ns
ImmutableDictionary	1000	100	61750ns

Benchmark repo here - https://devdiv.visualstudio.com/DefaultCollection/Personal/_git/arkalyan

Adding more thoughts from folks who have observed a similar slowness: The slowness of the immutable.Get actually comes from many layers of method delegations.

In the Dictionary<TKey, TValue>, get_Item is almost a single method (only calls FindEntry), TryGetValue has a similar but separated implementation.

On the other side, in the Immutable version, get_Item calls into TryGetValue, which then delegate to a static version of TryGetValue. The top two layers have very little logic, but they took more than 25% of the CPU time.

Then, the static TryGetValue alone took more than 25% of the CPU time, while the overall time inside two TryGetValue it calls is less than 15%. I think due to the core of this logic is so small, the majority overhead is function calls themselves. Maybe one thing to try is to push more functions inline, but either removing some extra delegating, or ask Compiler/JIT to do more method inlining?

http://index/?leftProject=System.Collections.Immutable&leftSymbol=hghocbfkyomt&file=System%5CCollections%5CImmutable%5CImmutableDictionary_2.cs

dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 3 years ago

Tagging subscribers to this area: @eiriktsarpalis See info in area-owners.md if you want to be subscribed.

Issue Details

I am on the CPS team in Visual Studio and are benchmarking bottlenecks that [affect solution load](https://dev.azure.com/devdiv/DevDiv/_wiki/wikis/DevDiv.wiki/19818/Speed-up-netcore-solution-load) and a big contributor to the slowdown is Immutable collections. Instead of switching out the collection I started benchmarking across versions of the ImmutableDictionary for example and there is a steady regression going all the way back to version 1.5 of the dictionary Yes, I understand Immutable collections will be slower, but there are clear indications of a perf bug here, especially when you look at the GetValue numbers. Some benchmark numbers below: **GetValue:** Method | Size | Mean -- | -- | -- Dictionary | 1 | 6ns ImmutableDictionary | 1 | 48ns Dictionary | 10 | 6ns ImmutableDictionary | 10 | 50ns Dictionary | 100 | 6ns ImmutableDictionary | 100 | 58ns Dictionary | 1000 | 6ns ImmutableDictionary | 1000 | 58ns Dictionary | 10000 | 6ns ImmutableDictionary | 10000 | 77ns **AddRange (SetRange is pretty similar)** Method | InitialSize | AddSize | Mean -- | -- | -- | -- | | | Dictionary | 10 | 1 | 9ns ImmutableDictionary | 10 | 1 | 570ns Dictionary | 10 | 10 | 212ns ImmutableDictionary | 10 | 10 | 3559ns Dictionary | 10 | 100 | 2412ns ImmutableDictionary | 10 | 100 | 41525ns Dictionary | 100 | 1 | 9ns ImmutableDictionary | 100 | 1 | 1040ns Dictionary | 100 | 10 | 920ns ImmutableDictionary | 100 | 10 | 5113ns Dictionary | 100 | 100 | 1810ns ImmutableDictionary | 100 | 100 | 49665ns Dictionary | 1000 | 1 | 9ns ImmutableDictionary | 1000 | 1 | 1392ns Dictionary | 1000 | 100 | 820ns ImmutableDictionary | 1000 | 100 | 61750ns Benchmark repo here - https://devdiv.visualstudio.com/DefaultCollection/Personal/_git/arkalyan

Author:	arkalyanms
Assignees:	-
Labels:	`area-System.Collections`, `untriaged`
Milestone:	-

stephentoub commented 3 years ago

especially when you look at the GetValue numbers

Dictionary uses a hash map and lookup is O(1). ImmutableDictionary uses a tree and lookup is O(log N).

there is a steady regression going all the way back to version 1.5 of the dictionary

You mean the same operation with the same inputs has been getting slower and slower with each version of the nuget package?

stephentoub commented 3 years ago

cc: @adamsitnik

arunchndr commented 3 years ago

Dictionary uses a hash map and lookup is O(1). ImmutableDictionary uses a tree and lookup is O(log N)

I understand, but lets compare the Dictionary and the Tree(ImmutableDictionary) states with 1 value loaded in them, so the tree doesn't have to look beyond the root. It still compared as 6ns vs 48ns respectively.

You mean the same operation with the same inputs has been getting slower and slower with each version of the nuget package?

Yes, I picked version 1.5 to run the same benchmarks and values were quite a bit lower. Below is a perf trace comparison of the 2 versions with similar data:

Version 5.0: Name	Exc %	Exc	Inc %	Inc	When	First	Last
\|\| \| \|+ system.collections.immutable!System.Collections.Immutable.ImmutableDictionary`2[System.Int32,System.__Canon].get_Item(!0)	9.8	749	71.6	5,469	_18888888787878888878877877761__	260.353	7,053.892
**\|\| \| \|\|+ system.collections.immutable!System.Collections.Immutable.ImmutableDictionary`2[System.Int32,System.__Canon].TryGetValue(!0,!1&)	16.3	1,246	61.1	4,668	_17777677677667667676766666651__	260.353	7,053.892**
\|\| \| \|\|\|+ system.collections.immutable!System.Collections.Immutable.ImmutableDictionary`2[System.Int32,System.__Canon].TryGetValue(!0,value class MutationInput,!1&)	22.8	1,739	38.7	2,953	_24444444444444444444444444431__	263.329	7,053.892
\|\| \| \|\|\|\|+ system.collections.immutable!System.Collections.Immutable.SortedInt32KeyNode`1[System.Collections.Immutable.ImmutableDictionary`2+HashBucket[System.Int32,System.__Canon]].TryGetValue(int32,!0&)	5.9	450	7.4	566	_00011000100000000000000100100__	274.230	7,045.933
\|\| \| \|\|\|\|\|+ coreclr!?	1.5	112	1.5	113	_0000000_00oo00_0o0oo00000o00o__	325.751	7,015.911
\|\| \| \|\|\|\|\|\|+ ntoskrnl!?	0.0	1	0.0	1	_____o__	3,192.499	3,193.499
\|\| \| \|\|\|\|\|+ ntoskrnl!?	0.0	2	0.0	3	____o_oo	4,883.337	6,700.499
\|\| \| \|\|\|\|+ system.collections.immutable!System.Collections.Immutable.ImmutableDictionary`2+HashBucket[System.Int32,System.__Canon].TryGetValue(!0,class Comparers,!1&)	5.9	453	7.0	531	_00000000000000010000000000000__	263.329	7,050.929

Version 1.5

Name	Exc %	Exc	Inc %	Inc	When	First	Last
\|\| \| \| + BenchmarkDotNet!Engine.RunIteration	0.0	0	65.2	7,523	_099999899989999999999980___	1,337.406	9,566.603
\|\| \| \| + 709f9c6b-ef0d-4179-9d68-e29fe7348f34!Runnable_1.WorkloadActionUnroll	2.2	251	65.1	7,513	_099999899989999998999980___	1,337.406	9,566.603
\|\| \| \| \|+ system.collections.immutable!System.SR..cctor()	0.8	93	58.9	6,801	_o88888888888888888888870___	1,337.406	9,566.603
\|\| \| \| \|\|+ system.collections.immutable!SecureObjectPool.NewId	8.4	975	58.1	6,708	_088888788888888887878870___	1,337.406	9,566.603
**\|\| \| \| \|+ immutabledictionary.benchmark!GetValue.ImmutableDictionary	2.8	326	2.8	326	_00000000000000000000000o___	1,351.878	9,533.597**
\|\| \| \| \|+ coreclr!?	0.8	97	0.8	97	_o00_00o0000000000oo000o____	1,355.877	9,397.425

arunchndr commented 3 years ago

Admission time: Only caught the fact that 1.5 was faster accidentally because benchmarkdotnet defaulted to it and it did not match the CPS numbers. Then switched the version to 5 that matches internal usage and the above 10x differences are based off of that. :)

sharwell commented 3 years ago

It's not going to be possible to have a meaningful improvement to the read access performance of ImmutableDictionary<TKey, TValue> while it's backed by an AVL tree (i.e. there will be a significant performance difference in read-heavy scenarios between Dictionary<TKey, TValue> and ImmutableDictionary<TKey, TValue>). We recently implemented ImmutableSegmentedDictionary<TKey, TValue> with the intent of allowing CPS to migrate away from the AVL-tree based collections. If mutations to the immutable collection are frequent and/or incremental (non-batched), eventually ImmutableTreeDictionary<TKey, TValue> should provide a viable replacement but the current performance in read-heavy scenarios is only marginally better than ImmutableDictionary<TKey, TValue>.

arunchndr commented 3 years ago

Thanks @sharwell It looks like there is some overlap with my FastImmutableDictionary implementation. My goal was to cover (a. Continue using ImmutableDictionary, but try to speed up with bug fixes like this one b. ImmutableDictionary that should not accept writes after it's been built (ImmutableSegmentedDictionary seems to achieve the same goal) c. Replace with a read-only collection instead since immutability is not really leveraged (ImmutableArrayDictionary I believe was your recommendation to target this.). Do you happen to have benchmarks for your collections that I can use to compare against my implementation and choose based on usage?

In the above case, the reason I believe there is still some room for improvement inspite of the AVL tree backing is because the slowness of the immutable.Get seems to come from many layers of method delegations as compared to the Dictionary's FindEntry. In the Immutable version, get_Item calls into TryGetValue, which then delegate to a static version of TryGetValue and the top two layers before the static call have very little logic, but they took more than 25% of the CPU time. Then, the static TryGetValue alone took more than 25% of the CPU time, while the overall time inside the two TryGetValue it calls is less than 15%.

The majority of the overhead is in function calls themselves, perhaps removing extra delagations or asking Compiler/JIT to do more method inlining is a viable alternative and would not involve any changes to the AVL tree itself.

stephentoub commented 3 years ago

@arkalyanms, I took a look at your benchmark, and noted you're running against netcoreapp2.1... that's old news :smile: Can you try something newer? Here's what I see for GetValue / 1.

Method	Runtime	Size	Mean
Dictionary	.NET 5.0	1	4.629 ns
ImmutableDictionary	.NET 5.0	1	14.435 ns
Dictionary	.NET Core 3.1	1	4.743 ns
ImmutableDictionary	.NET Core 3.1	1	17.289 ns
Dictionary	.NET Core 2.1	1	5.496 ns
ImmutableDictionary	.NET Core 2.1	1	41.314 ns

arunchndr commented 3 years ago

Ah nice! So there is a lot of goodness I am missing out on. I’ll update and circle back. Thanks @stephentoub

arunchndr commented 3 years ago

Alright, here are the net5.0 numbers. The delta is not as wide anymore (about 3-4x for Get and AddRange varies drastically based on data). For the cases that has data subject to change, I am inclined to switch over to a more custom ImmutableDictionary implementation I have or the TreeDictionary and to Dictionary if immutability is not required.

I am not sure if we still want to hold onto this bug to investigate some smaller oddities like the GetValue at size 1 (the increase in times with size shows there is some level of tree traversal cost, but there is also an additional overhead cost in ImmutableDictionary, much lower than in netcore2.1 but it's still there)

GetValue: Method	Size	Mean
Dictionary	1	5.065 ns
ImmutableDictionary	1	15.744 ns
Dictionary	10	4.333 ns
ImmutableDictionary	10	18.325 ns
Dictionary	100	4.413 ns
ImmutableDictionary	100	20.234 ns
Dictionary	1000	4.637 ns
ImmutableDictionary	1000	20.448 ns

CreateRange (Dictionary with constructor/DictionaryCreateRange and iterate over range and add/DictionaryWithCreateRange)

Method	InitialSize	AddSize	Mean	Allocated
DictionaryCreateRange	0	1	65.02 ns	80 B
DictionaryWithCreateRange	0	1	157.04 ns	216 B
**ImmutableDictionary	0	1	167.61 ns	136 B**
DictionaryCreateRange	0	10	38.08 ns	80 B
DictionaryWithCreateRange	0	10	369.04 ns	992 B
**ImmutableDictionary	0	10	1,408.63 ns	712 B**
DictionaryCreateRange	0	100	35.19 ns	80 B
DictionaryWithCreateRange	0	100	2,987.94 ns	10192 B
**ImmutableDictionary	0	100	33,447.37 ns	6472 B**
DictionaryCreateRange	0	1000	43.81 ns	80 B
DictionaryWithCreateRange	0	1000	27,114.84 ns	102216 B
**ImmutableDictionary	0	1000	463,872.26 ns	64072 B**
DictionaryCreateRange	1	1	133.85 ns	248 B
DictionaryWithCreateRange	1	1	154.85 ns	248 B
**ImmutableDictionary	1	1	262.30 ns	200 B**
DictionaryCreateRange	1	10	87.02 ns	248 B
DictionaryWithCreateRange	1	10	320.02 ns	1024 B
**ImmutableDictionary	1	10	1,626.24 ns	776 B**
DictionaryCreateRange	1	100	76.13 ns	248 B
DictionaryWithCreateRange	1	100	2,473.32 ns	10224 B
**ImmutableDictionary	1	100	22,671.73 ns	6536 B**
DictionaryCreateRange	1	1000	75.21 ns	248 B
DictionaryWithCreateRange	1	1000	24,255.32 ns	102248 B
**ImmutableDictionary	1	1000	379,588.42 ns	64136 B**
DictionaryCreateRange	10	1	236.46 ns	472 B
DictionaryWithCreateRange	10	1	281.06 ns	472 B
**ImmutableDictionary	10	1	296.67 ns	328 B**
DictionaryCreateRange	10	10	232.14 ns	472 B
DictionaryWithCreateRange	10	10	502.75 ns	1168 B
**ImmutableDictionary	10	10	2,035.24 ns	904 B**
DictionaryCreateRange	10	100	228.71 ns	472 B
DictionaryWithCreateRange	10	100	2,796.15 ns	12328 B
**ImmutableDictionary	10	100	23,789.28 ns	6664 B**
DictionaryCreateRange	10	1000	228.99 ns	472 B
DictionaryWithCreateRange	10	1000	18,852.06 ns	57904 B
**ImmutableDictionary	10	1000	372,247.07 ns	64264 B**
DictionaryCreateRange	100	1	1,692.04 ns	3160 B
DictionaryWithCreateRange	100	1	2,014.89 ns	3160 B
**ImmutableDictionary	100	1	550.51 ns	584 B**
DictionaryCreateRange	100	10	1,688.81 ns	3160 B
DictionaryWithCreateRange	100	10	2,816.81 ns	9904 B
**ImmutableDictionary	100	10	2,819.03 ns	1160 B**
DictionaryCreateRange	100	100	1,723.77 ns	3160 B
DictionaryWithCreateRange	100	100	3,806.65 ns	9904 B
**ImmutableDictionary	100	100	28,910.46 ns	6920 B**
DictionaryCreateRange	100	1000	1,671.09 ns	3160 B
DictionaryWithCreateRange	100	1000	19,124.78 ns	55480 B
**ImmutableDictionary	100	1000	379,376.33 ns	64520 B**
DictionaryCreateRange	1000	1	15,935.85 ns	31048 B
DictionaryWithCreateRange	1000	1	19,318.62 ns	31048 B
**ImmutableDictionary	1000	1	687.31 ns	776 B**
DictionaryCreateRange	1000	10	15,958.08 ns	31048 B
DictionaryWithCreateRange	1000	10	18,415.86 ns	31048 B
**ImmutableDictionary	1000	10	3,686.45 ns	1352 B**
DictionaryCreateRange	1000	100	15,956.17 ns	31048 B
DictionaryWithCreateRange	1000	100	19,223.19 ns	31048 B
**ImmutableDictionary	1000	100	37,251.51 ns	7112 B**
DictionaryCreateRange	1000	1000	15,991.76 ns	31048 B
DictionaryWithCreateRange	1000	1000	35,218.55 ns	96424 B
**ImmutableDictionary	1000	1000	433,827.04 ns	64712 B**

eiriktsarpalis commented 3 years ago

Honestly these numbers don't seem particularly surprising to me. There have been proposals like the one in #14477 for replacing the AVL tree implementation with something faster.

eiriktsarpalis commented 3 years ago

I'm going to close this issue, since it is not very actionable. I would recommend continuing the conversation in #14477 on potential perf improvements to the internal implementation of immutable hashtables.

dotnet / runtime

ImmutableDictionary has slowly been regressing and in version 5.0 is about 10X slower than Dictionary #47812