Do you have information about the retained memory for this type? I'd like to compare it to the numbers in dotnet/runtime#14477.
Retained memory (storage density) is one of the primary concerns we've encountered when using the immutable collections.
I am not familiar with the term retained memory. Can you clarify by any chance?
As far as storage density goes, it really depends on the hashes of the objects.
You can think of each non-leaf node as containing an array of up to 32 entries. However, a node normally uses an exactly-sized array until it holds 16 entries, at which point it resizes to the full 32, and it shrinks back once the number of non-null entries drops to 8. As a result, the collection wastes very little space on empty buckets, and most get operations are at most 6 node hops away in the worst case.
If you are talking about the overhead of the node objects, I would also like to think it's minimal. However, for the exact numbers for each type of node, I will need to spend some time calculating them.
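For readers unfamiliar with the layout, the 32-way branching comes from consuming the 32-bit hash code five bits at a time, one chunk per trie level, which also bounds the depth (and hence the hop count). A minimal sketch of the indexing, with hypothetical names:

```csharp
// Sketch only: a HAMT consumes the hash five bits per level, so a 32-bit hash
// supports at most ceil(32 / 5) = 7 levels; `shift` is 0 at the root and
// grows by 5 at each level down.
static int BucketIndex(int hash, int shift)
{
    return (hash >> shift) & 0x1F; // 0..31: which of the node's 32 buckets to follow
}
```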
@AArnott
I did calculate memory usage for a few scenarios as well. Memory size of a <int, int> map with 10,000 entries:
Collection | Bytes |
---|---|
ImmutableDictionary | 560,200 |
ImmutableTrieDictionary | 476,192 |
HashMap | 164,664 |
10,000 entries, with a snapshot kept every 1,000 adds:
Collection | Bytes |
---|---|
ImmutableDictionary | 1,763,944 |
ImmutableTrieDictionary | 1,101,664 |
HashMap | 831,520 |
10,000 entries, with a snapshot kept after every add:
Collection | Bytes |
---|---|
ImmutableDictionary | 7,871,992 |
ImmutableTrieDictionary | 8,117,056 |
HashMap | 7,791,488 |
I was a bit surprised that ImmutableTrieDictionary only used slightly more here, since a HAMT can copy several arrays of up to length 32 for each modification... but I guess the much smaller depth of the tree makes up for that. Keeping a reference to every single modified collection ends up with about the same memory usage overall.
In my opinion, a lot more testing with various key/value types would be needed to be sure that replacing the current ImmutableDictionary implementation is a good idea - I only benchmarked <int, int> because I wanted to measure, as much as possible, the overhead of the structure alone. One of the projects I am working on does use immutable structures extensively, so I may try dropping in a replacement and see what performance/memory impact it has.
@dbolin If you want to compare allocation performance with B+-trees, you can try using https://github.com/tunnelvisionlabs/dotnet-trees. `ImmutableTreeList<T>` should perform nearly optimally for allocations, and `ImmutableTreeDictionary<TKey, TValue>` should approach ideal as the collection grows in size (the implementation strategy of wrappers means three flyweight objects are created during each mutation that would be avoidable in an optimal implementation).
The use of wrappers was a means to an end so the API could be functional given the limited time available to work on the library. I'd like to focus on getting a correct and high-performing implementation of `ImmutableTreeList<T>` completed, and then work on eliminating the wrappers.
@dbolin Your benchmark might not be comprehensive enough (I am not saying mine is). The problem I see is that you used Int32 as the key, so your hash codes will only range from 0 to the benchmark parameter. This might not be representative of real-world use, where people use objects as keys.
GetHashCode of Int32: https://referencesource.microsoft.com/#mscorlib/system/int32.cs,86
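For context on why that matters: per the reference source linked above, Int32.GetHashCode simply returns the value itself, so integer keys in 0..N occupy only the low bits of the hash, and only the first few 5-bit chunks of the trie ever vary:

```csharp
using System;

// Int32 hashes to itself, so the hash distribution mirrors the key distribution.
Console.WriteLine(42.GetHashCode());     // prints 42
Console.WriteLine(10_000.GetHashCode()); // prints 10000
```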
Updated benchmarks to include ImmutableTreeDictionary from TunnelVisionLabs
I also changed the integer keys to be randomized, and as expected this didn't make much difference, since hash code distributions should be uniform anyway. Also added some benchmarks using String keys.
Memory usage of ImmutableTreeDictionary was 310,760 bytes with 10,000 integer keys, 1,040,128 bytes with snapshots at each 1,000 keys, and 8,103,744 bytes with snapshots at every key added.
I didn't try anything with ImmutableTreeList yet, since I assume the API will be a bit different from the dictionaries.
It's pretty cool to see others taking an interest in HAMTs and benchmarking other implementations. I think corefx should really look into HAMTs to improve the performance of their ImmutableDictionary, or at least publish some guidelines/requirements for a replacement implementation.
@dbolin Your implementation would not work on non-x86 platforms because of the use of System.Runtime.Intrinsics.X86, but that should be an easy fix. I am more curious to know the improvement it makes. Btw, I didn't know they were adding intrinsics in .NET. That's pretty cool :)
@stevenguh Yeah, it just needs a fallback for when the intrinsic is not supported. Without the intrinsic, lookups (which are the most affected) are between 10-25% slower depending on the collection size.
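A portable fallback might look like the following sketch (a hypothetical helper, not the actual code from the repo): guard on `Popcnt.IsSupported` and fall back to a SWAR bit count.

```csharp
using System.Runtime.Intrinsics.X86;

static class Bits
{
    // Population count with a software fallback for platforms without POPCNT.
    internal static int PopCount(uint v)
    {
        if (Popcnt.IsSupported)
            return (int)Popcnt.PopCount(v);

        // Classic SWAR fallback: sums bits in parallel within the word.
        v -= (v >> 1) & 0x55555555u;
        v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
        return (int)((((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24);
    }
}
```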
You can think of each non-leaf node as containing an array of up to 32 entries. However, a node normally uses an exactly-sized array until it holds 16 entries, at which point it resizes to the full 32, and it shrinks back once the number of non-null entries drops to 8. As a result, the collection wastes very little space on empty buckets, and most get operations are at most 6 node hops away in the worst case.
This sounds good. I haven't studied your implementation but at a high level I wonder why you would ever allocate an array that is larger than necessary. Having slack in your array doesn't permit you to add elements later, does it? Considering the collection is immutable, adding an element to an existing array would potentially violate immutability.
Another concern I have is whether your implementation suffers from retaining strong references to its elements. If you are sharing arrays across many versions of the collection, how do you know when to scrub a reference from an array slot? Suppose I remove or replace an element from the collection and get a new collection instance, and I release my reference to the old collection. Does the new collection reference an array that still references the element I removed (even if it's inaccessible outside)? If so, that would be a problem as it would be a memory leak.
I'd also like to see benchmarks for the Builder case. IIRC the current corefx implementation is completely garbage free when mutating builders (a bunch of adds or a bunch of removes). Is the same true with this proposed implementation? It may not need to be exactly true, but I'd like to know how much garbage is produced so we can weigh its significance against the perf wins here.
Memory usage of ImmutableTreeDictionary was 310,760 bytes with 10,000 integer keys
This is the main highlight of the collections I was working on. I'm not especially surprised to see it running slower or allocating more total data during mutations; much of that is a side effect of using a wrapper implementation strategy to prototype the collection and show that total retained memory is lower than alternatives.
and 8,103,744 bytes with snapshots at every key added
On 64-bit, the wrappers would be directly responsible for 480,000 bytes of this, so once the wrappers are removed it would come down to 7,623,744 bytes.
I'd also like to see benchmarks for the Builder case. IIRC the current corefx implementation is completely garbage free when mutating builders (a bunch of adds or a bunch of removes). Is the same true with this proposed implementation? It may not need to be exactly true, but I'd like to know how much garbage is produced so we can weigh its significance against the perf wins here.
This seems to be true for adding but not removing - adding 10k <int,int> pairs allocates 560,080 bytes, and the resulting object is 560,200 bytes, whereas removing those allocates 386,048 bytes and the result is 152 bytes.
And here are benchmarks for the various builder implementations adding/removing respectively:
Method | Count | Mean | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|
ImmutableDictionary | 5 | 698.3 ns | 1.00 | 0.0223 | - | - | 360 B |
ImmutableTrieDictionary | 5 | 473.2 ns | 0.67 | 0.0441 | - | - | 664 B |
ImmutableTreeDictionary | 5 | 849.0 ns | 1.22 | 0.0478 | - | - | 776 B |
ApexHashMap | 5 | 204.7 ns | 0.29 | 0.0251 | - | - | 392 B |
ImmutableDictionary | 100 | 26,709.6 ns | 1.00 | 0.3141 | - | - | 5680 B |
ImmutableTrieDictionary | 100 | 9,775.9 ns | 0.37 | 0.6612 | - | - | 10056 B |
ImmutableTreeDictionary | 100 | 66,471.5 ns | 2.49 | 0.5291 | - | - | 9800 B |
ApexHashMap | 100 | 6,855.7 ns | 0.26 | 0.9507 | - | - | 14520 B |
ImmutableDictionary | 10000 | 5,518,456.4 ns | 1.00 | 22.2222 | - | - | 560080 B |
ImmutableTrieDictionary | 10000 | 1,351,410.6 ns | 0.24 | 26.0417 | 10.4167 | - | 1053152 B |
ImmutableTreeDictionary | 10000 | 18,271,681.9 ns | 3.31 | - | - | - | 950184 B |
ApexHashMap | 10000 | 1,027,264.7 ns | 0.19 | 75.8929 | 35.7143 | - | 1143144 B |
Method | Count | Mean | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|
ImmutableDictionary | 5 | 334.9 ns | 1.00 | 0.0094 | - | - | 152 B |
ImmutableTrieDictionary | 5 | 327.5 ns | 0.98 | 0.0224 | - | - | 360 B |
ImmutableTreeDictionary | 5 | 777.5 ns | 2.32 | 0.0388 | - | - | 616 B |
ApexHashMap | 5 | 187.1 ns | 0.56 | 0.0214 | - | - | 328 B |
ImmutableDictionary | 100 | 17,055.2 ns | 1.00 | 0.2714 | - | - | 4128 B |
ImmutableTrieDictionary | 100 | 7,222.5 ns | 0.42 | 0.2031 | - | - | 3360 B |
ImmutableTreeDictionary | 100 | 58,910.0 ns | 3.45 | 0.2341 | - | - | 6728 B |
ApexHashMap | 100 | 5,362.5 ns | 0.31 | 0.4952 | - | - | 7696 B |
ImmutableDictionary | 10000 | 4,834,737.9 ns | 1.00 | 15.6250 | - | - | 386048 B |
ImmutableTrieDictionary | 10000 | 1,000,926.0 ns | 0.21 | 15.6250 | - | - | 252584 B |
ImmutableTreeDictionary | 10000 | 16,224,479.1 ns | 3.36 | - | - | - | 605160 B |
ApexHashMap | 10000 | 854,110.4 ns | 0.18 | 41.6667 | 13.8889 | - | 654320 B |
For the case of adding, the total allocated memory is about double corefx, and 75% of that is garbage.
@dbolin To clarify, we're interested in the following sequence:

Scenario 1: Create a builder, and add a bunch of items to it.
Scenario 2: Call ToImmutable.

Expected outcome:
@sharwell Assuming you mean a builder starting from empty, I'm very confident that it works like you expect. With a fixed branching factor there's never a need to reallocate mutable nodes. With HAMTs, the branching factor is anywhere from 2 to 32, so the array of branches must be reallocated whenever that factor changes. I'm sure the builder can be optimized to do better than mine, but I'd expect it to always produce a significant amount of garbage relative to the final size of the set.
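To make the source of that garbage concrete, here is a minimal sketch (a hypothetical helper, not code from any of the libraries discussed): a compressed HAMT node keeps an exactly-sized child array, so raising its branching factor means allocating a larger array and abandoning the old one, even inside a builder.

```csharp
using System;

static class HamtBuilderSketch
{
    // Inserting a child into an exactly-sized array forces a reallocation;
    // the old array immediately becomes garbage.
    internal static object[] InsertAt(object[] children, int index, object child)
    {
        var larger = new object[children.Length + 1];
        Array.Copy(children, 0, larger, 0, index);
        larger[index] = child;
        Array.Copy(children, index, larger, index + 1, children.Length - index);
        return larger;
    }
}
```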
This sounds good. I haven't studied your implementation but at a high level I wonder why you would ever allocate an array that is larger than necessary. Having slack in your array doesn't permit you to add elements later, does it? Considering the collection is immutable, adding an element to an existing array would potentially violate immutability.
There are a few reasons, depending on what is being optimized. If we have a full-size node (HashArrayMapNode) with a full-size array (length 32 -- branching factor of 32) in each node, then we don't have to use PopCount to find the index of an item. However, each mutation will need to create another full array. On the other hand, if all the nodes are 'compressed' nodes (BitmapIndexedNode) with variable-length arrays, then we need to use PopCount to locate the bucket on get/set/remove.
To answer your question: correct, having slack in the array does not allow you to add elements later, except in the builder case, so it doesn't violate immutability. For an item to be updated, all the nodes on the path to the leaf (value node) need to be copied.
In the builder case, we ensure the node is "owned" by the builder, so we know that node has already been copied and can be mutated safely. (Continued below.)
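To make the two node shapes concrete, here is a minimal sketch of how lookup differs between them (hypothetical types, reusing the `Bits.PopCount` helper sketched earlier; the actual BitmapIndexedNode/HashArrayMapNode differ):

```csharp
// A 'full' node indexes its 32-slot array directly with the 5-bit hash chunk;
// a 'compressed' node keeps a bitmap and a packed array, using popcount to
// translate the chunk into a packed index.
sealed class FullNode
{
    public object[] Children = new object[32];

    public object Get(int hash, int shift)
        => Children[(hash >> shift) & 0x1F]; // direct index; empty buckets are null
}

sealed class CompressedNode
{
    public uint Bitmap;       // bit i set => bucket i is present
    public object[] Children; // holds only the present buckets, in bucket order

    public object Get(int hash, int shift)
    {
        int bit = 1 << ((hash >> shift) & 0x1F);
        if ((Bitmap & (uint)bit) == 0)
            return null; // bucket absent
        // The packed index is the number of present buckets below this one.
        int index = Bits.PopCount(Bitmap & (uint)(bit - 1));
        return Children[index];
    }
}
```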
Another concern I have is whether your implementation suffers from retaining strong references to its elements. If you are sharing arrays across many versions of the collection, how do you know when to scrub a reference from an array slot? Suppose I remove or replace an element from the collection and get a new collection instance, and I release my reference to the old collection. Does the new collection reference an array that still references the element I removed (even if it's inaccessible outside)? If so, that would be a problem as it would be a memory leak.
The implementation does not suffer from retaining strong references. Whenever you update the collection, all the nodes traversed from the root of the trie down to the value node are copied, and the rest are left unchanged. Suppose you remove an element and then release your reference to the old collection: the part of the trie still used by the new collection is retained, while the nodes that were superseded by the mutation become unreachable and are garbage collected.
Take the following diagram: `root` is the initial trie, and `root'` is the trie after an update of the value stored in node `c`. To make the change, we need to copy the path to that value node. If we then release the `root` reference, the part of the trie reachable only from it (`root`, `a`, and `c`) will be garbage collected.
```
+------+            +-------+
| root |            | root' |
+------+            +-------+
  |  |     +---+      |  |
  |  +-----| b |------+  |
  |        +---+         |
+---+                 +----+
| a |                 | a' |
+---+                 +----+
  |                      |
+---+                 +----+
| c |                 | c' |
+---+                 +----+
```
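The same idea in code, as a sketch built on the hypothetical `CompressedNode` above: an update clones only the nodes along one path, and the new root shares every untouched subtree with the old one.

```csharp
// Persistent (copy-on-write) update: clone the path, share the rest.
// For simplicity this treats the bottom 5-bit hash chunk as the leaf level.
static CompressedNode UpdatePath(CompressedNode node, int hash, int shift, object newLeaf)
{
    int bit = 1 << ((hash >> shift) & 0x1F);
    int index = Bits.PopCount(node.Bitmap & (uint)(bit - 1));

    var children = (object[])node.Children.Clone(); // copy this node's array only
    children[index] = shift + 5 >= 32
        ? newLeaf // reached the last 5-bit chunk
        : UpdatePath((CompressedNode)node.Children[index], hash, shift + 5, newLeaf);

    return new CompressedNode { Bitmap = node.Bitmap, Children = children };
}
```

Releasing the old root then leaves only the superseded path nodes unreachable, which is why the new collection cannot retain references to removed elements.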
I'd also like to see benchmarks for the Builder case. IIRC the current corefx implementation is completely garbage free when mutating builders (a bunch of adds or a bunch of removes). Is the same true with this proposed implementation? It may not need to be exactly true, but I'd like to know how much garbage is produced so we can weigh its significance against the perf wins here.
It isn't garbage free, as you can see in the benchmarks. To explain how the builder works, I will use the example below.
First, we create a trie that contains a single key-value pair, (k, v1).

At step (1), we create `builder_a` from `root` by calling `ToBuilder` and update the value of key `k` to `v2`. The `ToBuilder` operation is O(1) and does not copy any nodes. The update operation then traverses to node `c` to find key `k`, copying all the nodes along the path and tagging them.

At step (2), we update the value of key `k` again, to `v3`. The update operation traverses to node `c'` to find key `k` and updates the value without copying the nodes on the path, because every node on the path is already 'owned' by `builder_a`.

At step (3), we create another builder from `root`, `builder_b`, and update key `k` to `v4`. This time we need to copy all the nodes on the path again in order to mutate.
```
+-------------(3)----------------------------------------------+
|                                                               |
|                                                               v
+------+          +-----------+          +-----------+    +-----------+
| root | --(1)--> | builder_a | --(2)--> | builder_a |    | builder_b |
+------+          +-----------+          +-----------+    +-----------+
  |  |    +---+     |  |                   |  |             |  |
  |  +----| b |-----+-~|~------------------+-~|~------------+  |
  |       +---+        |                      |                |
+---+               +----+                 +----+           +----+
| a |               | a' |                 | a' |           | a" |
+---+               +----+                 +----+           +----+
  |                    |                      |                |
+---+               +----+                 +----+           +----+
| c |->(k, v1)      | c' |->(k, v2)        | c' |->(k, v3)  | c" |->(k, v4)
+---+               +----+                 +----+           +----+
```
Therefore, a mutation must copy every node along the path that is not already 'owned' by the builder.
Addition is a bit more complicated. As I mentioned before, there are roughly two types of node -- 'compressed' (BitmapIndexedNode) and 'full' (HashArrayMapNode). When adding in a builder, we use compressed nodes at first to save memory; however, each addition to such a node requires copying its array to grow it, unless there is slack left over from a previous delete performed by the same builder. Once the number of items in a compressed node exceeds 16, we expand it to a full-size node by copying the array over. When adding to a full-size node, we do not need to copy the array if that node is 'owned' by the builder.
Deletion from a 'compressed' or 'full' node that is 'owned' by the builder does not create a new array. However, when the number of items in a 'full' node drops to 8, we copy the array back into a 'compressed' node.
In general, we try to balance the number of empty buckets in a 'full' node against the cost of locating buckets via popcount in a 'compressed' node.
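A minimal sketch of the ownership scheme described above (hypothetical names; the real builder differs):

```csharp
// A builder carries an owner token; any node it has already copied is tagged
// with that token, so subsequent edits through the same builder mutate in place.
sealed class MutableNode
{
    public object Owner;       // null for frozen (immutable) nodes
    public object[] Children;

    public MutableNode EnsureEditable(object owner)
    {
        if (ReferenceEquals(Owner, owner))
            return this; // already copied by this builder; safe to mutate
        return new MutableNode
        {
            Owner = owner, // tag the copy so later edits skip the copying
            Children = (object[])Children.Clone(),
        };
    }
}
```

Under this scheme, freezing can be O(1): the builder simply discards its owner token, so every node it owned becomes read-only from that point on, and the next mutation through the builder starts copying again.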
Also, there are some additional facts about the implementation of the builder which I think are consistent with the current API, and which I think should also be requirements for a replacement implementation going forward:

- `ToBuilder` and `ToImmutable` are O(1)
- `ToImmutable` should not mutate the immutable dict
- `ToBuilder` should not mutate the builder

In closing, I think we shouldn't benchmark with the use of intrinsics, because .NET Core is designed to be cross-platform. The baseline should be measured without any intrinsics: if the user happens to be on a platform with the right instruction, great, they get an extra boost. But we shouldn't use that as our benchmark, since intrinsics aren't available on all platforms.
While I was looking into ways to optimize, I found a great paper on optimizing HAMTs for the JVM. I think we should evaluate that as well. I will try to implement it and see whether it brings a significant improvement in memory usage and iteration speed. I haven't read the whole paper yet, but it seems very promising.
Paper: Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections (it also provides a better write-up of a regular HAMT than I did): https://michael.steindorfer.name/publications/oopsla15.pdf
The author's PhD thesis on efficient immutable collections: https://michael.steindorfer.name/publications/phd-thesis-efficient-immutable-collections.pdf
Implementation from the author in Java: https://github.com/usethesource/capsule
Kotlin uses CHAMP in their implementation of immutable data structures: https://github.com/Kotlin/kotlinx.collections.immutable/issues/6
Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.
This process is part of the experimental issue cleanup initiative we are currently trialing in a limited number of areas. Please share any feedback you might have in the linked issue.
This issue will now be closed since it had been marked "no recent activity" but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that it will be locked if it remains inactive for another 30 days.
I've implemented a variant of a hash array mapped trie and found that it performs around 2x faster than the current ImmutableDictionary with comparable memory usage. I am wondering if we can use it to replace the current implementation.
Here are some of the benchmarks I ran on my computer:
DictionaryAdd
DictionaryRemove
DictionarySetExisting