dotnet / runtime


[Proposal] Use hash array mapped trie for ImmutableDictionary #26001

Closed stevenguh closed 3 years ago

stevenguh commented 6 years ago

I've implemented a variant of a hash array mapped trie and found that it performs around 2x faster than the current ImmutableDictionary with comparable memory usage. I'm wondering if we can use it to replace the current implementation.

Here are some of the benchmarks I ran on my computer:


BenchmarkDotNet=v0.10.13, OS=macOS 10.13.4 (17E199) [Darwin 17.5.0]
Intel Core i7-4870HQ CPU 2.50GHz (Haswell), 1 CPU, 8 logical cores and 4 physical cores
.NET Core SDK=2.1.101
  [Host] : .NET Core 2.0.6 (CoreCLR 4.6.0.0, CoreFX 4.6.26212.01), 64bit RyuJIT
  Core   : .NET Core 2.0.6 (CoreCLR 4.6.0.0, CoreFX 4.6.26212.01), 64bit RyuJIT

Job=Core  Runtime=Core  

DictionaryAdd

| Method | N | Mean | Error | StdDev | Rank | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| ImmutableDictAdd | 1000 | 1,527.8 us | 9.983 us | 9.338 us | 3 | 128.9063 | 17.5781 | - | 796.73 KB |
| ImmutableTrieDictAdd | 1000 | 685.7 us | 4.602 us | 4.080 us | 2 | 121.0938 | 17.5781 | - | 748.79 KB |
| DictAdd | 1000 | 307.6 us | 2.727 us | 2.551 us | 1 | 23.9258 | 1.4648 | - | 148.36 KB |
| ImmutableDictAdd | 10000 | 21,820.2 us | 251.673 us | 235.415 us | 6 | 1656.2500 | 656.2500 | - | 10277.8 KB |
| ImmutableTrieDictAdd | 10000 | 11,117.2 us | 81.375 us | 72.137 us | 5 | 1562.5000 | 656.2500 | 78.1250 | 9673.99 KB |
| DictAdd | 10000 | 3,235.3 us | 25.886 us | 22.947 us | 4 | 273.4375 | 97.6563 | - | 1695.23 KB |

DictionaryRemove

| Method | N | Mean | Error | StdDev | Rank | Gen 0 | Gen 1 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| ImmutableDictRemove | 1000 | 1,277.7 us | 13.823 us | 11.543 us | 3 | 111.3281 | 5.8594 | 686.88 KB |
| ImmutableTrieDictRemove | 1000 | 676.5 us | 6.565 us | 5.820 us | 2 | 124.0234 | 5.8594 | 766.09 KB |
| DictRemove | 1000 | 280.7 us | 2.241 us | 2.096 us | 1 | 23.9258 | - | 148.36 KB |
| ImmutableDictRemove | 10000 | 18,026.1 us | 243.513 us | 227.782 us | 6 | 1468.7500 | 468.7500 | 9174.76 KB |
| ImmutableTrieDictRemove | 10000 | 8,180.5 us | 85.451 us | 79.931 us | 5 | 1515.6250 | 484.3750 | 9340.95 KB |
| DictRemove | 10000 | 2,818.9 us | 28.111 us | 26.295 us | 4 | 273.4375 | - | 1695.23 KB |

DictionarySetExisting

| Method | N | Mean | Error | StdDev | Rank | Gen 0 | Gen 1 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| ImmutableDictSet | 1000 | 1,568.4 us | 11.753 us | 10.994 us | 3 | 123.0469 | 15.6250 | 761.61 KB |
| ImmutableTrieDictSet | 1000 | 741.0 us | 3.764 us | 3.337 us | 2 | 144.5313 | 22.4609 | 892.46 KB |
| DictSet | 1000 | 350.3 us | 2.288 us | 2.140 us | 1 | 23.9258 | - | 148.36 KB |
| ImmutableDictSet | 10000 | 23,856.3 us | 211.640 us | 197.968 us | 6 | 1593.7500 | 718.7500 | 9935.11 KB |
| ImmutableTrieDictSet | 10000 | 11,601.9 us | 93.547 us | 87.504 us | 5 | 1671.8750 | 671.8750 | 10285.47 KB |
| DictSet | 10000 | 3,254.8 us | 24.538 us | 22.953 us | 4 | 273.4375 | - | 1695.23 KB |

sharwell commented 6 years ago

Do you have information about the retained memory for this type? I'd like to compare it to the numbers in dotnet/runtime#14477.

Retained memory (storage density) is one of the primary concerns we've encountered when using the immutable collections.

stevenguh commented 6 years ago

I'm not familiar with the term retained memory. Could you clarify?

As for storage density, it really depends on the hash codes of the objects.

You can think of each non-leaf node as containing an array of size 32. In practice, though, a node uses an exactly-sized array until it reaches 16 entries; it then resizes to the full 32 and stays that way until the number of non-null entries drops to 8. As a result, the collection wastes very little space on empty buckets, and most get operations take at most 6 node hops in the worst case.

If you're asking about the overhead of the node objects themselves, I'd like to think it's minimal as well. For the exact numbers for each type of node, though, I'll need to spend some time calculating them.
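
For reference, this is the standard HAMT indexing scheme being described (a minimal sketch; the constants match the description above, but the identifiers are illustrative rather than taken from the implementation):

```csharp
// Each trie level consumes 5 bits of the 32-bit hash code, giving a
// branching factor of 2^5 = 32 and a maximum depth of ceil(32 / 5) = 7
// levels; with a reasonable hash distribution most lookups terminate
// after far fewer hops.
const int BitsPerLevel = 5;
const int LevelMask = (1 << BitsPerLevel) - 1; // 0b11111

static int ChildIndex(int hash, int level) =>
    (hash >> (level * BitsPerLevel)) & LevelMask;
```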

danmoseley commented 6 years ago

@AArnott

dbolin commented 5 years ago

I have updated benchmarks of three hash array mapped trie implementations and ImmutableDictionary (HashMap is my own implementation).

I also calculated memory usage for a few scenarios. Memory size of a <int, int> map with 10,000 entries:

| Collection | Bytes |
|---|---:|
| ImmutableDictionary | 560,200 |
| ImmutableTrieDictionary | 476,192 |
| HashMap | 164,664 |

10,000 entries, with snapshots every 1,000 adds:

| Collection | Bytes |
|---|---:|
| ImmutableDictionary | 1,763,944 |
| ImmutableTrieDictionary | 1,101,664 |
| HashMap | 831,520 |

10,000 entries, with snapshots after every add:

| Collection | Bytes |
|---|---:|
| ImmutableDictionary | 7,871,992 |
| ImmutableTrieDictionary | 8,117,056 |
| HashMap | 7,791,488 |

I was a bit surprised to see ImmutableDictionary using more here, since a HAMT can copy several arrays of up to length 32 on each modification... but I guess the much smaller depth of the tree makes up for that. Keeping a reference to every single modified collection ends up at about the same memory usage overall.

In my opinion, a lot more testing with various key/value types would be needed to be sure that replacing the current ImmutableDictionary implementation is a good idea - I only benchmarked <int, int> because I wanted to measure the overhead of the structure itself as much as possible. One of the projects I'm working on uses immutable structures extensively, so I may try dropping in a replacement and see what performance/memory impact it has.

sharwell commented 5 years ago

@dbolin If you want to compare allocation performance with B+-trees, you can try https://github.com/tunnelvisionlabs/dotnet-trees. ImmutableTreeList<T> should perform nearly optimally for allocations, and ImmutableTreeDictionary<TKey, TValue> should approach ideal as the collection grows in size (the wrapper-based implementation strategy means three flyweight objects are created on each mutation that an optimal implementation would avoid).

The use of wrappers was a means to an end so the API could be functional given limited time available to work on the library. I'd like to focus on getting a correct and high-performing implementation of ImmutableTreeList<T> completed, and then work on eliminating the wrappers.

stevenguh commented 5 years ago

@dbolin Your benchmark might not be comprehensive enough (I'm not saying mine is). The problem I see is that you used Int32 as the key, so your hash codes only range from 0 to the benchmark parameter. That might not be representative of real-world use, where people use arbitrary objects as keys.

GetHashCode of Int32: https://referencesource.microsoft.com/#mscorlib/system/int32.cs,86
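
(The linked source shows Int32 hashing to itself, so sequential integer keys exercise only the low bits of the trie:)

```csharp
// Int32.GetHashCode() returns the value itself.
Console.WriteLine(42.GetHashCode());   // prints 42
Console.WriteLine((-7).GetHashCode()); // prints -7
```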

dbolin commented 5 years ago

Updated benchmarks to include ImmutableTreeDictionary from TunnelVisionLabs

I also changed the integer keys to be randomized, and as expected this didn't make much difference, since hash code distributions should be uniform anyway. Also added some benchmarks using String keys.

Memory usage of ImmutableTreeDictionary was 310,760 bytes with 10,000 integer keys, 1,040,128 bytes with snapshots at each 1,000 keys, and 8,103,744 bytes with snapshots at every key added.

I didn't try anything with ImmutableTreeList yet since I assume the API will be a bit different from the dictionaries.

stevenguh commented 5 years ago

It's pretty cool to see others taking an interest in HAMTs and benchmarking other implementations. I think corefx should really look into HAMTs to improve the performance of their ImmutableDictionary, or at least publish some guidelines/requirements for a replacement implementation.

stevenguh commented 5 years ago

@dbolin Your implementation would not work on non-x86 platforms because of the use of System.Runtime.Intrinsics.X86, but that should be an easy fix. I'm more curious to know how much improvement the intrinsic made. Btw, I didn't know they were adding intrinsics to .NET. That's pretty cool :)

dbolin commented 5 years ago

@stevenguh Yeah, it just needs a fallback for when the intrinsic is not supported. Without the intrinsic, lookups (which are the most affected) are between 10-25% slower depending on the collection size.
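
A plausible shape for that fallback (a sketch, not the actual code from the implementation; newer runtimes expose the same thing as System.Numerics.BitOperations.PopCount, which falls back automatically):

```csharp
using System.Runtime.Intrinsics.X86;

static int PopCount(uint value)
{
    // Use the hardware instruction when the platform supports it.
    if (Popcnt.IsSupported)
        return (int)Popcnt.PopCount(value);

    // Portable SWAR bit-count for everything else.
    value -= (value >> 1) & 0x5555_5555u;
    value = (value & 0x3333_3333u) + ((value >> 2) & 0x3333_3333u);
    return (int)((((value + (value >> 4)) & 0x0F0F_0F0Fu) * 0x0101_0101u) >> 24);
}
```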

AArnott commented 5 years ago

You can think of each non-leaf node as containing an array of size 32. In practice, though, a node uses an exactly-sized array until it reaches 16 entries; it then resizes to the full 32 and stays that way until the number of non-null entries drops to 8. As a result, the collection wastes very little space on empty buckets, and most get operations take at most 6 node hops in the worst case.

This sounds good. I haven't studied your implementation but at a high level I wonder why you would ever allocate an array that is larger than necessary. Having slack in your array doesn't permit you to add elements later, does it? Considering the collection is immutable, adding an element to an existing array would potentially violate immutability.

Another concern I have is whether your implementation suffers from retaining strong references to its elements. If you are sharing arrays across many versions of the collection, how do you know when to scrub a reference from an array slot? Suppose I remove or replace an element from the collection and get a new collection instance, and I release my reference to the old collection. Does the new collection reference an array that still references the element I removed (even if it's inaccessible outside)? If so, that would be a problem as it would be a memory leak.

I'd also like to see benchmarks for the Builder case. IIRC the current corefx implementation is completely garbage free when mutating builders (a bunch of adds or a bunch of removes). Is the same true with this proposed implementation? It may not need to be exactly true, but I'd like to know how much garbage is produced so we can weigh its significance against the perf wins here.

sharwell commented 5 years ago

Memory usage of ImmutableTreeDictionary was 310,760 bytes with 10,000 integer keys

This is the main highlight of the collections I was working on. I'm not especially surprised to see it running slower or allocating more total data during mutations; much of that is a side effect of using a wrapper implementation strategy to prototype the collection and show that total retained memory is lower than alternatives.

and 8,103,744 bytes with snapshots at every key added

On 64-bit, the wrappers would be directly responsible for 480,000 bytes of this, so once the wrappers are removed it would come down to 7,623,744 bytes.

dbolin commented 5 years ago

I'd also like to see benchmarks for the Builder case. IIRC the current corefx implementation is completely garbage free when mutating builders (a bunch of adds or a bunch of removes). Is the same true with this proposed implementation? It may not need to be exactly true, but I'd like to know how much garbage is produced so we can weigh its significance against the perf wins here.

This seems to be true for adding but not removing - adding 10k <int,int> pairs allocates 560,080 bytes, and the resulting object is 560,200 bytes, whereas removing those allocates 386,048 bytes and the result is 152 bytes.

And here are benchmarks for the various builder implementations adding/removing respectively:

Adding:

| Method | Count | Mean | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| ImmutableDictionary | 5 | 698.3 ns | 1.00 | 0.0223 | - | - | 360 B |
| ImmutableTrieDictionary | 5 | 473.2 ns | 0.67 | 0.0441 | - | - | 664 B |
| ImmutableTreeDictionary | 5 | 849.0 ns | 1.22 | 0.0478 | - | - | 776 B |
| ApexHashMap | 5 | 204.7 ns | 0.29 | 0.0251 | - | - | 392 B |
| ImmutableDictionary | 100 | 26,709.6 ns | 1.00 | 0.3141 | - | - | 5680 B |
| ImmutableTrieDictionary | 100 | 9,775.9 ns | 0.37 | 0.6612 | - | - | 10056 B |
| ImmutableTreeDictionary | 100 | 66,471.5 ns | 2.49 | 0.5291 | - | - | 9800 B |
| ApexHashMap | 100 | 6,855.7 ns | 0.26 | 0.9507 | - | - | 14520 B |
| ImmutableDictionary | 10000 | 5,518,456.4 ns | 1.00 | 22.2222 | - | - | 560080 B |
| ImmutableTrieDictionary | 10000 | 1,351,410.6 ns | 0.24 | 26.0417 | 10.4167 | - | 1053152 B |
| ImmutableTreeDictionary | 10000 | 18,271,681.9 ns | 3.31 | - | - | - | 950184 B |
| ApexHashMap | 10000 | 1,027,264.7 ns | 0.19 | 75.8929 | 35.7143 | - | 1143144 B |

Removing:

| Method | Count | Mean | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| ImmutableDictionary | 5 | 334.9 ns | 1.00 | 0.0094 | - | - | 152 B |
| ImmutableTrieDictionary | 5 | 327.5 ns | 0.98 | 0.0224 | - | - | 360 B |
| ImmutableTreeDictionary | 5 | 777.5 ns | 2.32 | 0.0388 | - | - | 616 B |
| ApexHashMap | 5 | 187.1 ns | 0.56 | 0.0214 | - | - | 328 B |
| ImmutableDictionary | 100 | 17,055.2 ns | 1.00 | 0.2714 | - | - | 4128 B |
| ImmutableTrieDictionary | 100 | 7,222.5 ns | 0.42 | 0.2031 | - | - | 3360 B |
| ImmutableTreeDictionary | 100 | 58,910.0 ns | 3.45 | 0.2341 | - | - | 6728 B |
| ApexHashMap | 100 | 5,362.5 ns | 0.31 | 0.4952 | - | - | 7696 B |
| ImmutableDictionary | 10000 | 4,834,737.9 ns | 1.00 | 15.6250 | - | - | 386048 B |
| ImmutableTrieDictionary | 10000 | 1,000,926.0 ns | 0.21 | 15.6250 | - | - | 252584 B |
| ImmutableTreeDictionary | 10000 | 16,224,479.1 ns | 3.36 | - | - | - | 605160 B |
| ApexHashMap | 10000 | 854,110.4 ns | 0.18 | 41.6667 | 13.8889 | - | 654320 B |

For the case of adding, the total allocated memory is about double that of corefx, and 75% of it is garbage.

sharwell commented 5 years ago

@dbolin To clarify, we're interested in the following two scenarios (sketched in code below).

Scenario 1:

Create a builder, and add a bunch of items to it

Scenario 2:

  1. Create a builder, and add a bunch of items to it
  2. Do not call ToImmutable
  3. Remove the items from the builder

Expected outcome:

  1. Scenario 1 should not produce garbage (allocation should match final working set)
  2. Scenario 2 should not produce garbage (total allocations should match that of scenario 1)
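
A minimal sketch of the two scenarios against the existing System.Collections.Immutable builder API (the allocation measurement itself is omitted):

```csharp
using System.Collections.Immutable;

var builder = ImmutableDictionary.CreateBuilder<int, int>();

// Scenario 1: add a bunch of items to a fresh builder.
for (int i = 0; i < 10_000; i++)
    builder.Add(i, i);

// Scenario 2: without ever calling ToImmutable(), remove them all.
// Every node is still owned by this builder, so ideally the removals
// allocate nothing beyond what scenario 1 already allocated.
for (int i = 0; i < 10_000; i++)
    builder.Remove(i);
```
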
dbolin commented 5 years ago

@sharwell Assuming you mean a builder starting from empty, I'm very confident that it works as you expect. With a fixed branching factor there's never a need to reallocate mutable nodes. With HAMTs, the branching factor is anywhere from 2 to 32, so the array of branches must be reallocated whenever that factor changes. I'm sure my builder can be optimized to do better, but I'd expect it to always produce a significant amount of garbage relative to the final size of the set.

stevenguh commented 5 years ago

This sounds good. I haven't studied your implementation but at a high level I wonder why you would ever allocate an array that is larger than necessary. Having slack in your array doesn't permit you to add elements later, does it? Considering the collection is immutable, adding an element to an existing array would potentially violate immutability.

There are a few reasons, depending on what is being optimized. If we have a full-size node (HashArrayMapNode) with a full-size array (length 32 -- a branching factor of 32) in each node, then we don't need PopCount to find the index of an item; however, each mutation has to create another full array. On the other hand, if all the nodes are 'compressed' nodes (BitmapIndexedNode) with variable-length arrays, then we need PopCount to locate the bucket on every get/set/remove.
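
For reference, a sketch of the PopCount-based lookup in a compressed node (the node shape here is an assumption for illustration, not the actual BitmapIndexedNode code):

```csharp
using System;
using System.Numerics;

// Compressed node: bit i of the bitmap is set iff branch i exists, and a
// branch's slot in the packed array is the number of set bits below its
// bit position (its rank), computed with a population count.
sealed class CompressedNode
{
    public uint Bitmap;
    public object[] Children = Array.Empty<object>(); // length == PopCount(Bitmap)

    public object? GetBranch(int hash, int level)
    {
        uint bit = 1u << ((hash >> (level * 5)) & 0x1F);
        if ((Bitmap & bit) == 0)
            return null; // branch not present
        int index = BitOperations.PopCount(Bitmap & (bit - 1));
        return Children[index];
    }
}
```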

To answer your question: yes, slack in the array does not let you add elements later, except in the builder case, so it doesn't violate immutability. For an item to be updated, all the nodes on the path to the leaf (value node) need to be copied.

In the builder case, we ensure the node is 'owned' by the builder, so we know it has already been copied and can be mutated safely. (Continued below.)

Another concern I have is whether your implementation suffers from retaining strong references to its elements. If you are sharing arrays across many versions of the collection, how do you know when to scrub a reference from an array slot? Suppose I remove or replace an element from the collection and get a new collection instance, and I release my reference to the old collection. Does the new collection reference an array that still references the element I removed (even if it's inaccessible outside)? If so, that would be a problem as it would be a memory leak.

The implementation does not suffer from retaining strong references. Whenever you update the collection, all the nodes traversed from the root of the trie to the value node are copied, and the rest are left untouched. Suppose you remove an element and release your reference to the old collection: part of the trie is retained because the new collection still uses it, but the nodes that were replaced during the mutation are released and garbage collected.

Take the following diagram: root is the initial trie, and root' is the trie after updating the value stored in node c. To make that change, we have to copy the path to the value node. If we then release the root reference, the replaced part of the trie (root, a, and c) will be garbage collected.

+------+         +-------+
| root |         | root' |
+------+         +-------+
  |   |   +---+    |   |
  |   +---| b |----+   |
+---+     +---+      +----+
| a |                | a' |
+---+                +----+
  |                    |
+---+                +----+
| c |                | c' |
+---+                +----+
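
In code, the path copy shown in the diagram might look like this (a recursive sketch under the same simplifications, using full-size nodes only):

```csharp
// One full-size trie node: 32 child slots, each holding a subnode,
// a leaf value, or null.
sealed class Node
{
    public object?[] Children = new object?[32];
}

// Persistent update: clone only the nodes on the root-to-leaf path.
// Every subtree off that path is shared by reference, so releasing the
// old root leaves just the replaced nodes (root, a, c) for the GC.
static Node SetLeaf(Node node, int level, int hash, object leaf)
{
    var copy = new Node { Children = (object?[])node.Children.Clone() };
    int index = (hash >> (level * 5)) & 0x1F;
    copy.Children[index] = node.Children[index] is Node child
        ? SetLeaf(child, level + 1, hash, leaf) // copy the next node on the path
        : leaf;                                 // reached the value slot
    return copy;                                // siblings remain shared
}
```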

I'd also like to see benchmarks for the Builder case. IIRC the current corefx implementation is completely garbage free when mutating builders (a bunch of adds or a bunch of removes). Is the same true with this proposed implementation? It may not need to be exactly true, but I'd like to know how much garbage is produced so we can weigh its significance against the perf wins here.

It wouldn't be garbage free, as you can see in the benchmarks. To explain how the builder works, I'll use the example below.

First, we create a trie containing the key/value pair (k, v1).

At step (1), we create builder_a from root by calling ToBuilder and update the value of key k to v2. The ToBuilder operation is O(1) and does not copy any nodes. The update operation then traverses to node c to find key k, copying all the nodes along the path and tagging them.

At step (2), we update the value of key k again, to v3. This time the update traverses to node c' to find key k and updates the value without copying the nodes on the path, because every node on the path is already 'owned' by builder_a.

At step (3), we create another builder, builder_b, from root and update key k to v4. This time all the nodes on the path must be copied again in order to mutate.

   +-------------(3)-------------------------------------------+  
   |                                                           |
   |                                                           v
+------+          +-----------+         +-----------+     +-----------+
| root | --(1)--> | builder_a | --(2)-->| builder_a |     | builder_b |
+------+          +-----------+         +-----------+     +-----------+
  |   |    +---+    |   |                 |  |               |  |
  |   +----| b |----+--~|~----------------+-~|~--------------+  |
  |        +---+        |                    |                  |
+---+                 +----+               +----+             +----+
| a |                 | a' |               | a' |             | a" |
+---+                 +----+               +----+             +----+
  |                     |                    |                  |
+---+                 +----+               +----+             +----+
| c |->(k, v1)        | c' |->(k, v2)      | c' |->(k, v3)    | c" |->(k, v4)
+---+                 +----+               +----+             +----+

In short, a mutation has to copy every node along the path whenever those nodes are not 'owned' by the builder.

Addition is a bit more complicated. As I mentioned before, there are roughly two types of nodes -- 'compressed' (BitmapIndexedNode) and 'full' (HashArrayMapNode). When adding in a builder, we use compressed nodes at first to save memory; however, each addition to such a node requires copying its array to expand it, unless there is space left over from a previous delete performed by the same builder. If the number of items in a compressed node exceeds 16, we expand it to full size by copying its array into a 'full' node. When an add targets a 'full' node, no array copy is needed if that node is 'owned' by the builder.

Deleting from a 'compressed' or 'full' node that is 'owned' by the builder does not create a new array. However, when the number of items in a 'full' node drops to 8, we copy the array back into a 'compressed' node.

In general, we try to balance the number of empty buckets in a 'full' node against the cost of locating entries via pop count in a 'compressed' node.
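
A sketch of the ownership check described above (the token and member names are assumptions; only the general technique comes from the description):

```csharp
using System;

// Each builder session gets a unique token; a node records the token of
// the builder that created it, or null once the trie is frozen/shared.
sealed class MutationOwner { }

sealed class BuilderNode
{
    public MutationOwner? Owner;
    public object?[] Children = Array.Empty<object?>();

    // Reuse the node if this builder already copied it in this session;
    // otherwise copy it once, tagging the copy with the builder's token.
    public BuilderNode EnsureEditable(MutationOwner owner)
    {
        if (ReferenceEquals(Owner, owner))
            return this; // already owned: mutate in place, no garbage
        return new BuilderNode
        {
            Owner = owner,
            Children = (object?[])Children.Clone(),
        };
    }
}
```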

Also, there are some additional facts about the builder implementation which I think are consistent with the current API -- and going forward, perhaps these should also be the requirements for a replacement implementation?

In closing, I think we shouldn't benchmark with intrinsics enabled, because .NET Core is designed to be cross-platform. The baseline should be measured without any intrinsics: if a user happens to be on a platform with the right instruction, great, they get an extra boost. But we shouldn't build our benchmark around that, since intrinsics aren't available on all platforms.

While looking into ways to optimize, I found a great paper on optimizing HAMTs for the JVM. I think we should evaluate that as well. I'll try to implement it and see whether it brings significant improvements in memory usage and iteration speed. I haven't read the whole paper yet, but it seems very promising.

Paper: Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections (it also provides a better write-up of a regular HAMT than I did): https://michael.steindorfer.name/publications/oopsla15.pdf

The author's PhD thesis on efficient immutable collections: https://michael.steindorfer.name/publications/phd-thesis-efficient-immutable-collections.pdf

Implementation from the author in Java: https://github.com/usethesource/capsule

Kotlin uses CHAMP in their implementation of immutable data structure: https://github.com/Kotlin/kotlinx.collections.immutable/issues/6

ghost commented 3 years ago

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of the experimental issue cleanup initiative we are currently trialing in a limited number of areas. Please share any feedback you might have in the linked issue.

ghost commented 3 years ago

This issue will now be closed since it was marked as having no recent activity and received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that it will be locked if it remains inactive for another 30 days.