Do you have information about the retained memory for this type? I'd like to compare it to the numbers in dotnet/runtime#14477.
Retained memory (storage density) is one of the primary concerns we've encountered when using the immutable collections.
I am not familiar with the term retained memory. Can you clarify by any chance?
As far as storage density goes, it really depends on the hashes of the objects.
You can think of each non-leaf node as containing an array of up to 32 entries. However, a node normally uses an exactly-sized array until it holds 16 entries, at which point it resizes to the full 32, and it shrinks back once the number of non-null entries drops to 8. As a result, the collection wastes very little space on empty buckets, and most get operations are at most 6 node hops away in the worst case.
If you are talking about the overhead of the node objects, I would also like to think it's minimal. However, for the exact numbers for each type of node, I will need to spend some time calculating them.
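For readers unfamiliar with the layout, the 32-way branching comes from consuming the 32-bit hash code five bits at a time, one chunk per trie level, which also bounds the depth (and hence the hop count). A minimal sketch of the indexing, with hypothetical names:

```csharp
// Sketch only: a HAMT consumes the hash five bits per level, so a 32-bit hash
// supports at most ceil(32 / 5) = 7 levels; `shift` is 0 at the root and
// grows by 5 at each level down.
static int BucketIndex(int hash, int shift)
{
    return (hash >> shift) & 0x1F; // 0..31: which of the node's 32 buckets to follow
}
```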
@AArnott
I did calculate memory usage for a few scenarios as well. Memory size of a <int, int> map with 10,000 entries:
Collection | Bytes |
---|---|
ImmutableDictionary | 560,200 |
ImmutableTrieDictionary | 476,192 |
HashMap | 164,664 |
10,000 entries, with a snapshot kept every 1,000 adds:
Collection | Bytes |
---|---|
ImmutableDictionary | 1,763,944 |
ImmutableTrieDictionary | 1,101,664 |
HashMap | 831,520 |
10,000 entries, with a snapshot kept after every add:
Collection | Bytes |
---|---|
ImmutableDictionary | 7,871,992 |
ImmutableTrieDictionary | 8,117,056 |
HashMap | 7,791,488 |
I was a bit surprised that ImmutableTrieDictionary only used slightly more here, since a HAMT can copy several arrays of up to length 32 for each modification... but I guess the much smaller depth of the tree makes up for that. Keeping a reference to every single modified collection ends up with about the same memory usage overall.
In my opinion, a lot more testing with various key/value types would be needed to be sure that replacing the current ImmutableDictionary implementation is a good idea - I only benchmarked <int, int> because I wanted to measure, as much as possible, the overhead of the structure alone. One of the projects I am working on does use immutable structures extensively, so I may try dropping in a replacement and see what performance/memory impact it has.
@dbolin If you want to compare allocation performance with B+-trees, you can try using https://github.com/tunnelvisionlabs/dotnet-trees. `ImmutableTreeList<T>` should perform nearly optimally for allocations, and `ImmutableTreeDictionary<TKey, TValue>` should approach ideal as the collection grows in size (the implementation strategy of wrappers means three flyweight objects are created during each mutation that would be avoidable in an optimal implementation).
The use of wrappers was a means to an end so the API could be functional given the limited time available to work on the library. I'd like to focus on getting a correct and high-performing implementation of `ImmutableTreeList<T>` completed, and then work on eliminating the wrappers.
@dbolin Your benchmark might not be comprehensive enough (I am not saying mine is). The problem I see is that you used Int32 as the key, so your hash codes will only range from 0 to the benchmark parameter. This might not be representative of real-world use, where people use objects as keys.
GetHashCode of Int32: https://referencesource.microsoft.com/#mscorlib/system/int32.cs,86
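For context on why that matters: per the reference source linked above, Int32.GetHashCode simply returns the value itself, so integer keys in 0..N occupy only the low bits of the hash, and only the first few 5-bit chunks of the trie ever vary:

```csharp
using System;

// Int32 hashes to itself, so the hash distribution mirrors the key distribution.
Console.WriteLine(42.GetHashCode());     // prints 42
Console.WriteLine(10_000.GetHashCode()); // prints 10000
```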
Updated benchmarks to include ImmutableTreeDictionary from TunnelVisionLabs
I also changed the integer keys to be randomized, and as expected this didn't make much difference, since hash code distributions should be uniform anyway. Also added some benchmarks using String keys.
Memory usage of ImmutableTreeDictionary was 310,760 bytes with 10,000 integer keys, 1,040,128 bytes with snapshots at each 1,000 keys, and 8,103,744 bytes with snapshots at every key added.
I didn't try anything with ImmutableTreeList yet, since I assume the API will be a bit different from the dictionaries.
It's pretty cool to see others taking an interest in HAMTs and benchmarking other implementations. I think corefx should really look into HAMTs to improve the performance of their ImmutableDictionary, or at least publish some guidelines/requirements for a replacement implementation.
@dbolin Your implementation would not work on non-x86 platforms because of the use of System.Runtime.Intrinsics.X86, but that should be an easy fix. I am more curious to know the improvement it makes. Btw, I didn't know they were adding intrinsics in .NET. That's pretty cool :)
@stevenguh Yeah, it just needs a fallback for when the intrinsic is not supported. Without the intrinsic, lookups (which are the most affected) are between 10-25% slower depending on the collection size.
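A portable fallback might look like the following sketch (a hypothetical helper, not the actual code from the repo): guard on `Popcnt.IsSupported` and fall back to a SWAR bit count.

```csharp
using System.Runtime.Intrinsics.X86;

static class Bits
{
    // Population count with a software fallback for platforms without POPCNT.
    internal static int PopCount(uint v)
    {
        if (Popcnt.IsSupported)
            return (int)Popcnt.PopCount(v);

        // Classic SWAR fallback: sums bits in parallel within the word.
        v -= (v >> 1) & 0x55555555u;
        v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
        return (int)((((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24);
    }
}
```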
You can think of each non-leaf node as containing an array of up to 32 entries. However, a node normally uses an exactly-sized array until it holds 16 entries, at which point it resizes to the full 32, and it shrinks back once the number of non-null entries drops to 8. As a result, the collection wastes very little space on empty buckets, and most get operations are at most 6 node hops away in the worst case.
This sounds good. I haven't studied your implementation but at a high level I wonder why you would ever allocate an array that is larger than necessary. Having slack in your array doesn't permit you to add elements later, does it? Considering the collection is immutable, adding an element to an existing array would potentially violate immutability.
Another concern I have is whether your implementation suffers from retaining strong references to its elements. If you are sharing arrays across many versions of the collection, how do you know when to scrub a reference from an array slot? Suppose I remove or replace an element from the collection and get a new collection instance, and I release my reference to the old collection. Does the new collection reference an array that still references the element I removed (even if it's inaccessible outside)? If so, that would be a problem as it would be a memory leak.
I'd also like to see benchmarks for the Builder case. IIRC the current corefx implementation is completely garbage free when mutating builders (a bunch of adds or a bunch of removes). Is the same true with this proposed implementation? It may not need to be exactly true, but I'd like to know how much garbage is produced so we can weigh its significance against the perf wins here.
Memory usage of ImmutableTreeDictionary was 310,760 bytes with 10,000 integer keys
This is the main highlight of the collections I was working on. I'm not especially surprised to see it running slower or allocating more total data during mutations; much of that is a side effect of using a wrapper implementation strategy to prototype the collection and show that total retained memory is lower than alternatives.
and 8,103,744 bytes with snapshots at every key added
On 64-bit, the wrappers would be directly responsible for 480,000 bytes of this, so once the wrappers are removed it would come down to 7,623,744 bytes.
I'd also like to see benchmarks for the Builder case. IIRC the current corefx implementation is completely garbage free when mutating builders (a bunch of adds or a bunch of removes). Is the same true with this proposed implementation? It may not need to be exactly true, but I'd like to know how much garbage is produced so we can weigh its significance against the perf wins here.
This seems to be true for adding but not removing - adding 10k <int,int> pairs allocates 560,080 bytes, and the resulting object is 560,200 bytes, whereas removing those allocates 386,048 bytes and the result is 152 bytes.
And here are benchmarks for the various builder implementations adding/removing respectively:
Method | Count | Mean | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|
ImmutableDictionary | 5 | 698.3 ns | 1.00 | 0.0223 | - | - | 360 B |
ImmutableTrieDictionary | 5 | 473.2 ns | 0.67 | 0.0441 | - | - | 664 B |
ImmutableTreeDictionary | 5 | 849.0 ns | 1.22 | 0.0478 | - | - | 776 B |
ApexHashMap | 5 | 204.7 ns | 0.29 | 0.0251 | - | - | 392 B |
ImmutableDictionary | 100 | 26,709.6 ns | 1.00 | 0.3141 | - | - | 5680 B |
ImmutableTrieDictionary | 100 | 9,775.9 ns | 0.37 | 0.6612 | - | - | 10056 B |
ImmutableTreeDictionary | 100 | 66,471.5 ns | 2.49 | 0.5291 | - | - | 9800 B |
ApexHashMap | 100 | 6,855.7 ns | 0.26 | 0.9507 | - | - | 14520 B |
ImmutableDictionary | 10000 | 5,518,456.4 ns | 1.00 | 22.2222 | - | - | 560080 B |
ImmutableTrieDictionary | 10000 | 1,351,410.6 ns | 0.24 | 26.0417 | 10.4167 | - | 1053152 B |
ImmutableTreeDictionary | 10000 | 18,271,681.9 ns | 3.31 | - | - | - | 950184 B |
ApexHashMap | 10000 | 1,027,264.7 ns | 0.19 | 75.8929 | 35.7143 | - | 1143144 B |
Method | Count | Mean | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|
ImmutableDictionary | 5 | 334.9 ns | 1.00 | 0.0094 | - | - | 152 B |
ImmutableTrieDictionary | 5 | 327.5 ns | 0.98 | 0.0224 | - | - | 360 B |
ImmutableTreeDictionary | 5 | 777.5 ns | 2.32 | 0.0388 | - | - | 616 B |
ApexHashMap | 5 | 187.1 ns | 0.56 | 0.0214 | - | - | 328 B |
ImmutableDictionary | 100 | 17,055.2 ns | 1.00 | 0.2714 | - | - | 4128 B |
ImmutableTrieDictionary | 100 | 7,222.5 ns | 0.42 | 0.2031 | - | - | 3360 B |
ImmutableTreeDictionary | 100 | 58,910.0 ns | 3.45 | 0.2341 | - | - | 6728 B |
ApexHashMap | 100 | 5,362.5 ns | 0.31 | 0.4952 | - | - | 7696 B |
ImmutableDictionary | 10000 | 4,834,737.9 ns | 1.00 | 15.6250 | - | - | 386048 B |
ImmutableTrieDictionary | 10000 | 1,000,926.0 ns | 0.21 | 15.6250 | - | - | 252584 B |
ImmutableTreeDictionary | 10000 | 16,224,479.1 ns | 3.36 | - | - | - | 605160 B |
ApexHashMap | 10000 | 854,110.4 ns | 0.18 | 41.6667 | 13.8889 | - | 654320 B |
For the case of adding, the total allocated memory is about double corefx, and 75% of that is garbage.
@dbolin To clarify, we're interested in the following sequence:

Scenario 1: Create a builder, and add a bunch of items to it.
Scenario 2: Call ToImmutable.

Expected outcome:
@sharwell Assuming you mean a builder starting from empty, I'm very confident that it works like you expect. With a fixed branching factor there's never a need to reallocate mutable nodes. With HAMTs, the branching factor is anywhere from 2 to 32, so the array of branches must be reallocated whenever that factor changes. I'm sure the builder can be optimized to do better than mine, but I'd expect it to always produce a significant amount of garbage relative to the final size of the set.
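To make the source of that garbage concrete, here is a minimal sketch (a hypothetical helper, not code from any of the libraries discussed): a compressed HAMT node keeps an exactly-sized child array, so raising its branching factor means allocating a larger array and abandoning the old one, even inside a builder.

```csharp
using System;

static class HamtBuilderSketch
{
    // Inserting a child into an exactly-sized array forces a reallocation;
    // the old array immediately becomes garbage.
    internal static object[] InsertAt(object[] children, int index, object child)
    {
        var larger = new object[children.Length + 1];
        Array.Copy(children, 0, larger, 0, index);
        larger[index] = child;
        Array.Copy(children, index, larger, index + 1, children.Length - index);
        return larger;
    }
}
```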
This sounds good. I haven't studied your implementation but at a high level I wonder why you would ever allocate an array that is larger than necessary. Having slack in your array doesn't permit you to add elements later, does it? Considering the collection is immutable, adding an element to an existing array would potentially violate immutability.
There are a few reasons, depending on what is being optimized. If we have a full-size node (HashArrayMapNode) with a full-size array (length 32 -- branching factor of 32) in each node, then we don't have to use PopCount to find the index of an item. However, each mutation will need to create another full array. On the other hand, if all the nodes are 'compressed' nodes (BitmapIndexedNode) with variable-length arrays, then we need to use PopCount to locate the bucket on get/set/remove.
To answer your question: correct, having slack in the array does not allow you to add elements later, except in the builder case, so it doesn't violate immutability. For an item to be updated, all the nodes on the path to the leaf (value node) need to be copied.
In the builder case, we ensure the node is "owned" by the builder, so we know that node has already been copied and can be mutated safely. (Continued below.)
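To make the two node shapes concrete, here is a minimal sketch of how lookup differs between them (hypothetical types, reusing the `Bits.PopCount` helper sketched earlier; the actual BitmapIndexedNode/HashArrayMapNode differ):

```csharp
// A 'full' node indexes its 32-slot array directly with the 5-bit hash chunk;
// a 'compressed' node keeps a bitmap and a packed array, using popcount to
// translate the chunk into a packed index.
sealed class FullNode
{
    public object[] Children = new object[32];

    public object Get(int hash, int shift)
        => Children[(hash >> shift) & 0x1F]; // direct index; empty buckets are null
}

sealed class CompressedNode
{
    public uint Bitmap;       // bit i set => bucket i is present
    public object[] Children; // holds only the present buckets, in bucket order

    public object Get(int hash, int shift)
    {
        int bit = 1 << ((hash >> shift) & 0x1F);
        if ((Bitmap & (uint)bit) == 0)
            return null; // bucket absent
        // The packed index is the number of present buckets below this one.
        int index = Bits.PopCount(Bitmap & (uint)(bit - 1));
        return Children[index];
    }
}
```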
Another concern I have is whether your implementation suffers from retaining strong references to its elements. If you are sharing arrays across many versions of the collection, how do you know when to scrub a reference from an array slot? Suppose I remove or replace an element from the collection and get a new collection instance, and I release my reference to the old collection. Does the new collection reference an array that still references the element I removed (even if it's inaccessible outside)? If so, that would be a problem as it would be a memory leak.
The implementation does not suffer from retaining strong references. Whenever you update the collection, all the nodes traversed from the root of the trie down to the value node are copied, and the rest are left unchanged. Suppose you remove an element and then release your reference to the old collection: the part of the trie still used by the new collection is retained, while the nodes that were superseded by the mutation become unreachable and are garbage collected.
Take the following diagram: `root` is the initial trie, and `root'` is the trie after an update of the value stored in node `c`. To make the change, we need to copy the path to that value node. If we then release the `root` reference, the part of the trie reachable only from it (`root`, `a`, and `c`) will be garbage collected.
```
+------+            +-------+
| root |            | root' |
+------+            +-------+
  |  |     +---+      |  |
  |  +-----| b |------+  |
  |        +---+         |
+---+                 +----+
| a |                 | a' |
+---+                 +----+
  |                      |
+---+                 +----+
| c |                 | c' |
+---+                 +----+
```
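The same idea in code, as a sketch built on the hypothetical `CompressedNode` above: an update clones only the nodes along one path, and the new root shares every untouched subtree with the old one.

```csharp
// Persistent (copy-on-write) update: clone the path, share the rest.
// For simplicity this treats the bottom 5-bit hash chunk as the leaf level.
static CompressedNode UpdatePath(CompressedNode node, int hash, int shift, object newLeaf)
{
    int bit = 1 << ((hash >> shift) & 0x1F);
    int index = Bits.PopCount(node.Bitmap & (uint)(bit - 1));

    var children = (object[])node.Children.Clone(); // copy this node's array only
    children[index] = shift + 5 >= 32
        ? newLeaf // reached the last 5-bit chunk
        : UpdatePath((CompressedNode)node.Children[index], hash, shift + 5, newLeaf);

    return new CompressedNode { Bitmap = node.Bitmap, Children = children };
}
```

Releasing the old root then leaves only the superseded path nodes unreachable, which is why the new collection cannot retain references to removed elements.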
I'd also like to see benchmarks for the Builder case. IIRC the current corefx implementation is completely garbage free when mutating builders (a bunch of adds or a bunch of removes). Is the same true with this proposed implementation? It may not need to be exactly true, but I'd like to know how much garbage is produced so we can weigh its significance against the perf wins here.
It isn't garbage free, as you can see in the benchmarks. To explain how the builder works, I will use the example below.
First, we create a trie that contains a single key-value pair, (k, v1).

At step (1), we create `builder_a` from `root` by calling `ToBuilder` and update the value of key `k` to `v2`. The `ToBuilder` operation is O(1) and does not copy any nodes. The update operation then traverses to node `c` to find key `k`, copying all the nodes along the path and tagging them.

At step (2), we update the value of key `k` again, to `v3`. The update operation traverses to node `c'` to find key `k` and updates the value without copying the nodes on the path, because every node on the path is already 'owned' by `builder_a`.

At step (3), we create another builder from `root`, `builder_b`, and update key `k` to `v4`. This time we need to copy all the nodes on the path again in order to mutate.
```
+-------------(3)----------------------------------------------+
|                                                               |
|                                                               v
+------+          +-----------+          +-----------+    +-----------+
| root | --(1)--> | builder_a | --(2)--> | builder_a |    | builder_b |
+------+          +-----------+          +-----------+    +-----------+
  |  |    +---+     |  |                   |  |             |  |
  |  +----| b |-----+-~|~------------------+-~|~------------+  |
  |       +---+        |                      |                |
+---+               +----+                 +----+           +----+
| a |               | a' |                 | a' |           | a" |
+---+               +----+                 +----+           +----+
  |                    |                      |                |
+---+               +----+                 +----+           +----+
| c |->(k, v1)      | c' |->(k, v2)        | c' |->(k, v3)  | c" |->(k, v4)
+---+               +----+                 +----+           +----+
```
Therefore, a mutation must copy every node along the path that is not already 'owned' by the builder.
Addition is a bit more complicated. As I mentioned before, there are roughly two types of node -- 'compressed' (BitmapIndexedNode) and 'full' (HashArrayMapNode). When adding in a builder, we use compressed nodes at first to save memory; however, each addition to such a node requires copying its array to grow it, unless there is slack left over from a previous delete performed by the same builder. Once the number of items in a compressed node exceeds 16, we expand it to a full-size node by copying the array over. When adding to a full-size node, we do not need to copy the array if that node is 'owned' by the builder.
Deletion from a 'compressed' or 'full' node that is 'owned' by the builder does not create a new array. However, when the number of items in a 'full' node drops to 8, we copy the array back into a 'compressed' node.
In general, we try to balance the number of empty buckets in a 'full' node against the cost of locating buckets via popcount in a 'compressed' node.
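A minimal sketch of the ownership scheme described above (hypothetical names; the real builder differs):

```csharp
// A builder carries an owner token; any node it has already copied is tagged
// with that token, so subsequent edits through the same builder mutate in place.
sealed class MutableNode
{
    public object Owner;       // null for frozen (immutable) nodes
    public object[] Children;

    public MutableNode EnsureEditable(object owner)
    {
        if (ReferenceEquals(Owner, owner))
            return this; // already copied by this builder; safe to mutate
        return new MutableNode
        {
            Owner = owner, // tag the copy so later edits skip the copying
            Children = (object[])Children.Clone(),
        };
    }
}
```

Under this scheme, freezing can be O(1): the builder simply discards its owner token, so every node it owned becomes read-only from that point on, and the next mutation through the builder starts copying again.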
Also, there are some additional facts about the implementation of the builder which I think are consistent with the current API, and which I think should also be requirements for a replacement implementation going forward:

- `ToBuilder` and `ToImmutable` are O(1)
- `ToImmutable` should not mutate the immutable dict
- `ToBuilder` should not mutate the builder

In closing, I think we shouldn't benchmark with the use of intrinsics, because .NET Core is designed to be cross-platform. The baseline should be measured without any intrinsics: if the user happens to be on a platform with the right instruction, great, they get an extra boost. But we shouldn't use that as our benchmark, since intrinsics aren't available on all platforms.
While I was looking into ways to optimize, I found a great paper on optimizing HAMTs for the JVM. I think we should evaluate that as well. I will try to implement it and see whether it brings a significant improvement in memory usage and iteration speed. I haven't read the whole paper yet, but it seems very promising.
Paper: Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections (it also provides a better write-up of a regular HAMT than I did): https://michael.steindorfer.name/publications/oopsla15.pdf
The author's PhD thesis on efficient immutable collections: https://michael.steindorfer.name/publications/phd-thesis-efficient-immutable-collections.pdf
Implementation from the author in Java: https://github.com/usethesource/capsule
Kotlin uses CHAMP in their implementation of immutable data structures: https://github.com/Kotlin/kotlinx.collections.immutable/issues/6
Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.
This process is part of the experimental issue cleanup initiative we are currently trialing in a limited number of areas. Please share any feedback you might have in the linked issue.
This issue will now be closed since it had been marked "no recent activity" but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that it will be locked if it remains inactive for another 30 days.
I've implemented a variant of a hash array mapped trie and found that it performs around 2x faster than the current ImmutableDictionary with comparable memory usage. I am wondering if we can use it to replace the current implementation.
Here are some of the benchmarks I ran on my computer:
DictionaryAdd
DictionaryRemove
DictionarySetExisting