Open sharwell opened 9 years ago
cc: @AArnott, @pharring
I'm not very familiar with B+-trees, but my first concern is whether they make it necessary to discard more of the tree than AVL trees do on mutations. If I'm reading it correctly, @sharwell addresses that concern and suggests it actually gets better (because the trees are shallower overall?). It sounds better for memory, and we know that n-ary trees in general have better lookup performance than AVL trees. So this sounds like a promising idea.
@AArnott You are reading that correctly. For "standard" collections, the value for b is generally chosen for best interaction with the garbage collector and processor caches (e.g. relatively large pages for great locality, but intentionally small enough to keep them out of the large object heap for most expected types). I had excellent results with b set to 128 or larger for this type of data structure.
In the case of Immutable Collections, I approached the analysis from the other side: minimizing the cost of mutations.
Great. Another concern is construction performance. When we had a 4-ary tree, construction was much slower.
Seems like a nice improvement to me. We've avoided ImmutableList
+1 :+1:
One other thing to keep tabs on is what it means for Builders.
Rust language uses B-tree for maps and sets :D
This is likely to make the mutators' execution times worse (Add, Insert, etc.) because of the increased footprint of the nodes to clone/alter. Does that bother you less than the overall memory consumption?
... because of the increased footprint of the nodes to clone/alter ...
@evilguest The proposal is explicitly based on the idea that this is not a problem. While each index node in the B+-tree is larger than a single node in the AVL tree, many fewer nodes need to be modified during a mutation operation. For the sample collection size of 100,000 elements, the single-mutation memory allocation reduction is 39% on X64 and 21% on X86. There may be a few collection sizes that do not benefit from this, but I'm not sure what those are right now.
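To make the intuition concrete, here is a rough back-of-the-envelope comparison of how many nodes a single mutation must clone (my own illustrative estimates, not figures from the proposal):

```csharp
// Rough path-copy comparison for n = 100,000 elements:
//   AVL tree:        height <= ~1.44 * log2(n) ~= 24 nodes cloned per mutation
//   B+-tree (b = 8): depth  =  ceil(log8(n))   =   6 nodes cloned per mutation
// Each B+-tree node is larger, but far fewer nodes sit on the
// root-to-leaf path that a persistent mutation has to copy.
```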
Thanks. I am experimenting in the same direction as https://github.com/tunnelvisionlabs/dotnet-trees, and the factory implementation of System.Collections.Immutable.ImmutableList is surprisingly hard to beat from a performance standpoint. E.g. ImmutableList
I am still benchmarking the modifications, but expect similar issues (as almost 50% of the work is finding the leaf to clone).
@evilguest Did you get to benchmarking your modifications?
@arkalyanms the biggest help would be a bunch of benchmark scenarios to establish baselines. I can give you an overview of the in-flight work tomorrow.
My benchmarking suite can be found at https://github.com/evilguest/atropos/tree/main/Atropos.Benchmarks
I am not yet happy with the List implementation because of both memory overhead and underperformance for the modification operations.
I'm potentially interested in working on this.
@madelson probably the best first step is to ensure that the scenarios we care about are fully covered in dotnet/performance, before making any changes here.
Certainly for elapsed time - @kunalspathak does the lab detect allocation regressions as well?
@kunalspathak does the lab detect allocation regressions as well?
No. There is a way to see the allocation regression, but our analysis is purely based on the execution time. @DrewScoggins
Any changes to these collections in this regard will require careful evaluation in Roslyn and all of its perf tests as well.
There are a few Roslyn benchmarks in dotnet/performance, but not much. We maybe should amp those up there. @jaredpar do you have benchmarks that would focus on the immutable collections?
It would be good to ask the VS Core Project team to evaluate impact to Visual Studio's CPS as well, as they use these heavily. That said, I'm super enthused at the prospect of improving perf by changing the implementation.
@madelson probably the best first step is to ensure that the scenarios we care about are fully covered in dotnet/performance, before making any changes here.
@danmoseley are there particular benchmarks to be added? Looking at this folder in the benchmarks repo it seems like some of the benchmarks incorporate the immutable collections but many do not. Is the thought to flesh that out or to create specific targeted benchmarks for the immutable collections?
@madelson right, there is not much coverage of the immutable collections there today. We have not changed the implementations much. If we're contemplating more radical changes, we first need to ensure that any interesting performance regression (one that would impact Roslyn would certainly qualify) would be detected by our set of benchmarks. So I'm suggesting perhaps some investment into those.
We maybe should amp those up there. @jaredpar do you have benchmarks that would focus on the immutable collections?
No we do not have those that we could migrate. But @sharwell generally handles performance for Roslyn overall so expect he'd be testing our key scenarios as a part of evaluating this change
Note that Roslyn is becoming a bit desensitized to this issue. Since many of our performance-sensitive areas construct a collection which is subsequently read but not mutated, we've been migrating to segmented collections (essentially a 2-level B+ tree with variable-sized index level and constant-count leaves). CPS and MSBuild are two other candidates for this performance testing though.
I'm going to leave this comment here to remind myself of a future API proposal if this issue makes it in.
The current proposal is for a fixed b, but there is value in a configurable b. Immutable collections are used in a broad range of applications, and allowing the user to choose the arity gives developers more flexibility.
Need faster read performance? Great, increase the arity to make scans and seeks faster. Need to update the collection a lot? Lower the arity to reduce copying and make updates more efficient.
As a side note, I would also love a scan API that would take advantage of the internal structure and allow block-based iteration. This shape might need some work to allow the action to stop the scan early.
public static class CollectionsMarshal
{
public static void Scan<TArg, T>(ImmutableList<T> list, TArg arg, ReadOnlySpanAction<T, TArg> action);
public static void ScanRange<TArg, T>(ImmutableList<T> list, TArg arg, Range range, ReadOnlySpanAction<T, TArg> action);
}
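For illustration, here is a hypothetical call to the proposed Scan, assuming the delegate follows the existing BCL ReadOnlySpanAction shape (span first, state second):

```csharp
using System.Collections.Immutable;
using System.Linq;
using System.Runtime.CompilerServices;

ImmutableList<int> list = ImmutableList.CreateRange(Enumerable.Range(1, 100));

// Sum the list by visiting whole leaf blocks instead of paying a
// tree descent per element through the indexer.
var box = new StrongBox<long>();
CollectionsMarshal.Scan(list, box, static (span, acc) =>
{
    foreach (int value in span)
        acc.Value += value;
});
// box.Value now holds the sum of all elements.
```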
To try it out, I made a prototype of a B+ tree ImmutableList<T> with a subset of operators and benchmarked it against the .NET 6 ImmutableList<T>. There's still a lot to be improved, but the results seem promising (for the ratios, lower is better; the current ImmutableList<T> is 1.0):
Benchmark | Time Ratio | Bytes Allocated Ratio |
---|---|---|
Indexer for loop 5 ints | 0.56 | n/a |
Indexer for loop 512 ints | 0.77 | n/a |
Indexer for loop 10K ints | 0.59 | n/a |
Indexer for loop 5 strings | 0.85 | n/a |
Indexer for loop 512 strings | 0.95 | n/a |
Indexer for loop 10K strings | 0.70 | n/a |
Add 5 ints | 0.35 | 0.47 |
Add 512 ints | 0.28 | 0.62 |
Add 10K ints | 0.22 | 0.65 |
Add 5 strings | 0.58 | 0.54 |
Add 512 strings | 0.41 | 0.60 |
Add 10K strings | 0.31 | 0.61 |
Insert 5 ints in random locations | 0.49 | 0.54 |
Insert 512 ints in random locations | 0.40 | 0.76 |
Insert 10K ints in random locations | 0.33 | 0.91 |
Insert 5 strings in random locations | 0.84 | 0.62 |
Insert 512 strings in random locations | 0.55 | 0.80 |
Insert 10K strings in random locations | 0.55 | 0.86 |
CreateRange 5 ints | 0.99 | 1.28 |
CreateRange 512 ints | 0.49 | 0.14 |
CreateRange 10K ints | 0.39 | 0.11 |
CreateRange 5 strings | 0.91 | 1.00 |
CreateRange 512 strings | 0.59 | 0.25 |
CreateRange 10K strings | 0.57 | 0.23 |
AddRange 50 ints to 50 ints | 0.66 | 0.33 |
AddRange 10K ints to 10K ints | 0.42 | 0.11 |
AddRange 50 strings to 50 strings | 0.62 | 0.33 |
AddRange 10K strings to 10K strings | 0.59 | 0.23 |
Benchmark info:
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1526 (21H1/May2021Update)
Intel Core i7-3632QM CPU 2.20GHz (Ivy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.200
[Host] : .NET 6.0.2 (6.0.222.6406), X64 RyuJIT
DefaultJob : .NET 6.0.2 (6.0.222.6406), X64 RyuJIT
Those numbers look great. Indexing and adding both look good. I'm interested in moving forward with this.
@AArnott if this is something I can contribute, how would you recommend going about the development?
For example, would it make sense for me to do a writeup of the proposed design to get feedback/alignment on that as a first step?
I'm also curious what the benchmarking/validation process should be. I did add some additional benchmarks to dotnet/performance (one set got merged, the other is still in PR), but I could imagine that for a change like this it could be desirable to have additional more detailed benchmarks for a side-by-side comparison (as I've posted above). There was also discussion of trying out the new approach in MSBuild. In addition, some way to run the benchmark suite on different operating systems seems important.
Good questions, @madelson. While I wrote the current implementation of the immutable collections, I don't typically work in this repo so I think we should wait for some other folks who accept PRs to this repo to provide the guidance you're looking for.
I don't see the Builder pattern included in your prototype. I'll be interested in seeing how you can incorporate that. The memory/perf characteristics of our current builder implementation are that creating a builder based on an immutable collection is O(1) in time and memory. Switching back from builder to immutable collection is O(n) in time and O(1) in memory, where n is the number of nodes actually changed while in Builder mode. Can we achieve that with your B+ tree?
I think we should wait for some other folks who accept PRs to this repo to provide the guidance you're looking for.
Makes sense.
I don't see the Builder pattern included in your prototype. I'll be interested in seeing how you can incorporate that.
I've thought about this a bit but haven't built anything. Obviously the fact that I'm using arrays as nodes prevents us from following the current pattern of having a "frozen" bool on each node.
The first step is representing mutability. For internal nodes, I believe we can use the highest-order bit of the CumulativeChildCount field (currently unused because all counts are non-negative) to indicate that that child is mutable. The builder object itself would additionally store a flag to indicate whether or not the root node is mutable. Therefore, as we traverse down the tree we always know whether or not the current node we're looking at is mutable based on context passed from the parent caller.
Using this marking system, we can create a new builder in O(1) time/memory and we can convert back to an immutable structure in-place by traversing the tree and freezing all mutable nodes we encounter (short-circuiting on already-frozen nodes) just like we do today.
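A minimal sketch of what that marking scheme could look like (the type and field names here are my assumptions, not the prototype's actual code):

```csharp
// Hypothetical internal node: child references plus cumulative counts.
// The sign bit of a count marks the corresponding child as mutable,
// which is valid because real counts are always non-negative.
internal sealed class IndexNode<T>
{
    public object?[] Children = new object?[8];
    public int[] CumulativeChildCounts = new int[8];

    public static bool IsChildMutable(int count) => count < 0;
    public static int GetCount(int count) => count & int.MaxValue;   // mask off the flag bit
    public static int MarkMutable(int count) => count | int.MinValue; // set the flag bit

    // Freezing for ToImmutable(): clear the flag bits in place, recursing
    // only into children that were mutable (short-circuiting on subtrees
    // that are already frozen).
    public void Freeze()
    {
        for (int i = 0; i < CumulativeChildCounts.Length; i++)
        {
            if (IsChildMutable(CumulativeChildCounts[i]))
            {
                CumulativeChildCounts[i] = GetCount(CumulativeChildCounts[i]);
                (Children[i] as IndexNode<T>)?.Freeze();
            }
        }
    }
}
```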
One design question with this approach is how much we should try to unify the builder and immutable methods. For example, the builder methods would always need to be masking out the first bit when reading CumulativeChildCount. If we share all of the code, we'd be taking on that extra overhead in the immutable case as well.
Another design question is whether the builder should attempt to use partially-full arrays to eliminate allocations. For example, in immutable world if we have a leaf of size 4 we represent it with just a 4-element array. Same thing goes for internal nodes. In builder world I see 2 options:
(1) We could continue this practice of using arrays that are just the right size, which means arrays are replaced whenever they "grow"
(2) We could instead adopt an approach where we double the size of the array any time it runs out of space (à la List<T>), thus leaving room to "grow" the array in-place.
With approach (1), we end up allocating a new leaf node for every builder add/insert, and much less frequently we'll need to allocate new internal nodes as well. In contrast, approach (2) can sometimes add/insert without creating any new nodes.
On the other hand, when using approach (1) converting back to the immutable structure is trivial whereas for approach (2) we either have to "compact" many nodes or we have to make the immutable structure able to handle partially-full nodes and accept the loss of space efficiency. Approach (1) also leads to simpler code because we can use the array's length as the node count whereas with approach (2) we have to flow the true count down from the parent node.
All things considered, I would probably lean towards approach (1) because it seems simpler and because the existing algorithm's builder has to create a new leaf node for every add/insert anyway, so we're not losing a ton there, but I'm not sure. I'm kind of assuming that builders are a niche use-case, since you're not using ImmutableList<T> unless you want the persistent/functional data structure aspects.
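The two leaf strategies can be sketched roughly as follows (hypothetical helpers, not the prototype's code; a real node would also need splitting logic):

```csharp
// Approach (1): exact-size leaves; every add reallocates the leaf array.
static T[] AddExact<T>(T[] leaf, T item)
{
    var newLeaf = new T[leaf.Length + 1];
    Array.Copy(leaf, newLeaf, leaf.Length);
    newLeaf[leaf.Length] = item;
    return newLeaf;
}

// Approach (2): List<T>-style slack; the leaf array only grows when full,
// so the true element count must be tracked separately from Length.
// Assumes the caller splits the node before count exceeds maxNodeSize.
static void AddWithSlack<T>(ref T[] leaf, ref int count, T item, int maxNodeSize)
{
    if (count == leaf.Length)
        Array.Resize(ref leaf, Math.Min(Math.Max(4, leaf.Length * 2), maxNodeSize));
    leaf[count++] = item;
}
```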
@madelson I'd be interested to see those benchmarks run against the implementation in tunnelvisionlabs/dotnet-trees. I have a builder implementation there which adheres to the complexity bounds mentioned by @AArnott.
@sharwell here's what I got when I added TunnelVision's ImmutableTreeList to the benchmarks (using the NuGet package):
Benchmark | Time Ratio (Prototype) | Bytes Allocated Ratio (Prototype) | Time Ratio (TunnelVision) | Bytes Allocated Ratio (TunnelVision) |
---|---|---|---|---|
Index\<Int32> (5) | 0.56 | NaN | 1.79 | NaN |
Index\<Int32> (512) | 0.80 | NaN | 6.84 | NaN |
Index\<Int32> (10K) | 0.59 | NaN | 5.57 | ∞ |
Index\<String> (5) | 0.99 | NaN | 1.14 | NaN |
Index\<String> (512) | 0.83 | NaN | 2.34 | NaN |
Index\<String> (10K) | 0.81 | 1.00 | 2.89 | 2.00 |
Add\<Int32> (5) | 0.34 | 0.47 | 0.92 | 0.54 |
Add\<Int32> (512) | 0.28 | 0.62 | 2.41 | 0.70 |
Add\<Int32> (10K) | 0.21 | 0.65 | 1.76 | 0.80 |
Add\<String> (5) | 0.56 | 0.54 | 1.00 | 0.75 |
Add\<String> (512) | 0.43 | 0.60 | 2.52 | 0.77 |
Add\<String> (10K) | 0.34 | 0.61 | 1.77 | 0.85 |
Insert\<Int32> (5) | 0.49 | 0.54 | 1.07 | 0.62 |
Insert\<Int32> (512) | 0.38 | 0.75 | 3.12 | 0.88 |
Insert\<Int32> (10K) | 0.34 | 0.91 | 2.55 | 0.89 |
Insert\<String> (5) | 0.79 | 0.62 | 1.28 | 0.86 |
Insert\<String> (512) | 0.60 | 0.80 | 3.42 | 0.97 |
Insert\<String> (10K) | 0.46 | 0.85 | 2.17 | 0.94 |
CreateRange\<Int32> (5) | 0.90 | 1.28 | 1.31 | 0.69 |
CreateRange\<Int32> (512) | 0.48 | 0.14 | 2.35 | 0.23 |
CreateRange\<Int32> (10K) | 0.40 | 0.11 | 3.30 | 0.24 |
CreateRange\<String> (5) | 1.04 | 1.33 | 1.23 | 0.81 |
CreateRange\<String> (512) | 0.61 | 0.25 | 2.06 | 0.32 |
CreateRange\<String> (10K) | 0.52 | 0.23 | 2.49 | 0.32 |
AddRange\<Int32> (50) | 0.58 | 0.53 | 1.64 | 0.35 |
AddRange\<Int32> (10K) | 0.43 | 0.11 | 3.30 | 0.24 |
AddRange\<String> (50) | 0.63 | 0.33 | 1.43 | 0.33 |
AddRange\<String> (10K) | 0.58 | 0.23 | 2.77 | 0.32 |
The memory/perf characteristics of our current builder implementation are that creating a builder based on an immutable collection is O(1) in time and memory. Switching back from builder to immutable collection is O(n) in time and O(1) in memory, where n is the number of nodes actually changed while in Builder mode. Can we achieve that with your B+ tree?
I added a basic builder implementation to the prototype. For now the builder just supports indexer get/set and ToBuilder/ToImmutable. However, this should illustrate how the builder can function in general and it meets the performance criteria above.
Benchmarks:
Benchmark | Time Ratio (Prototype) | Bytes Allocated Ratio (Prototype) | Time Ratio (TunnelVision) | Bytes Allocated Ratio (TunnelVision) |
---|---|---|---|---|
BuilderSetItem\<Int32> (5) | 0.38 | 0.56 | 0.88 | 0.56 |
BuilderSetItem\<Int32> (512) | 0.23 | 0.16 | 1.89 | 0.21 |
BuilderSetItem\<Int32> (10K) | 0.27 | 0.12 | 1.73 | 0.20 |
BuilderSetItem\<String> (5) | 0.61 | 0.63 | 0.99 | 0.70 |
BuilderSetItem\<String> (512) | 0.36 | 0.25 | 1.97 | 0.29 |
BuilderSetItem\<String> (10K) | 0.35 | 0.23 | 1.79 | 0.28 |
The benchmark I'm using calls ToBuilder() on an immutable list, sets items through the builder's indexer, and then calls ToImmutable().
A few more benchmarks to share:
Benchmark | Time Ratio (Prototype) | Bytes Allocated Ratio (Prototype) | Time Ratio (TunnelVision) | Bytes Allocated Ratio (TunnelVision) |
---|---|---|---|---|
RemoveAt\<Int32> (5) | 0.56 | 0.71 | 1.96 | 0.90 |
RemoveAt\<Int32> (512) | 0.78 | 1.01 | 4.80 | 0.96 |
RemoveAt\<Int32> (10K) | 0.75 | 1.15 | 3.25 | 1.06 |
RemoveAt\<String> (5) | 0.93 | 0.79 | 1.91 | 1.23 |
RemoveAt\<String> (512) | 1.03 | 1.01 | 4.67 | 1.05 |
RemoveAt\<String> (10K) | 0.89 | 1.07 | 3.53 | 1.12 |
RemoveRange\<Int32> (5) | 0.27 | 0.43 | 1.71 | 0.48 |
RemoveRange\<Int32> (512) | 0.01 | 0.02 | 2.40 | 0.20 |
RemoveRange\<Int32> (10K) | 0.00 | 0.00 | 2.25 | 0.20 |
RemoveRange\<String> (5) | 0.57 | 0.48 | 1.87 | 0.67 |
RemoveRange\<String> (512) | 0.01 | 0.03 | 2.62 | 0.28 |
RemoveRange\<String> (10K) | 0.00 | 0.00 | 2.63 | 0.28 |
IndexOf\<Int32> (5) | 0.10 | NaN | 0.20 | NaN |
IndexOf\<Int32> (512) | 0.17 | NaN | 0.87 | NaN |
IndexOf\<Int32> (10K) | 0.17 | NaN | 1.01 | NaN |
IndexOf\<String> (5) | 0.08 | NaN | 0.19 | NaN |
IndexOf\<String> (512) | 0.11 | NaN | 0.45 | NaN |
IndexOf\<String> (10K) | 0.11 | 0.00 | 0.48 | 0.00 |
CopyTo\<Int32> (5) | 0.07 | NaN | 0.34 | NaN |
CopyTo\<Int32> (512) | 0.04 | NaN | 8.58 | NaN |
CopyTo\<Int32> (10K) | 0.03 | NaN | 13.42 | ∞ |
CopyTo\<String> (5) | 0.10 | NaN | 0.23 | NaN |
CopyTo\<String> (512) | 0.05 | NaN | 2.19 | NaN |
CopyTo\<String> (10K) | 0.05 | NaN | 4.41 | ∞ |
Some notes:
- The RemoveAt benchmarks create a list of the given size and remove indices at random until the list is empty.
- The RemoveRange benchmarks create a list of the given size and remove a single range comprising the middle third of the list elements.
- RemoveRange(index, count) can be implemented very efficiently for a B+ tree. Our RemoveAt(index) just calls RemoveRange(index, 1).
- IndexOf and CopyTo leverage a Span-based scanning helper like what @AlgorithmsAreCool suggested.

Added a stack-based enumerator like what ImmutableList<T> has. Enumeration is also significantly faster, likely since most MoveNext() calls are just advancing through a leaf node.
Benchmark | Time Ratio (Prototype) | Bytes Allocated Ratio (Prototype) | Time Ratio (TunnelVision) | Bytes Allocated Ratio (TunnelVision) |
---|---|---|---|---|
Enumeration\<Int32> (5) | 0.24 | NaN | 0.51 | NaN |
Enumeration\<Int32> (512) | 0.39 | NaN | 1.87 | NaN |
Enumeration\<Int32> (10K) | 0.35 | NaN | 2.60 | NaN |
Enumeration\<String> (5) | 0.24 | NaN | 0.37 | NaN |
Enumeration\<String> (512) | 0.34 | NaN | 0.76 | NaN |
Enumeration\<String> (10K) | 0.31 | 0.00 | 0.95 | 1.00 |
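The fast path of that enumerator can be sketched as follows (field and method names are my assumptions; a real enumerator would also handle empty lists):

```csharp
public bool MoveNext()
{
    // Common case: advance within the current leaf array. No tree
    // traversal and no stack access is needed.
    if (++_leafIndex < _currentLeaf.Length)
        return true;

    // Rare case: pop the explicit stack of index nodes to locate the
    // next leaf, then push the path back down to its first element.
    return AdvanceToNextLeaf();
}
```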
I haven't studied the perf numbers, but wanted to respond to one of the questions about memory allocations: people use the builders to reduce allocations and improve perf of creating a structure that will be frozen later. Allocating and copying an array on every Add would defeat that. For that reason I support having 'slack' in the array at least during the builder's use. At that point then, we need to decide whether it's better to have very fast and alloc-free ToImmutable or to remove the slack from the arrays. I'm not sure on this. I'm leaning slightly towards cutting the slack.
For that reason I support having 'slack' in the array at least during the builder's use. At that point then, we need to decide whether it's better to have very fast and alloc-free ToImmutable or to remove the slack from the arrays. I'm not sure on this. I'm leaning slightly towards cutting the slack.
@AArnott can you clarify what you mean by "cutting the slack" here? Are you coming down in favor of fast, alloc-free ToImmutable (no slack in the arrays) or in favor of more expensive ToImmutable() in exchange for fewer allocations on the individual builder operations?
The topic of how the builder should work is one I've given a bit of thought to and I think there are some interesting considerations / options / tradeoffs worth mentioning.
First off, consider the approach where there is no slack (arrays are always full). While this might seem like it defeats the point of having a builder, it still offers real benefits:
Next, consider the approach where we do have slack to avoid recreating leaf arrays on every change:
Finally, I'll note a couple of "in-between" options that might be worth thinking about:
Sorry for the big wall of text, but curious to hear your thoughts, especially what specific factors / benchmarks should influence the decision.
can you clarify what you mean by "cutting the slack" here? Are you coming down in favor of fast, alloc-free ToImmutable (no slack in the arrays) or in favor of more expensive ToImmutable() in exchange for fewer allocations on the individual builder operations?
Fewer allocations in the Builder demand slack. ToImmutable's lean result demands no slack. So the direction I'm leaning is to create arrays with slack in the Builder, and then trim the slack away during ToImmutable. It wins here and loses there, but as Builders are used to initialize large collections, I think these are the right trade-offs.
I just finished reading your last comment. Great points. I wonder how hard it would be to code to allow for either possibility (slack or no during mutation). Maybe the policy could allow for a few mutations with no slack. Within one Builder's lifetime if mutations keep coming, we change the optimizing assumption and start leaving slack in arrays... just an idea. It's not worth it if it adds so much complexity that bugs or inadequate testing result.
I guess that, since the arrays at each node wouldn't exceed some (small) length (right?), allocating another array and copying elements over might not be too bad. But it would be interesting to measure filling 15,000 items into a collection both ways to see how perf is impacted (and how much GC pressure each produces). Maybe you've already collected that data above... I haven't studied the tables.
this approach would increase the complexity of the builder code and may make it more difficult to share code between the builder and the immutable list. Not sure how much of a factor this should be.
If there is a significant perf win, and we have adequate tests for all the code, then even if the code isn't all shared, that feels OK to me.
We could have "slack" only at the leading edge of the tree to optimize for Add() specifically (I already have an optimization like this for the immutable version which is why Add() is faster and incurs fewer allocations than Insert()).
Such optimizations sound great. And then there's AddRange(IEnumerable) which presumably should allow you to perform optimally (particularly if a Count property is available via conditional casting).
Note that the implementation I worked on has different slack characteristics. While the implementation uses fixed-size inline arrays, it separately guarantees minimum slack exists in cases where a list is constructed from beginning to end (e.g. starting from empty and calling Add/AddRange).
separately guarantees minimum slack exists in cases where a list is constructed from beginning to end
By this do you mean that the resulting B+ tree is "compact", in the sense that a naive implementation of Add would result in lots of half-full nodes, but instead you are careful to construct it such that as many nodes as possible are full?
If so, that's also what I'm doing in the array-based implementation although I haven't built this out for the builder yet.
It would be interesting to measure filling 15000 items in a collection both ways to see how perf is impacted (and how much GC pressure each produces)
@AArnott are you imagining randomized insertion or Add? As mentioned above, Add will be able to take advantage of some other optimizations and shouldn't be as impacted by how we handle slack.
@AArnott ok I had some time to code up Builder.Add() with allowance for slack in the leaves to reduce allocations. Here are the numbers:
Benchmark | Time Ratio (Prototype) | Bytes Allocated Ratio (Prototype) | Time Ratio (TunnelVision) | Bytes Allocated Ratio (TunnelVision) |
---|---|---|---|---|
BuilderAdd\<Int32> (15K) | 0.15 | 0.21 | 1.45 | 0.19 |
BuilderAdd\<String> (15K) | 0.25 | 0.34 | 1.60 | 0.28 |
Interestingly, without the slack optimization the timing numbers are comparable, but the allocation numbers are worse (a ratio of ~2.0 instead of the ~0.25 we see here). Therefore, I think having slack in the leaves is a good fit for the builder. There's a good amount of room to tune as well: right now we always expand the node to maximum size, but we could instead do something more conservative, like going to max / 2 on the first expansion, which might be better for insertions. We could also switch from a no-slack mode to a slack mode based on the number of expansions performed, like you mentioned. I'm not sure either of those is worth exploring now, but it could make sense in the future.
Those numbers look good. I prefer the "prototype" columns to the "TunnelVision" columns, as a few percentage points more memory seems totally worth being several times faster.
@AArnott note that the time ratio for my branch is primarily caused by the fact that the inline 8 element array search has not been vectorized and uses a coding pattern in the indexer which the JIT doesn't know how to optimize. It is designed to support branchless lookups on machines with 256-bit registers at each level of index once that work is complete.
@sharwell are you referring to something that would optimize these switch statements?
If so, could Unsafe be used to avoid branching in the meantime? Something like this:
static ref T Item<T>(ref FixedArray8<T> array, int index)
{
if ((uint)index >= 8) { throw new IndexOutOfRangeException(); }
return ref Unsafe.Add(ref Unsafe.As<FixedArray8<T>, T>(ref array), index);
}
Not sure if this is making some invalid assumptions about layout.
@sharwell are you referring to something that would optimize these switch statements?
That's part of the issue, but it can go much further. For example, an 8-element binary search could be implemented in a completely vectored form based on code similar to https://fioraaeterna.tumblr.com/post/26644202378/fast-binary-searching.
If so, could Unsafe be used to avoid branching in the meantime? Something like this:

static ref T Item<T>(ref FixedArray8<T> array, int index)
{
    if ((uint)index >= 8) { throw new IndexOutOfRangeException(); }
    return ref Unsafe.Add(ref Unsafe.As<FixedArray8<T>, T>(ref array), index);
}

Not sure if this is making some invalid assumptions about layout.
This was implemented in https://github.com/tunnelvisionlabs/dotnet-trees/pull/102, but there are concerns that this may violate layout assumptions made by the JIT. Switching to Vector256<int> would resolve this in a different way.
I want to poke this issue to see how I can help move this towards the finish line. There is some great work that has been done so far and I think folks will benefit from it if we can get it there.
I'm assuming more comprehensive testing needs to be done, but are there still outstanding design questions?
Without re-reading the whole thread, IIRC the issues are:
The time required can be hard to come by, but your pinging here helps us understand community demand for it so we can prioritize accordingly.
Currently the Node class used for ImmutableList<T> includes 40 bytes of overhead per list element on X64, and 24 bytes of overhead on X86. This means the memory cost of using this data structure for an ImmutableList<int> is 1000% on X64 and 600% on X86 (these figures represent the amount of memory used for storing non-data elements associated with the structure).

I propose migrating to a B+-tree instead, using a fixed value for b and declaring data values in the index and data pages directly (to avoid unnecessary memory overhead for arrays). The choice of b primarily affects the following things:
In addition to the above, B+-trees provide substantial memory savings by removing the child pointers (the Left and Right pointers of AVL nodes) from the leaves.
My initial calculations (which could be completely wrong) indicate that the average memory allocation required for a single mutation operation on a list of 100000 elements using the AVL structure is 775 bytes for X64 and 493 bytes for X86. The optimal value of b for X64 in terms of mutation allocation overhead is b=8, which costs 476 bytes per operation on a list of the same size (39% reduction). For X86 this same value of b costs 388 bytes per mutation (21% reduction, and b=6 would be optimal with 382 bytes). The overall storage overhead for b=8 drops a staggering amount down to 169% for X64 and 119% for X86 (counting the average page at 75% capacity, based on a merge at 50% and split at 100%).
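For reference, the quoted overhead ratios follow directly from the per-element node cost (a sanity check on the arithmetic, not new measurements):

```csharp
// ImmutableList<int> stores 4 bytes of actual data per element, so:
//   X64: 40 bytes of node overhead / 4 bytes of data = 1000%
//   X86: 24 bytes of node overhead / 4 bytes of data =  600%
```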