dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/
MIT License
1.62k stars 121 forks source link

`PoolingBufferWriter` class and `MemoryAllocator.GetArrayAllocator<T>()` method to use regular arrays instead of pooling. `MemoryAllocator<T>` #192

Closed adam-dot-cohen closed 11 months ago

adam-dot-cohen commented 1 year ago

@adam-dot-cohen , it's better to continue conversation to dotnext repo. However, you can already achieve what you want with PoolingBufferWriter class and MemoryAllocator.GetArrayAllocator<T>() method to use regular arrays instead of pooling. Also, you can provide your own MemoryAllocator<T> delegate to override allocation strategy.

Originally posted by @sakno in https://github.com/dotnet/runtime/issues/91728#issuecomment-1739454693


Thanks, will close out the DotNet issue and take a look at PoolingBufferWriter. If you have any thoughts off the cuff thoughts on why the performance and allocations of PoolingBufferWriter aren't inline with the implementation I'd posted, it'd be appreciated. If not, I'll take a look when I have time.

sakno commented 1 year ago

There is benchmark you can check out to compare buffer types. PoolingArrayBufferWriter is almost the same as PoolingBufferWriter with the exception of buffer types. PoolingArrayBufferWriter can work with ArrayPool<T> while PoolingBufferWriter can work with any custom MemoryAllocator<T>, including unmanaged memory pool.

Do you have source code of your benchmark? The main difference is how overflow handled. In case of DotNext, it doubles the capacity on every overflow, just like List<T>. You implementation uses very high DefaultGrowthIncrement value, while instantiated PoolingArrayBufferWriter class using parameterless constructor doesn't preallocate the buffer. It's fair to set its capacity to 512 as well as for your implementation.

sakno commented 1 year ago

After deep analysis of your implementation, I found the following

adam-dot-cohen commented 1 year ago

I appreciate the through code review. I've updated the code per your suggestions (comments below).

Regarding the growth increment, I modified the approach to start at 1 and length * 2 loosely modeled after List<T>. It definitely slowed things down a bit, but still faster than all with the exception of High Performance Toolkit's ArrayPoolBufferWriter on the count of 100 test. The results are down at the bottom.

I'm going to make a pass at using the quant side of my brain when I get time over the next couple of days and add a struct to capture streaming statistics (mean, variance, std dev, etc.) to see if I can better optimize. Will also fork your repo and add my implementation to your benchmarks just to kick the tires.

Link to revised implemention.

  • Inside of Advance method: this._index + count may throw OverflowException

checked added

  • Inside of GetMemory as well as GetSpan: this._index += sizeHint; may overflow

checked via call to Grow

  • Inside of Grow method: this._index + length and this._index + this._growthIncrement may overflow

checked added

  • Shared.Return must take into account blittable types. It means that the second optional parameter must be set to RuntimeHelpers.IsReferenceOrContainsReferences<T> check to avoid memory leaks

added to Grow and Dispose

  • _disposed is not checked across all public methods

added

  • Array pool cannot be customized. Ctor accepts pool only for initial buffer, but not for subsequent allocations. ArrayPool<T>.Shared.Rent is obviously inlined by JIT. pool should be stored as private field instead.

careless oversight...resolved

New Implementation

|                              Method | TotalCount |       Mean |     Error | Ratio | Allocated | Alloc Ratio |
|------------------------------------ |----------- |-----------:|----------:|------:|----------:|------------:|
|       'Proposed MemoryBufferWriter' |        100 |   2.252 us | 0.0655 us |  1.00 |     896 B |        1.00 |
| 'High Perf Toolkit ArrayPoolWriter' |        100 |   1.953 us | 0.0725 us |  0.87 |     880 B |        0.98 |
|        'DotNext PooledBufferWriter' |        100 |   7.572 us | 0.1542 us |  3.37 |    1232 B |        1.38 |
|   'DotNext PooledArrayBufferWriter' |        100 |   6.943 us | 0.1426 us |  3.09 |    1032 B |        1.15 |
|         'MS RecyclableMemoryStream' |        100 |   7.127 us | 0.1458 us |  3.19 |     776 B |        0.87 |
|                                     |            |            |           |       |           |             |
|       'Proposed MemoryBufferWriter' |       1000 |   2.285 us | 0.0540 us |  1.00 |     896 B |        1.00 |
| 'High Perf Toolkit ArrayPoolWriter' |       1000 |   2.509 us | 0.0545 us |  1.10 |     880 B |        0.98 |
|        'DotNext PooledBufferWriter' |       1000 |   8.013 us | 0.2233 us |  3.53 |    1232 B |        1.38 |
|   'DotNext PooledArrayBufferWriter' |       1000 |   7.403 us | 0.1518 us |  3.25 |    1032 B |        1.15 |
|         'MS RecyclableMemoryStream' |       1000 |   7.462 us | 0.1531 us |  3.29 |     776 B |        0.87 |
|                                     |            |            |           |       |           |             |
|       'Proposed MemoryBufferWriter' |      10000 |   2.849 us | 0.0612 us |  1.00 |     896 B |        1.00 |
| 'High Perf Toolkit ArrayPoolWriter' |      10000 |   4.043 us | 0.0823 us |  1.42 |     880 B |        0.98 |
|        'DotNext PooledBufferWriter' |      10000 |   9.168 us | 0.1811 us |  3.23 |    1232 B |        1.38 |
|   'DotNext PooledArrayBufferWriter' |      10000 |   8.417 us | 0.1705 us |  2.96 |    1032 B |        1.15 |
|         'MS RecyclableMemoryStream' |      10000 |  11.461 us | 0.2311 us |  4.03 |     776 B |        0.87 |
|                                     |            |            |           |       |           |             |
|       'Proposed MemoryBufferWriter' |     100000 |   8.311 us | 0.1430 us |  1.00 |     896 B |        1.00 |
| 'High Perf Toolkit ArrayPoolWriter' |     100000 |  14.310 us | 0.2596 us |  1.72 |     880 B |        0.98 |
|        'DotNext PooledBufferWriter' |     100000 |  21.938 us | 0.4278 us |  2.63 |    1232 B |        1.38 |
|   'DotNext PooledArrayBufferWriter' |     100000 |  19.286 us | 0.3480 us |  2.32 |    1032 B |        1.15 |
|         'MS RecyclableMemoryStream' |     100000 |  45.471 us | 0.6539 us |  5.46 |     776 B |        0.87 |
|                                     |            |            |           |       |           |             |
|       'Proposed MemoryBufferWriter' |    1000000 |  61.865 us | 0.1369 us |  1.00 |     896 B |        1.00 |
| 'High Perf Toolkit ArrayPoolWriter' |    1000000 | 109.963 us | 1.5778 us |  1.77 |     880 B |        0.98 |
|        'DotNext PooledBufferWriter' |    1000000 | 144.947 us | 2.7834 us |  2.34 |    1232 B |        1.38 |
|   'DotNext PooledArrayBufferWriter' |    1000000 | 127.936 us | 0.8402 us |  2.06 |    1032 B |        1.15 |
|         'MS RecyclableMemoryStream' |    1000000 | 383.564 us | 2.0380 us |  6.19 |     776 B |        0.87 |

512 Increment

|                              Method | TotalCount |       Mean |     Error | Ratio | Allocated | Alloc Ratio |
|------------------------------------ |----------- |-----------:|----------:|------:|----------:|------------:|
|       'Proposed MemoryBufferWriter' |        100 |   1.815 us | 0.0856 us |  1.00 |     992 B |        1.00 |
| 'High Perf Toolkit ArrayPoolWriter' |        100 |   1.841 us | 0.0808 us |  1.03 |     976 B |        0.98 |
|        'DotNext PooledBufferWriter' |        100 |   6.345 us | 0.3128 us |  3.56 |    1328 B |        1.34 |
|         'MS RecyclableMemoryStream' |        100 |   6.086 us | 0.2201 us |  3.43 |     872 B |        0.88 |
|                                     |            |            |           |       |           |             |
|       'Proposed MemoryBufferWriter' |       1000 |   2.192 us | 0.0912 us |  1.00 |     992 B |        1.00 |
| 'High Perf Toolkit ArrayPoolWriter' |       1000 |   2.248 us | 0.0632 us |  1.03 |     976 B |        0.98 |
|        'DotNext PooledBufferWriter' |       1000 |   6.730 us | 0.3361 us |  3.12 |    1328 B |        1.34 |
|         'MS RecyclableMemoryStream' |       1000 |   6.458 us | 0.2275 us |  2.99 |     872 B |        0.88 |
|                                     |            |            |           |       |           |             |
|       'Proposed MemoryBufferWriter' |      10000 |   2.437 us | 0.0906 us |  1.00 |     992 B |        1.00 |
| 'High Perf Toolkit ArrayPoolWriter' |      10000 |   3.628 us | 0.1447 us |  1.51 |     976 B |        0.98 |
|        'DotNext PooledBufferWriter' |      10000 |   8.089 us | 0.3166 us |  3.37 |    1328 B |        1.34 |
|         'MS RecyclableMemoryStream' |      10000 |   9.856 us | 0.2774 us |  4.09 |     872 B |        0.88 |
|                                     |            |            |           |       |           |             |
|       'Proposed MemoryBufferWriter' |     100000 |   7.320 us | 0.1485 us |  1.00 |     992 B |        1.00 |
| 'High Perf Toolkit ArrayPoolWriter' |     100000 |  13.107 us | 0.2647 us |  1.79 |     976 B |        0.98 |
|        'DotNext PooledBufferWriter' |     100000 |  19.373 us | 0.3802 us |  2.63 |    1328 B |        1.34 |
|         'MS RecyclableMemoryStream' |     100000 |  43.500 us | 0.8456 us |  5.92 |     872 B |        0.88 |
|                                     |            |            |           |       |           |             |
|       'Proposed MemoryBufferWriter' |    1000000 |  54.994 us | 1.0460 us |  1.00 |     992 B |        1.00 |
| 'High Perf Toolkit ArrayPoolWriter' |    1000000 | 105.674 us | 1.7592 us |  1.93 |     976 B |        0.98 |
|        'DotNext PooledBufferWriter' |    1000000 | 140.283 us | 0.3657 us |  2.55 |    1328 B |        1.34 |
|         'MS RecyclableMemoryStream' |    1000000 | 380.883 us | 1.8720 us |  6.92 |     872 B |        0.88 |
sakno commented 1 year ago

I think your repo is private, I can't observe the source code.

adam-dot-cohen commented 1 year ago

I can't seem to make it public. Tried toggling the visibility settings a few times and it doesn't seem to have any effect. I made you a collaborator.

sakno commented 12 months ago

It makes no sense to cache _buffer field because you anyway slice over it. It actually points to the same object as _array.

fixed (T* item = items) no need to do this because an array is convertible to Span<T> without pinning.

Memory<T> IMemoryOwner<T>.Memory => this._buffer; you still need to slice the buffer. Otherwise, the returned buffer is larger and may contain random data. Rent doesn't give any guarantees about how each element in the rented array initialized.

if(checked(this._index + count) <= this._buffer.Span.Length) this.Grow(count); why growth happened when the buffer has enough space to place count elements in it? Probably, >= should be used instead of <=

It's better to use this method in the benchmark instead of your own to append the buffer.

sakno commented 12 months ago
var optimal = (long)this._buffer.Span.Length * DefaultGrowthMultiple;

        if (optimal > int.MaxValue) optimal = int.MaxValue;

What if Length is already equal to int.MaxValue? In this case, optimal become int.MaxValue again without actual growth? Yes, it is corner case but it should be considered.

sakno commented 12 months ago

public void Advance(int count) here count is not validated for < 0.

CodingMadness commented 11 months ago

How is this thread continuing? Any news here on so far?

sakno commented 11 months ago

@Shpendicus , proposed implementation requires careful code review, add all necessary checks and take all corner cases into account. After that, I think the performance will be the same.

CodingMadness commented 11 months ago

@Shpendicus , proposed implementation requires careful code review, add all necessary checks and take all corner cases into account. After that, I think the performance will be the same.

makes sense yes, i doubt that it would be better after all these additions.

CodingMadness commented 11 months ago

@adam-dot-cohen did you consider/implement all the proposed changes from sakno?

sakno commented 11 months ago

Also, it's hard to measure Write method alone. The current benchmark measures instantiation of the class as well. PooledArrayBufferWriter implements much more interfaces than proposed implementation. Instantiation cost depends on the number of implemented interfaces.

adam-dot-cohen commented 11 months ago

It makes no sense to cache _buffer field because you anyway slice over it. It actually points to the same object as _array.

This was intentional given the performance improvement versus Array.Copy() - and underlying best practices such as type validation that are backed into Memory<T>.Span. The instance variable of the array was in the class was there primarily to be able to return it to the pool.

I kicked the tires and ran a quick benchmark in Linqpad and realized using new Span<T>(_array, startIndex, length) over the array instance variable for copy operations is even faster by ~20%...

image

You can find the the code here. If you happen to use Linqpad, just open the script, highlight the methods with your mouse and hit CTRL + SHIFT + B to run BenchmarkDotNet inside Linqpad.

I modified the implementation in my repo to use the Span<T> copy method in LP script, and with the BenchmarkDotNet settings you recommended....purely as an exercise to see if it would be worth the effort to modify PooledArrayBufferWriter which is clearly more robust. Results at bottom of post.

Please keep in mind that my implementation is an experimental "toy" implementation. It doesn't have tests around it and doesn't have a ton of thought behind it.

fixed (T* item = items) no need to do this because an array is convertible to Span<T> without pinning.

Agreed, the implementation was modified. For some reason the link above referenced an old commit. My apologies, I've been an executive for 20 years and just started coding again - still getting up to speed on this Git, DotNet Core, etc...

Memory<T> IMemoryOwner<T>.Memory => this._buffer; you still need to slice the buffer. Otherwise, the returned buffer is larger and may contain random data. Rent doesn't give any guarantees about how each element in the rented array initialized.

Again, this is/was resolved. My latest commit that hadn't push - probably due to user error.

if(checked(this._index + count) <= this._buffer.Span.Length) this.Grow(count); why growth happened when the buffer has enough space to place count elements in it? Probably, >= should be used instead of <=

Agreed, this was resolved as well. I also modified the growth algorithm. The implementation is inspired by ArrayPoolWriter but I wrote a custom implementation that's simplified and slightly faster than the the framework implementation. It grows the buffer exponentially by powers of 2 while drastically reducing allocations.

It's better to use this method in the benchmark instead of your own to append the buffer.

Will give this a try when I get a chance. Re: the instantiation comment in your last post, the benchmark class in my Git repo (which is now public) is a copy of your MemoryStreamingBenchmarks, which also measure instantiation. It's likely good to include all things considered to provide a "real-world" trial of what would be expected production usage - and to flag potential optimizations. Here are the updated benchmarks with the Memory swapped out for Span copy over instance array variable, with the updated settings...

|                              Method | TotalCount |          Mean |      Error | Ratio |   Gen0 |   Gen1 |   Gen2 | Allocated | Alloc Ratio |
|------------------------------------ |----------- |--------------:|-----------:|------:|-------:|-------:|-------:|----------:|------------:|
|       'Proposed MemoryBufferWriter' |        100 |      55.01 ns |   0.524 ns |  1.00 | 0.0030 |      - |      - |      40 B |        1.00 |
| 'High Perf Toolkit ArrayPoolWriter' |        100 |      61.34 ns |   0.427 ns |  1.12 | 0.0030 |      - |      - |      40 B |        1.00 |
|         'MS RecyclableMemoryStream' |        100 |     359.48 ns |   7.166 ns |  6.31 | 0.0210 |      - |      - |     272 B |        6.80 |
|   'DotNext PooledArrayBufferWriter' |        100 |     693.37 ns |  15.038 ns | 12.39 | 0.0257 | 0.0248 |      - |     344 B |        8.60 |
|        'DotNext PooledBufferWriter' |        100 |   1,088.32 ns |  21.910 ns | 19.79 | 0.0191 | 0.0172 | 0.0038 |     241 B |        6.03 |
|                                     |            |               |            |       |        |        |        |           |             |
|       'Proposed MemoryBufferWriter' |       1000 |      94.56 ns |   0.849 ns |  1.00 | 0.0030 |      - |      - |      40 B |        1.00 |
| 'High Perf Toolkit ArrayPoolWriter' |       1000 |     195.63 ns |   0.671 ns |  2.07 | 0.0029 |      - |      - |      40 B |        1.00 |
|         'MS RecyclableMemoryStream' |       1000 |     555.60 ns |   5.997 ns |  5.88 | 0.0210 |      - |      - |     272 B |        6.80 |
|   'DotNext PooledArrayBufferWriter' |       1000 |     640.85 ns |  19.352 ns |  7.46 | 0.0257 | 0.0248 |      - |     344 B |        8.60 |
|        'DotNext PooledBufferWriter' |       1000 |   1,161.58 ns |  24.946 ns | 11.89 | 0.0229 | 0.0210 | 0.0057 |     248 B |        6.20 |
|                                     |            |               |            |       |        |        |        |           |             |
|       'Proposed MemoryBufferWriter' |      10000 |     497.44 ns |   0.947 ns |  1.00 | 0.0029 |      - |      - |      40 B |        1.00 |
| 'High Perf Toolkit ArrayPoolWriter' |      10000 |     913.76 ns |   1.347 ns |  1.84 | 0.0029 |      - |      - |      40 B |        1.00 |
|   'DotNext PooledArrayBufferWriter' |      10000 |   1,287.86 ns |  25.486 ns |  2.56 | 0.0248 | 0.0229 |      - |     344 B |        8.60 |
|        'DotNext PooledBufferWriter' |      10000 |   1,926.01 ns |  37.601 ns |  3.85 | 0.0191 | 0.0153 | 0.0076 |     292 B |        7.30 |
|         'MS RecyclableMemoryStream' |      10000 |   2,523.90 ns |   5.649 ns |  5.07 | 0.0191 |      - |      - |     272 B |        6.80 |
|                                     |            |               |            |       |        |        |        |           |             |
|       'Proposed MemoryBufferWriter' |     100000 |   4,438.10 ns |   7.004 ns |  1.00 |      - |      - |      - |      40 B |        1.00 |
|   'DotNext PooledArrayBufferWriter' |     100000 |   7,790.30 ns |  13.179 ns |  1.76 | 0.0153 |      - |      - |     344 B |        8.60 |
| 'High Perf Toolkit ArrayPoolWriter' |     100000 |   8,000.33 ns |  70.424 ns |  1.80 |      - |      - |      - |      40 B |        1.00 |
|        'DotNext PooledBufferWriter' |     100000 |   9,329.14 ns |  24.509 ns |  2.10 | 0.0153 |      - |      - |     353 B |        8.82 |
|         'MS RecyclableMemoryStream' |     100000 |  21,751.60 ns |  47.945 ns |  4.90 |      - |      - |      - |     272 B |        6.80 |
|                                     |            |               |            |       |        |        |        |           |             |
|       'Proposed MemoryBufferWriter' |    1000000 |  43,444.78 ns |  33.357 ns |  1.00 |      - |      - |      - |      40 B |        1.00 |
|   'DotNext PooledArrayBufferWriter' |    1000000 |  75,537.09 ns | 643.036 ns |  1.74 |      - |      - |      - |     344 B |        8.60 |
| 'High Perf Toolkit ArrayPoolWriter' |    1000000 |  80,848.85 ns | 691.750 ns |  1.86 |      - |      - |      - |      40 B |        1.00 |
|        'DotNext PooledBufferWriter' |    1000000 |  86,916.75 ns | 508.096 ns |  2.00 |      - |      - |      - |     355 B |        8.88 |
|         'MS RecyclableMemoryStream' |    1000000 | 213,086.38 ns | 252.987 ns |  4.90 |      - |      - |      - |     272 B |        6.80 |
adam-dot-cohen commented 11 months ago

It's better to use this method in the benchmark instead of your own to append the buffer.

Understood. That said, the primary use case I was focused on for my purposes was writing chunks of Span and Memory to extend this serializer that I wrote to support full object graph with the same 10-20x performance improvement over MemoryPack, MessagePack and Protobuf.

Disclaimer: HyperSerializer started out as a quick and dirty champion challenger versus C++. The code isn't winning any beauty pageants.

sakno commented 11 months ago

array instance variable for copy operations is even faster by ~20%...

I've done this for PooledArrayBufferWriter<T> as well by replacing Array.Copy operations. It drops 7 ns, but it is not 20% from the overall operation time. It can be done without Span<T> ctor to skip some checks that already done internally.

sakno commented 11 months ago

BufferExtensions.Write as a primary target for benchmark is crucial here. It accepts IBufferWriter<T> as interface, which means that runtime cannot devirtualize calls to Advance and GetSpan while your declared Write method can (because it is in sealed class).

adam-dot-cohen commented 11 months ago

array instance variable for copy operations is even faster by ~20%...

I've done this for PooledArrayBufferWriter<T> as well by replacing Array.Copy operations. It drops 7 ns, but it is not 20% from the overall operation time. It can be done without Span<T> ctor to skip some checks that already done internally.

The above results were net7.0. Surprisingly, there's a wider margin in net6.0. They must have made some performance improvements to heap management, GC, compiler etc. to get gain of 10% w/7.0 for free. The Array class doesn't appear to have any changes that would yield this much upside...

image

sakno commented 11 months ago

DotNext 4.15.0 has been published with some proposed optimizations that are applicable to types mentioned above. Feel free to reopen this issued if you have open questions.