Open macrogreg opened 5 months ago
Thanks @macrogreg, this might be a better place, and you've written the question much better than I did.
If there are no current benchmarks, I would be happy to take care of that myself; I'll use BenchmarkDotNet to test various situations.
In fact, looking at the implementation of `Stream.ReadAsync(Memory<byte> buffer, ...)` I see that it delegates to the byte-array based implementation, plus does some (minor) additional stuff.
Isn't this reason alone to prefer the `Memory`-based implementation? It is handling the different scenarios for you (it doesn't just defer to the `byte[]` version, but has its own implementation with array pooling if the `Memory<byte>` isn't just a `byte[]`). I agree that it is good to understand what it is doing, and perhaps the documentation could be improved there.
As far as I can tell, that is the point behind `Memory<T>` itself. It is just a wrapper for `byte[]`, except for when it isn't. It provides more flexibility and allows for a standard or uniform interface for scenarios that previously might be different.
I think this method and the preference for it would just come out of that.
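To illustrate that "wrapper for `byte[]` except for when it isn't" point, here is a minimal sketch (my own example, not from the thread) showing how code can detect whether a `Memory<byte>` is array-backed:

```csharp
using System;
using System.Runtime.InteropServices;

class MemoryBackingDemo
{
    static void Main()
    {
        // A Memory<byte> created over an array is array-backed, and
        // MemoryMarshal.TryGetArray can recover the underlying segment.
        Memory<byte> fromArray = new byte[16];
        bool arrayBacked = MemoryMarshal.TryGetArray(
            (ReadOnlyMemory<byte>)fromArray, out ArraySegment<byte> segment);
        Console.WriteLine(arrayBacked);   // True
        Console.WriteLine(segment.Count); // 16

        // A Memory<byte> can also wrap non-array storage (e.g. native
        // memory via a custom MemoryManager<byte>); in that case
        // TryGetArray returns false and Stream has no array to hand off.
    }
}
```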
> In fact, looking at the implementation of `Stream.ReadAsync(Memory<byte> buffer, ...)` I see that it delegates to the byte-array based implementation, plus does some (minor) additional stuff. Isn't this reason alone to prefer the `Memory`-based implementation? It is handling the different scenarios for you (it doesn't just defer to the `byte[]` version, but has its own implementation with array pooling if the `Memory<byte>` isn't just a `byte[]`). [...] I think this method and the preference for it would just come out of that.
Unfortunately that doesn't match up with what the rule says; it says that if one starts with a `byte[]`, one should then wrap that in a `Memory<byte>` and use those overloads, because they use memory more efficiently.
It's clear that they do not. But again, when I filed the original issue I was giving the tool the benefit of the doubt and asking for proof.
> In fact, looking at the implementation of `Stream.ReadAsync(Memory<byte> buffer, ...)` I see that it delegates to the byte-array based implementation, plus does some (minor) additional stuff. Isn't this reason alone to prefer the `Memory`-based implementation? It is handling the different scenarios for you [...] I think this method and the preference for it would just come out of that.
(Sorry, folks, a notification about an answer on this thread got misplaced, so a few weeks have passed.)
@bzd3y, I cannot agree. The current logic in `Stream`'s `Memory`-based overload appears to be (code linked above):

If the `Memory` buffer is backed by an array, then it delegates to the read-method that directly uses that array.
-- That degrades rather than improves performance, albeit very slightly. If I know I have an array, I could have called the right overload directly, saving a few instructions.

If the `Memory` buffer is backed by something that is not an array, then it "rents" an array from the `ArrayPool`, delegates to the very same array-based method, and then copies (!) the data from the rented array to the destination `Memory`. So basically, the entire set of read bytes is copied.
-- If the developer had an option to use an array, they would have been able to avoid that entire copy. This goes against the intuition that if I already have a `Memory` buffer, the Read API will attempt to use it directly rather than completely ignoring it, using a pooled temp array and then doing a complete memory copy. I realize the difficulties around using a generic `Memory` instance when calling an underlying native async IO API in a GC environment. However, the docs suggesting a perf benefit surely make one believe that some kind of effort for some "magic" is being done.
In summary, the `Memory`-based overload seems to have a benefit only in the scenario where the user has a `Memory` instance not backed by an array. In those cases it does the right thing around array pooling.
In the much more common case where the user has a `Memory` instance that is backed by an array, the `Memory`-based API has no perf benefit at all over the array-based overload, but it provides for briefer code, which is still a benefit.
In cases where the user has direct access to an array already, the `Memory`-based overload actually has a small extra cost and no benefit.
I would love to be shown to be wrong so that I can learn, but looking at the code this is what I conclude. So, it would be great to get some eyes from the right dev team on this. :)
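The two branches described above can also be expressed in code. This is a hedged paraphrase of the fallback logic (my own sketch, not the actual BCL source, which is linked above):

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;

static class ReadAsyncFallbackSketch
{
    public static async ValueTask<int> ReadAsync(
        Stream stream, Memory<byte> buffer, CancellationToken ct = default)
    {
        // Branch 1: array-backed Memory -> delegate straight to the
        // byte[] overload (small dispatch cost, no data copy).
        if (MemoryMarshal.TryGetArray((ReadOnlyMemory<byte>)buffer,
                out ArraySegment<byte> segment))
        {
            return await stream.ReadAsync(
                segment.Array!, segment.Offset, segment.Count, ct);
        }

        // Branch 2: not array-backed -> rent a temp array, read into it,
        // then copy every read byte into the caller's Memory.
        byte[] rented = ArrayPool<byte>.Shared.Rent(buffer.Length);
        try
        {
            int read = await stream.ReadAsync(rented, 0, buffer.Length, ct);
            new ReadOnlySpan<byte>(rented, 0, read).CopyTo(buffer.Span);
            return read;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }

    static async Task Main()
    {
        var source = new MemoryStream(new byte[] { 1, 2, 3, 4 });
        var destination = new byte[4]; // array-backed, takes branch 1
        int read = await ReadAsync(source, destination);
        Console.WriteLine(read); // 4
    }
}
```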
> If the `Memory` buffer is backed by something that is not an array, then it "rents" an array from the `ArrayPool`, delegates to the very same array-based method and then copies (!) the data from the rented array to the destination `Memory`. So basically, the entire set of read bytes is copied.
This seals the deal for me. I hadn't looked that deeply into the implementation, but it's clear now that this rule is bad advice.
I suspect that the rule came about because people were excited about the new memory stuff and wanted to push adoption.
We need to prove what the actual performance effects are empirically. I'd be happy to create a demo using BenchmarkDotNet if there isn't some other standard way of doing things on this project?
> If the `Memory` buffer is backed by something that is not an array, then it "rents" an array from the `ArrayPool`, delegates to the very same array-based method and then copies (!) the data from the rented array to the destination `Memory`. So basically, the entire set of read bytes is copied.

> This seals the deal for me. I hadn't looked that deeply into the implementation, but it's clear now that this rule is bad advice.
Yes, this is the respective code in the current `main` branch.
> I suspect that the rule came about because people were excited about the new memory stuff and wanted to push adoption. We need to prove what the actual performance effects are empirically. I'd be happy to create a demo using BenchmarkDotNet if there isn't some other standard way of doing things on this project?
That would be great, @gavinBurtonStoreFeeder. On the other hand, I'd love to get some acknowledgement / guidance from the owning engineering team, otherwise our effort may be in vain. :)
I'll do it over the weekend, on my personal GH account, to avoid any IP issues with my employer
I didn't get to it this weekend but will do it soon.
@macrogreg @gavinBurtonStoreFeeder Sorry, I chimed in here and then lost track and never came back. So if it has been long enough that you just want to ignore this, then that's fine.
I've gone back and forth on this. When I looked at it more closely and saw the copy (actually, it seems, two copies), I was ready to say "Oh, good point." But thinking about it more, I'm not convinced.
This may be more of an ask to change my mind than me trying to change yours, but I don't think the copies can be avoided in the case of a non-`byte[]`-backed `Memory<byte>`. Here, reading implies copying, right? You don't have a `byte[]`-backed `Memory<byte>`, so it can't just give you one and avoid having to copy the `byte[]` it reads into the `Memory<byte>` that you gave it, which is backed by something other than `byte[]`...
It has to work with what you gave it, what you passed in. If that was a `byte[]`-backed `Memory<byte>`, then it just calls the array-based method, and that is done once. I think the performance hit (like function call overhead or whatever) is minimal there. Any performance gain/efficiency claim probably comes from the way `Memory<>` works in conjunction with `ArraySegment<>`, but you aren't necessarily wrong that it would help if that was explained/documented/substantiated.
But if you didn't give it something like that, then it is going to be "smart" about it and do some work for you that you would have to do anyway. After all, if you didn't pass in a `byte[]`-backed `Memory<byte>` in the first place, then you probably can't use the `byte[]` overload of `ReadAsync()` anyway, right? It would return a `byte[]` of the data being read (one copy), and then you are just going to have to copy that to whatever kind of buffer is being used that would result in a non-`byte[]`-backed `Memory<byte>` in the first place.
So the TL;DR might be:
- Reading means copying from `SomewhereElse` into our memory.
- The read happens into a `byte[]`, but if you passed in something that isn't one, then that requires another copy to get it there.

Or am I missing something? And that isn't to argue that this rule is necessary. I'm not sold on that either. I'm just not sure that it is wrong.
I must apologise that I never got round to doing the benchmarked example.
Essentially I'm challenging the claim that if one has started with a `byte[]`, then wrapping it in a `Memory` before calling `ReadAsync` or `WriteAsync` will grant a performance boost.
The rule triggers if one starts with a `byte[]` - my suspicion being that the rule is overzealous.
If somebody else wants to actually prove the difference, I would appreciate it. For now, I've disabled the rule as being unsubstantiated unless somebody can find some actual evidence.
Addendum: The exact claim says that the rule "suggests using the memory-based method overloads instead, because they are more efficient"
https://learn.microsoft.com/en-us/dotnet/fundamentals/code-analysis/quality-rules/ca1835#cause
Update: I decided to just do it now (although I don't have time to set up a git repo etc). This is the code:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class StreamWriteBenchmark
{
    private readonly byte[] data;
    private readonly MemoryStream memoryStream;

    public StreamWriteBenchmark()
    {
        // Pre-allocate the byte array with 10,000 bytes
        data = new byte[10_000];
        memoryStream = new MemoryStream(); // MemoryStream to write to
    }

    // Benchmark using the byte[] overload of WriteAsync
    [Benchmark(Baseline = true)]
    public async Task WriteUsingByteArray()
    {
        memoryStream.Seek(0, SeekOrigin.Begin); // Reset the stream position
        await memoryStream.WriteAsync(data, 0, data.Length);
    }

    // Benchmark using the Memory<byte> overload of WriteAsync
    [Benchmark]
    public async Task WriteUsingMemory()
    {
        memoryStream.Seek(0, SeekOrigin.Begin); // Reset the stream position
        var memoryData = new Memory<byte>(data); // Wrap the byte array with Memory<byte>
        await memoryStream.WriteAsync(memoryData);
    }

    // Main method to run the benchmark
    public static void Main(string[] args)
    {
        _ = BenchmarkRunner.Run<StreamWriteBenchmark>();
    }
}
```
Update:
I added another test case for `data.AsMemory(...)` in case this works differently. The test is running:

```csharp
[Benchmark]
public async Task WriteUsingAsMemory()
{
    memoryStream.Seek(0, SeekOrigin.Begin); // Reset the stream position
    await memoryStream.WriteAsync(data.AsMemory(0, data.Length));
}
```
I have results
I've run it a few times, in one instance Memory was faster by 2ns (167 vs 169ns)
I added memory profiling and got this - this is one of the more extreme cases of byte[] being faster
I welcome a little bit of peer review on the methodology and results
```text
// * Detailed results *
StreamWriteBenchmark.WriteUsingByteArray: DefaultJob
Runtime = .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2; GC = Concurrent Workstation
Mean = 155.817 ns, StdErr = 0.898 ns (0.58%), N = 48, StdDev = 6.223 ns
Min = 138.191 ns, Q1 = 155.648 ns, Median = 156.550 ns, Q3 = 157.592 ns, Max = 169.726 ns
IQR = 1.944 ns, LowerFence = 152.731 ns, UpperFence = 160.508 ns
ConfidenceInterval = [152.665 ns; 158.970 ns] (CI 99.9%), Margin = 3.153 ns (2.02% of Mean)
Skewness = -1.18, Kurtosis = 5.34, MValue = 2
-------------------- Histogram --------------------
[138.186 ns ; 142.681 ns) | @@@@@
[142.681 ns ; 147.176 ns) |
[147.176 ns ; 150.717 ns) |
[150.717 ns ; 154.650 ns) | @
[154.650 ns ; 159.145 ns) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[159.145 ns ; 163.648 ns) | @@@@
[163.648 ns ; 170.999 ns) | @@@
---------------------------------------------------

StreamWriteBenchmark.WriteUsingMemory: DefaultJob
Runtime = .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2; GC = Concurrent Workstation
Mean = 170.139 ns, StdErr = 0.899 ns (0.53%), N = 22, StdDev = 4.219 ns
Min = 165.462 ns, Q1 = 167.172 ns, Median = 168.683 ns, Q3 = 172.273 ns, Max = 181.800 ns
IQR = 5.101 ns, LowerFence = 159.520 ns, UpperFence = 179.925 ns
ConfidenceInterval = [166.704 ns; 173.574 ns] (CI 99.9%), Margin = 3.435 ns (2.02% of Mean)
Skewness = 1.24, Kurtosis = 3.81, MValue = 2
-------------------- Histogram --------------------
[163.486 ns ; 170.277 ns) | @@@@@@@@@@@@@@
[170.277 ns ; 174.981 ns) | @@@@@@
[174.981 ns ; 182.328 ns) | @@
---------------------------------------------------

// * Summary *
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i7-10510U CPU 1.80GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 8.0.300
  [Host]     : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

| Method              | Mean     | Error   | StdDev  | Ratio | RatioSD | Allocated | Alloc Ratio |
|-------------------- |---------:|--------:|--------:|------:|--------:|----------:|------------:|
| WriteUsingByteArray | 155.8 ns | 3.15 ns | 6.22 ns |  1.00 |    0.06 |         - |          NA |
| WriteUsingMemory    | 170.1 ns | 3.44 ns | 4.22 ns |  1.09 |    0.05 |         - |          NA |

// * Hints *
Outliers
  StreamWriteBenchmark.WriteUsingByteArray: Default -> 11 outliers were removed, 16 outliers were detected (140.27 ns..144.75 ns, 173.99 ns..272.19 ns)
  StreamWriteBenchmark.WriteUsingMemory: Default -> 4 outliers were removed (194.28 ns..237.52 ns)

// * Legends *
  Mean        : Arithmetic mean of all measurements
  Error       : Half of 99.9% confidence interval
  StdDev      : Standard deviation of all measurements
  Ratio       : Mean of the ratio distribution ([Current]/[Baseline])
  RatioSD     : Standard deviation of the ratio distribution ([Current]/[Baseline])
  Allocated   : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
  Alloc Ratio : Allocated memory ratio distribution ([Current]/[Baseline])
  1 ns        : 1 Nanosecond (0.000000001 sec)

// * Diagnostic Output - MemoryDiagnoser *
// ***** BenchmarkRunner: End *****
Run time: 00:01:17 (77.32 sec), executed benchmarks: 2
Global total time: 00:01:23 (83.59 sec), executed benchmarks: 2
```
The final results tend to favor `byte[]` on the WriteAsync overloads. I swapped to 2,000 elements because 10k was taking a long time. Obviously, smaller loads will favour the `byte[]` version, as there is a tiny reduction in CPU overhead.
This zip contains the code, project file and some recorded test runs benchmark-memory.zip
I would appreciate it if anybody had time to add the other test cases, for read, plus what happens when other sources for the original byte array are available. I have to get back to work now before they notice I'm gone :P
This is the final code:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class StreamWriteBenchmark
{
    private readonly byte[] data;
    private readonly MemoryStream memoryStream;

    public StreamWriteBenchmark()
    {
        // Pre-allocate the byte array with 2,000 bytes
        data = new byte[2_000];
        memoryStream = new MemoryStream(); // MemoryStream to write to
    }

    // Benchmark using the byte[] overload of WriteAsync
    [Benchmark(Baseline = true)]
    public async Task WriteUsingByteArray()
    {
        memoryStream.Seek(0, SeekOrigin.Begin); // Reset the stream position
        await memoryStream.WriteAsync(data, 0, data.Length);
    }

    // Benchmark using the Memory<byte> overload of WriteAsync
    [Benchmark]
    public async Task WriteUsingMemory()
    {
        memoryStream.Seek(0, SeekOrigin.Begin); // Reset the stream position
        var memoryData = new Memory<byte>(data); // Wrap the byte array with Memory<byte>
        await memoryStream.WriteAsync(memoryData);
    }

    // Benchmark using AsMemory over the same array
    [Benchmark]
    public async Task WriteUsingAsMemory()
    {
        memoryStream.Seek(0, SeekOrigin.Begin); // Reset the stream position
        await memoryStream.WriteAsync(data.AsMemory(0, data.Length));
    }

    // Main method to run the benchmark
    public static void Main(string[] args)
    {
        _ = BenchmarkRunner.Run<StreamWriteBenchmark>();
    }
}
```
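For what it's worth, here is a hedged sketch of the missing read-side cases mentioned above (names and structure are my own, mirroring the write benchmark; it would need the same kind of peer review):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class StreamReadBenchmark
{
    private readonly byte[] buffer;
    private readonly MemoryStream memoryStream;

    public StreamReadBenchmark()
    {
        buffer = new byte[2_000];
        memoryStream = new MemoryStream(new byte[2_000]); // Pre-filled source
    }

    // Benchmark using the byte[] overload of ReadAsync
    [Benchmark(Baseline = true)]
    public async Task<int> ReadUsingByteArray()
    {
        memoryStream.Seek(0, SeekOrigin.Begin); // Reset the stream position
        return await memoryStream.ReadAsync(buffer, 0, buffer.Length);
    }

    // Benchmark using the Memory<byte> overload of ReadAsync
    [Benchmark]
    public async Task<int> ReadUsingMemory()
    {
        memoryStream.Seek(0, SeekOrigin.Begin); // Reset the stream position
        return await memoryStream.ReadAsync(buffer.AsMemory());
    }

    public static void Main(string[] args)
    {
        _ = BenchmarkRunner.Run<StreamReadBenchmark>();
    }
}
```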
It is going to take me a bit to digest all that, but that is good work. And I have to admit that I'm not necessarily an expert in this area, so my review is only worth so much.
From what I looked at so far, that looks good, although is there a reason the histogram for the `Memory<byte>` benchmark only has 3 rows while the `byte[]` has 7?
If I had to guess, the performance difference would probably be close enough that the API/rule designers would say that it is justified by the consistent API, which is debatable. I don't know that I would be entirely sold on that either.
I also think there is a good chance that, given its use of the `ArrayPool` and some other things, the `Memory<byte>` versions might gain more performance over time.
With that being said, I think we are looking at the wrong thing here in terms of the claim itself (but still an important thing to look at in general).
> Essentially I'm challenging the claim that if one has started with a byte[], then wrapping it in a Memory before calling ReadAsync or WriteAsync will grant a performance boost.
I don't think that is the claim. It says it uses memory efficiently, which to me would imply memory management, so efficiency in space, more than time.
So to test this claim, I think you would at least have to include that. Because any performance differences might be balanced out by memory management differences.
> The rule triggers if one starts with a byte[] - my suspicion being that the rule is overzealous.
I would have to go back and look at this in practice and probably the code for the rule. When I noticed it, it seemed to trigger just on `ReadAsync(byte[], ...)`, but I don't know what you mean by "if one starts with a byte[]".
I think what you are getting at, and what I would be interested in, is whether the rule tries (or could try) to check what you are doing before the `ReadAsync` and only triggers if it looks like you are trying to do some of the stuff that the `Memory<byte>` overload does for you. And if you really wanted to do it yourself, then that would be a time that you might suppress the rule.
I do think there is an argument that this rule is overzealous either way. I am not a fan of the overzealous and under-configurable rules either. Some of them seem more opinionated than they need to be and this could be one of them.
My push back here is mostly only pointing out that the claim justifying the existence of the rule may not be wrong. But even if it isn't, to your point, if you just have a byte[] and aren't doing any "crazy" stuff with it, then it doesn't seem like this rule should trigger.
> It is going to take me a bit to digest all that, but that is good work. And I have to admit that I'm not necessarily an expert in this area, so my review is only worth so much.
Thanks! I'm sorry it took so long, but I had moved on from this in a professional capacity
> From what I looked at so far, that looks good, although is there a reason the histogram for the `Memory<byte>` benchmark only has 3 rows while the `byte[]` has 7?
TL;DR: the variance on the smaller histogram is less; each bar is one range of values.
> If I had to guess, the performance difference would probably be close enough that the API/rule designers would say that it is justified by the consistent API, which is debatable. I don't know that I would be entirely sold on that either.
They should say this. I think they just assumed it would be faster.
> I also think there is a good chance that, given its use of the `ArrayPool` and some other things, the `Memory<byte>` versions might gain more performance over time. With that being said, I think we are looking at the wrong thing here in terms of the claim itself (but still an important thing to look at in general).
>
> > Essentially I'm challenging the claim that if one has started with a byte[], then wrapping it in a Memory before calling ReadAsync or WriteAsync will grant a performance boost.
>
> I don't think that is the claim. It says it uses memory efficiently, which to me would imply memory management, so efficiency in space, more than time. So to test this claim, I think you would at least have to include that. Because any performance differences might be balanced out by memory management differences.
My example does test memory, but it seems that neither technique performs any heap allocations. So by a strict definition of "uses memory more efficiently = needs less memory", their claim is false. All the options presented here allocate no extra memory. You cannot be more efficient than that.
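A quick way to sanity-check that space claim outside BenchmarkDotNet (my own sketch; exact byte counts can vary by runtime version) is `GC.GetAllocatedBytesForCurrentThread()`:

```csharp
using System;
using System.IO;

class AllocationCheck
{
    static void Main()
    {
        var data = new byte[2_000];
        var ms = new MemoryStream();

        // Warm up both paths so one-time/lazy allocations don't skew things.
        ms.WriteAsync(data, 0, data.Length).GetAwaiter().GetResult();
        ms.Seek(0, SeekOrigin.Begin);
        ms.WriteAsync(data.AsMemory()).GetAwaiter().GetResult();

        ms.Seek(0, SeekOrigin.Begin);
        long before = GC.GetAllocatedBytesForCurrentThread();
        ms.WriteAsync(data.AsMemory()).GetAwaiter().GetResult();
        long after = GC.GetAllocatedBytesForCurrentThread();

        // For a synchronously-completing MemoryStream write, this is
        // expected to print 0, matching the "-" Allocated column above.
        Console.WriteLine(after - before);
    }
}
```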
> > The rule triggers if one starts with a byte[] - my suspicion being that the rule is overzealous.
>
> I would have to go back and look at this in practice and probably the code for the rule. When I noticed it, it seemed to trigger just on `ReadAsync(byte[], ...)`, but I don't know what you mean by "if one starts with a byte[]". I think what you are getting at, and what I would be interested in, is whether the rule tries (or could try) to check what you are doing before the `ReadAsync` and only triggers if it looks like you are trying to do some of the stuff that the `Memory<byte>` overload does for you. And if you really wanted to do it yourself, then that would be a time that you might suppress the rule. I do think there is an argument that this rule is overzealous either way. I am not a fan of the overzealous and under-configurable rules either. Some of them seem more opinionated than they need to be and this could be one of them.
>
> My push back here is mostly only pointing out that the claim justifying the existence of the rule may not be wrong. But even if it isn't, to your point, if you just have a byte[] and aren't doing any "crazy" stuff with it, then it doesn't seem like this rule should trigger.
Now that we have the means to discuss this empirically, I suggest we try to bring in somebody who can actually act on this.
I just don't have time right now to set up a repo and cover all the use cases, or play games like devil's advocate, steel man, etc.
There's no reason to apologize. That is the same reason that it took me so long to get back to it. But I get the feeling you aren't interested in discussing it like this and would rather somebody else chime in, so fair enough.
Analyzer
CA1835: Prefer the memory-based overloads of ReadAsync/WriteAsync methods in stream-based classes
https://learn.microsoft.com/en-us/dotnet/fundamentals/code-analysis/quality-rules/ca1835
Problem
The rule claims that the `Memory`-based `Stream` method overloads have a more efficient memory usage than the byte array-based ones. The reason / explanation for this claim is not clear. In fact, it is not clear whether this claim is true at all.
In fact, looking at the implementation of `Stream.ReadAsync(Memory<byte> buffer, ...)`, I see that it delegates to the byte-array based implementation, plus does some (minor) additional stuff. This suggests that the analyzer rule is either false, or that it is only applicable in certain scenarios, but it is not clear in which specific ones.
Solution options
The rule should be reviewed.
- If the rule is false, it should be removed.
- If the rule is true, the rule documentation should be improved to include an explanation of the reasons for the better performance of the `Memory`-based APIs. Ideally, a similar explanation should also be included in the API docs of the affected Stream APIs. Ideally, the docs should point to some performance measurements, or to how they can be performed/replicated.
- If the rule is true only in specific circumstances, it should be clearly explained when it does and does not apply, and why. Then, the docs should include all of the above for the context of the specific circumstances in which the rule applies.
Workaround
Completely suppress the rule, as it currently only creates noise with no value. In `.editorconfig`, add:
Relevant info
It seems that this has been raised in the past: https://github.com/dotnet/docs/issues/36438 , however, that issue was not properly resolved.