dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

.NET 6 RC2 websocket receive allocations #61011

Closed ShakesB33r closed 3 years ago

ShakesB33r commented 3 years ago

I ran the ping pong allocation test as described in [this](https://github.com/dotnet/runtime/pull/56282) thread under .NET 6 RC2, which I can see does indeed have the PoolingAsyncValueTaskMethodBuilder attribute applied to ReceiveAsyncPrivate (see [here](https://github.com/dotnet/runtime/blob/6b11d64e7ef8f1685570f87859a3a3fe1c177d0f/src/libraries/System.Net.WebSockets/src/System/Net/WebSockets/ManagedWebSocket.cs#L654)), and yet I see allocations:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1288 (21H1/May2021Update)
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT [AttachedDebugger]
  DefaultJob : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT

| Method   | Mean     | Error   | StdDev  | Gen 0     | Gen 1     | Allocated |
|----------|---------:|--------:|--------:|----------:|----------:|----------:|
| PingPong | 135.8 ms | 2.70 ms | 3.22 ms | 8333.3333 | 1000.0000 | 78 MB     |

Am I missing something?

By way of context, I performed the above test after profiling an app I'm writing to discover that currently the only source of steady-state allocations is the async operations in ClientWebSocket receives, as everything else is written to be garbage free.

ghost commented 3 years ago

Tagging subscribers to this area: @dotnet/ncl. See info in area-owners.md if you want to be subscribed.

Author: ShakesB33r
Assignees: -
Labels: `area-System.Net`, `tenet-performance`, `untriaged`
Milestone: -
stephentoub commented 3 years ago

and yet I see allocations

Am I missing something?

It doesn't ensure it won't allocate: it enables pooling such that a large chunk of the allocation is likely to be avoided, but it doesn't guarantee there won't be any allocation. The pooling needs to trade off the cost of storing lots of objects, the cost of synchronization to enable one thread to access an object returned by another thread, etc., against the benefits of avoiding an allocation. You can see there are allocations listed in the allocated column even in the PR you linked to.
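For reference, the opt-in being discussed looks roughly like the sketch below. `ReceiveLoopAsync` is a hypothetical method used for illustration, not the runtime's actual code:

```csharp
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

public static class PoolingExample
{
    // Opting a ValueTask-returning async method into the pooling builder
    // (requires C# 10 / .NET 6). This makes the state-machine box *eligible*
    // for reuse via a bounded cache; when the cache is empty, contended, or
    // the operation completes on a different thread, a fresh box can still
    // be allocated, so zero allocation is not guaranteed.
    [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder<>))]
    public static async ValueTask<int> ReceiveLoopAsync() // hypothetical
    {
        await Task.Yield();
        return 0;
    }
}
```

The open generic `typeof(PoolingAsyncValueTaskMethodBuilder<>)` form is how the attribute is applied to generic `ValueTask<T>`-returning methods; the non-generic `PoolingAsyncValueTaskMethodBuilder` type exists for plain `ValueTask`.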

ShakesB33r commented 3 years ago

Hi @stephentoub, thank you for the quick response, much appreciated.

What you say is fair enough, but when you submitted the PR to enable ValueTask pooling, the test you ran showed very nearly zero allocation and zero GC, which is in contrast to my run of the exact same test, which shows significant allocation and GC. That seems strange at a minimum. I tried to run the same test under .NET 5 to get a comparison point, but unfortunately the same test code does not work there. If you could suggest a way, portable between framework versions, to achieve the result of the following line, I would be happy to run that test.

(Stream Stream1, Stream Stream2) streams = ConnectedStreams.CreateBidirectional();
stephentoub commented 3 years ago

I just tried running this, which is a completely self-contained repro that uses NetworkStream instead of that ConnectedStreams (which is a helper in this repo's tests):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Running;
using System;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using System.Net.WebSockets;
using System.Threading;
using System.Threading.Tasks;

[MemoryDiagnoser]
public partial class Program
{
    public static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    private class Connection
    {
        public readonly WebSocket Client, Server;
        public readonly Memory<byte> ClientBuffer = new byte[256];
        public readonly Memory<byte> ServerBuffer = new byte[256];
        public readonly CancellationToken CancellationToken = default;

        public Connection()
        {
            using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
            listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
            listener.Listen();
            var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
            client.Connect(listener.LocalEndPoint);
            Socket server = listener.Accept();

            Client = WebSocket.CreateFromStream(new NetworkStream(client, ownsSocket: true), isServer: false, subProtocol: null, Timeout.InfiniteTimeSpan);
            Server = WebSocket.CreateFromStream(new NetworkStream(server, ownsSocket: true), isServer: true, subProtocol: null, Timeout.InfiniteTimeSpan);
        }
    }

    private Connection[] _connections = Enumerable.Range(0, 256).Select(_ => new Connection()).ToArray();
    private const int Iters = 1_000;

    [Benchmark]
    public Task PingPong() =>
        Task.WhenAll(from c in _connections
                     select Task.WhenAll(
                      Task.Run(async () =>
                      {
                          for (int i = 0; i < Iters; i++)
                          {
                              await c.Server.ReceiveAsync(c.ServerBuffer, c.CancellationToken);
                              await c.Server.SendAsync(c.ServerBuffer, WebSocketMessageType.Binary, endOfMessage: true, c.CancellationToken);
                          }
                      }),
                      Task.Run(async () =>
                      {
                          for (int i = 0; i < Iters; i++)
                          {
                              await c.Client.SendAsync(c.ClientBuffer, WebSocketMessageType.Binary, endOfMessage: true, c.CancellationToken);
                              await c.Client.ReceiveAsync(c.ClientBuffer, c.CancellationToken);
                          }
                      })));
}

I ran it with:

dotnet run -c Release -f net5.0 --filter ** --runtimes net5.0 net6.0

to compare .NET 5 and .NET 6, and with that I got:

| Method   | Runtime  | Mean    | Error    | StdDev   | Ratio | Allocated  |
|----------|----------|--------:|---------:|---------:|------:|-----------:|
| PingPong | .NET 5.0 | 1.558 s | 0.0310 s | 0.0653 s |  1.00 | 180,230 KB |
| PingPong | .NET 6.0 | 1.516 s | 0.0302 s | 0.0424 s |  0.99 | 418 KB     |

I don't see a problem here.

ShakesB33r commented 3 years ago

| Runtime  | Mean    | Error    | StdDev   | Gen 0      | Gen 1     | Allocated |
|----------|--------:|---------:|---------:|-----------:|----------:|----------:|
| .NET 5.0 | 1.139 s | 0.1044 s | 0.0271 s | 16000.0000 | 4000.0000 | 176 MB    |
| .NET 6.0 | 1.043 s | 0.1663 s | 0.0257 s | 0          | 0         | 420 KB    |

My results agree with yours, thank you for this helpful example. I will try to identify what I have missed and close accordingly; it could simply be, as you said, that there is no guarantee of being allocation free.

If it is due to the lack of such guarantee, is there any way to control the pooling behaviour? Eliminating GC pauses is quite important for me due to latency concerns, even at the expense of throughput.

stephentoub commented 3 years ago

If it is due to the lack of such guarantee, is there any way to control the pooling behaviour?

Not with the built-in pooling implementation. You can of course write your own async method builder (e.g. start with the source for the built-in pooling builder and modify it according to your needs), though that is non-trivial complexity and work.

ShakesB33r commented 3 years ago

I did actually try fiddling with the async method builder; so far I have managed to build my modified version but not yet to use it successfully. I have two ideas:

1. Make the thread-local storage a small array (e.g. 4 items), which is mentioned as an idea in the code; this is quick and efficient to use, but not very space efficient.

2. Replace the per-core cache array with an object pool implementation. I think this option has more mileage, and there are multiple implementations already within the .NET ecosystem that could be candidates. I believe this could increase reuse without noticeably hurting the time to access the cache.
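To make the first idea concrete, here is a minimal sketch of a small fixed-size per-thread cache with an allocation fallback. The type and member names are mine, not the runtime's, and this deliberately omits the per-core layer:

```csharp
using System;

// Hypothetical illustration of idea 1: a tiny [ThreadStatic] array cache.
// Rent scans the current thread's slots; on a miss it allocates.
public sealed class PerThreadPool<T> where T : class
{
    private const int SlotsPerThread = 4; // small, as suggested above

    [ThreadStatic]
    private static T[] t_slots;

    private readonly Func<T> _factory;

    public PerThreadPool(Func<T> factory) => _factory = factory;

    public T Rent()
    {
        T[] slots = t_slots;
        if (slots != null)
        {
            for (int i = 0; i < slots.Length; i++)
            {
                T item = slots[i];
                if (item != null)
                {
                    slots[i] = null; // take ownership of the cached object
                    return item;
                }
            }
        }
        return _factory(); // cache miss: fall back to allocation
    }

    public void Return(T item)
    {
        T[] slots = t_slots ??= new T[SlotsPerThread];
        for (int i = 0; i < slots.Length; i++)
        {
            if (slots[i] == null)
            {
                slots[i] = item;
                return;
            }
        }
        // All slots full: drop the object and let the GC collect it.
    }
}
```

Note that the trade-off described earlier in the thread is visible in this sketch: a `Return` on thread B populates B's slots, which a `Rent` on thread A never sees, so operations that complete on a different thread miss the cache entirely.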

Any advice or feedback would be appreciated.

stephentoub commented 3 years ago

I believe this could increase reuse without noticeably hurting time to use the cache.

I expect you'll find it can make a very measurable and negative impact on throughput, especially under load where lots of threads are all contending, but also simply because you're increasing the time it takes to search for a valid object.

It's also not just about the time it takes to access the cache or the synchronization costs incurred. If the object was being used on one core and now all of a sudden needs to be used on another, and in a writable manner as these objects are, you end up paying increased costs around cache thrashing. And on larger machines where memory access costs aren't uniform, this kind of frequent misalignment between where the object's memory was used and which core is using it can have a visible impact.

Also keep in mind that pooling like this can have other negative impacts on the system; e.g. it is very likely to increase gen2 references to gen0 objects, which can actually increase pause times as it makes some gen0 collections slower. This all contributes to it being really important to measure: a lot of work went into striking the right balance in the implementation shipped here in .NET 6. That's not to say it can't be improved, but any improvements would be a lot more subtle than just substituting in an object pool like the ones cited.

I'm going to close this issue, as the system seems to be working as intended. Thanks.

ShakesB33r commented 3 years ago

Interesting, thanks for the added context. I did manage to run my modified builds and found the results below on a ping pong test with 1024 sockets x 1000 iters using the connected streams method. I believe the connected streams method has lower overhead (it runs roughly 10x faster than the network stream version) and as such does produce allocation and GC, as the increased throughput means the smaller cache size is not enough to satisfy all requests.

The results are interesting, I invite you to take a look. Curiously, the full .NET 6 implementation barely reduced allocation versus per-thread caching only. Also curiously, I was not able to reduce allocation by doubling up on both the per-thread and per-core caching. It is worth noting that doubling up the caching did not reduce allocation in my real-world app either.

| Method                                                        | Mean     | Error     | StdDev   | Gen 0       | Gen 1     | Allocated |
|---------------------------------------------------------------|---------:|----------:|---------:|------------:|----------:|----------:|
| No pooling                                                    | 741.4 ms |  85.02 ms | 22.08 ms | 109000.0000 | 8000.0000 | 1 GB      |
| Per-thread pooling only                                       | 577.6 ms |  43.82 ms | 11.38 ms |  33000.0000 | 6000.0000 | 314 MB    |
| .NET 6 implementation                                         | 646.5 ms |  59.23 ms | 15.38 ms |  33000.0000 | 5000.0000 | 313 MB    |
| .NET 6 implementation with 2x per-thread and per-core pooling | 709.2 ms | 100.3 ms  | 26.05 ms |  33000.0000 | 5000.0000 | 313 MB    |

stephentoub commented 3 years ago

Curiously, the full .NET 6 implementation barely reduced allocation vs only per thread caching.

There are many different kinds of workloads, with different characteristics. If the microbenchmark is such that the new operations are started immediately after a previous one completes, it will generally be satisfied from the per-thread cache; that is often not the case in more real-world usage. The worst case is when ops are started from one group of threads and complete on another, in which case the thread caches are typically useless. So, it varies.