davidfowl / BedrockFramework

High performance, low level networking APIs for building custom servers and clients.
MIT License

[WIP] Initial WebSocket protocol implementation #73

Open mattnischan opened 4 years ago

mattnischan commented 4 years ago

Addresses #62. This is not fully complete yet. The read side API looks like how I want it, but the write side is still very basic and not the shape I would like yet. Needs a bunch more tests and some final protocol details.

mattnischan commented 4 years ago

Initial benchmarks on the read side without digging into optimization:

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
AMD Ryzen Threadripper 1950X, 1 CPU, 32 logical and 16 physical cores
.NET Core SDK=3.1.100
  [Host]     : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
  DefaultJob : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT

|                      Method |       Mean |    Error |  StdDev | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------------------------- |-----------:|---------:|--------:|------:|--------:|-------:|------:|------:|----------:|
|         WebSocketReadMasked |   597.4 ns |  7.14 ns | 6.68 ns |  1.00 |    0.00 | 0.0277 |     - |     - |     120 B |
| WebSocketProtocolReadMasked | 1,210.5 ns | 10.04 ns | 9.39 ns |  2.03 |    0.03 | 0.0153 |     - |     - |      64 B |

mattnischan commented 4 years ago

Got rid of SequenceReader in WebSocketFrameReader and optimized some of the sync ValueTask paths in WebSocketMessageReader to avoid some state machine creation. More to do there.

Did a little work on lowering the garbage. Was getting a bunch of boxing of WebSocketFrameReader since ProtocolReader.ReadAsync() takes an interface. No sense in keeping those structs if they're just gonna get boxed. I can probably do that too with the payload reader, but will need to add methods to it to reset its state so it can be shared.
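To illustrate the boxing problem described above, here is a minimal sketch. The names `IFrameReader`, `CountingReader`, and `BoxingDemo` are hypothetical stand-ins, not Bedrock's actual types. Passing a struct where an interface is expected boxes it, while a generic parameter constrained to the interface does not. This is one common alternative, not necessarily the change made in this PR.

```csharp
using System;

// Hypothetical stand-in for a reader interface; not Bedrock's real API.
public interface IFrameReader
{
    int Read(ReadOnlySpan<byte> input);
}

// A mutable struct reader, similar in spirit to a struct-based frame reader.
public struct CountingReader : IFrameReader
{
    public int Calls;

    public int Read(ReadOnlySpan<byte> input)
    {
        Calls++;
        return input.Length;
    }
}

public static class BoxingDemo
{
    // Interface-typed parameter: a struct argument is boxed (one heap
    // allocation per call), and the callee mutates the boxed copy.
    public static int ReadViaInterface(IFrameReader reader, ReadOnlySpan<byte> input)
        => reader.Read(input);

    // Generic parameter constrained to the interface: no boxing, and
    // passing by ref lets state changes reach the caller's struct.
    public static int ReadViaGeneric<TReader>(ref TReader reader, ReadOnlySpan<byte> input)
        where TReader : struct, IFrameReader
        => reader.Read(input);
}
```

With the interface overload the caller's `Calls` field stays unchanged, because only the boxed copy is mutated. With the generic overload it is incremented in place.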

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
AMD Ryzen Threadripper 1950X, 1 CPU, 32 logical and 16 physical cores
.NET Core SDK=3.1.100
  [Host]     : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
  DefaultJob : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT

|                      Method |     Mean |    Error |   StdDev | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------------------------- |---------:|---------:|---------:|------:|--------:|-------:|------:|------:|----------:|
|         WebSocketReadMasked | 617.6 ns | 12.34 ns | 16.48 ns |  1.00 |    0.00 | 0.0286 |     - |     - |     120 B |
| WebSocketProtocolReadMasked | 969.3 ns | 14.48 ns | 12.84 ns |  1.57 |    0.05 | 0.0095 |     - |     - |      40 B |

davidfowl commented 4 years ago

> Got rid of SequenceReader in WebSocketFrameReader and optimized some of the sync ValueTask paths in WebSocketMessageReader to avoid some state machine creation. More to do there.

Would be good to understand why https://github.com/davidfowl/BedrockFramework/issues/69

mattnischan commented 4 years ago

Creating a SequenceReader was showing up as a hot path. Just getting rid of creating an instance of it and moving to parsing via Span got me a 20% win by itself.
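As a rough illustration of Span-based parsing, the sketch below reads a WebSocket frame header (layout per RFC 6455) directly from a `ReadOnlySpan<byte>` without constructing a `SequenceReader`. The class and method names are hypothetical, and this is not the code from the PR. It assumes the header bytes are contiguous; a multi-segment `ReadOnlySequence<byte>` would need a fallback path.

```csharp
using System;
using System.Buffers.Binary;

public static class FrameHeaderParser
{
    // Returns false when the buffer does not yet hold a complete header.
    public static bool TryParseHeader(
        ReadOnlySpan<byte> buffer,
        out bool fin, out int opcode, out bool masked,
        out ulong payloadLength, out int headerLength)
    {
        fin = false; opcode = 0; masked = false; payloadLength = 0; headerLength = 0;
        if (buffer.Length < 2) return false;

        int length = 2;
        ulong payload = (ulong)(buffer[1] & 0x7F);

        if (payload == 126)
        {
            // 16-bit extended payload length, network byte order.
            if (buffer.Length < 4) return false;
            payload = BinaryPrimitives.ReadUInt16BigEndian(buffer.Slice(2));
            length = 4;
        }
        else if (payload == 127)
        {
            // 64-bit extended payload length, network byte order.
            if (buffer.Length < 10) return false;
            payload = BinaryPrimitives.ReadUInt64BigEndian(buffer.Slice(2));
            length = 10;
        }

        // A masked frame carries a 4-byte masking key after the length.
        bool isMasked = (buffer[1] & 0x80) != 0;
        if (isMasked) length += 4;
        if (buffer.Length < length) return false;

        fin = (buffer[0] & 0x80) != 0;
        opcode = buffer[0] & 0x0F;
        masked = isMasked;
        payloadLength = payload;
        headerLength = length;
        return true;
    }
}
```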

If we're forced to take that overload and always create a SequenceReader, it's going to be hard to hit BCL numbers. As it stands, the BCL version works so much closer to the raw level that even after mangling my code so that none of the sync paths allocate async state machines any more (except in ProtocolReader, which I haven't touched), I'm only getting an additional 5-10% over what you see in the latest bench.

mattnischan commented 4 years ago

Here's the latest with a quiet machine: looks like maybe not even that 5-10%. It comes up allocation-free now, though (although if you actually profile, ProtocolReader.Read still allocates a state machine).

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
AMD Ryzen Threadripper 1950X, 1 CPU, 32 logical and 16 physical cores
.NET Core SDK=3.1.100
  [Host]     : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
  DefaultJob : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT

|                      Method | Categories |     Mean |   Error |  StdDev | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------------------------- |----------- |---------:|--------:|--------:|------:|--------:|-------:|------:|------:|----------:|
|               WebSocketRead |   Unmasked | 476.0 ns | 4.86 ns | 4.55 ns |  1.00 |    0.00 | 0.0286 |     - |     - |     120 B |
|       WebSocketProtocolRead |   Unmasked | 744.0 ns | 4.38 ns | 4.10 ns |  1.56 |    0.02 |      - |     - |     - |         - |
|                             |            |          |         |         |       |         |        |       |       |           |
|         WebSocketReadMasked |     Masked | 559.8 ns | 3.30 ns | 2.76 ns |  1.00 |    0.00 | 0.0286 |     - |     - |     120 B |
| WebSocketProtocolReadMasked |     Masked | 882.4 ns | 4.02 ns | 3.76 ns |  1.58 |    0.01 |      - |     - |     - |         - |

davidfowl commented 4 years ago

https://github.com/davidfowl/BedrockFramework/issues/20

mattnischan commented 4 years ago

Yeah, I was thinking about how to eliminate that. I can do it for the sync path just by doing what I've done here: checking IsCompletedSuccessfully and having a non-state-machine path. But that won't help the async path (nor does it in my code currently). I have another idea, but I'll throw it on that issue.
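The sync fast path mentioned above generally looks something like the following sketch. `FastPath` and `ReadLengthAsync` are hypothetical names used only for illustration. A non-async wrapper checks `IsCompletedSuccessfully` and returns the result directly, and only a genuinely pending operation goes through an `async` local function.

```csharp
using System;
using System.Threading.Tasks;

public static class FastPath
{
    // Non-async wrapper: no state machine is involved unless the
    // inner operation is actually pending.
    public static ValueTask<int> ReadLengthAsync(ValueTask<byte[]> inner)
    {
        if (inner.IsCompletedSuccessfully)
        {
            // Completed synchronously: wrap the result and return.
            return new ValueTask<int>(inner.Result.Length);
        }

        // Pending (or faulted): fall back to the async path.
        return AwaitSlow(inner);

        static async ValueTask<int> AwaitSlow(ValueTask<byte[]> pending)
        {
            byte[] data = await pending.ConfigureAwait(false);
            return data.Length;
        }
    }
}
```

As the comment above notes, this only helps when the inner operation completes synchronously. The pending case still pays for the async state machine.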

mattnischan commented 4 years ago

Got the shape of the write API in a decent place finally and got some smoke tests implemented. Been taking longer than I wanted.

Should be getting some benchmark numbers in the next day or two.

davidfowl commented 3 years ago

Next year or 2 😄

mattnischan commented 3 years ago

😬 I know...this year got a little crazy...

Are you thinking of bringing this project back into the forefront?