Simple modifications to enable zero-allocations for low-latency applications

augenix commented 1 year ago

Use of the Capnp.Net.Runtime in low-latency applications requires the use of a preallocation strategy for the Segment buffers. To make that work, the application must be able to recover the Segment buffers and return them to the allocation pool when they are no longer needed. This is made possible with two simple modifications.

Make the Segments property in DeserializerState a public property.
Modify the Code Generation to make public the DeserializerState ctx field in message READER in the code-behind.

These two modifications expose the DeserializerState ctx field and from there the Segments property is exposed and the Segments buffers can be returned to the pre-allocation pool.

If you can give me a hint on were to change the Code Generation I will do the modification and a PR.

c80k commented 1 year ago

In fact, DeserializerState.Segments already is a public property. Why should the ctx field made public, too? Could you point out some details on the intended ownership model? My initial thought was making the WireFrame type IDisposable, such that WireFrame.Dispose() returns the buffers to the pool. Both DeserializerState and the READER just "borrow" the ownership of the buffers. The main loop (calling Framing.ReadSegments) is be responsible to dispose the WireFrame after processing. To prevent breaking existing functionality (especially RPC), we might introduce a new class FrameReaderPA (PA = "with pre-allocation") which manages the memory pool and is meant as a replacement for the static Framing.ReadSegments() method. Hence, FrameReaderRA.ReadSegments() creates WireFrame instances with pre-allocated buffers, whereas Framing.ReadSegments() stays unchanged. Then it's up to the user whether to use the pre-allocation feature.

augenix commented 1 year ago

My application passes the message READER as a parameter in an event invocation to multiple consumers. When the use of the reader is finished, the final consumer calls ReturnArraysBuffer.Return(reader.ctx.Segments). The IReadOnlyList <Memory < ulong >> is returned to the preallocation pool where it is cast back to Memory < ulong >[] and finally returned to one pool for the Memory arrays and another pool for the ulong arrays.

The problem with making WireFrame dispose of the segments is that multiple consumers (running on separate threads) may be still processing the data in the Reader. If you dispose at the end of Framing.ReadSegments then the underlying buffer is destroyed and the Readers hold corrupted data. The Segments have to be tracked in each consumer and disposed of by the consumer when they use is completed. Thus, the need for a way to recover the Segments from reader.ctx.Segments when the consumer is done.

I recreated Framing.ReadSegments for my application for two purposes. First, I modified it to use the preallocated Memory < ulong > and preallocated ulong[] to build the Segments buffers instead of instantiating them in line (root cause of the GC issues). It also uses a BinaryReader that is instantiated at the class level. Second, instead of a blocking Read on the BinaryReader I look to see if the header/message data is available in the Socket Buffer. If not then go process outgoing messages and such and come back later. This requires a stateful method with states - ReadMessageHeader, ReadSegmentHeader (scount > 1), ReadSegmentData. The result is a WireFrame with the Segments buffer inside the IReadOnlyList <>. The WireFrame is put into a circular buffer to be processed by the application. It is on the other side of this buffer that the message type is determined and it is sent to the subscribed consumers in an event call. If there are no consumers subscribed for that message type then the Segments buffers are returned to the pre-allocation pool at that point. If there are subscribers then the subscribed consumer must properly return the Segments buffers to the pre-allocation pool when it is finished using the reader.

The FrameReader I describe above could be the new FramReaderPA. This seems like a reasonable idea to me and folks can decide which one they want to use.

Here are screenshots of dotMemory profiler before and after.

Before - here you can see GC happening 2-3 times per second. The majority of the allocations are for the segment buffers. Also note that there are 560k messages over an 18 second period. This can be much higher.

After - All of the message buffer and BinaryReader allocations are removed. What remains is some final optimization in my application.

c80k commented 1 year ago

The READER's DeserializerState ctx field is generated in ReaderSnippetGen.cs If you need this public, something like

        IEnumerable<MemberDeclarationSyntax> MakeReaderStructMembers()
        {
            yield return FieldDeclaration(
                VariableDeclaration(
                    _names.Type<Capnp.DeserializerState>(Nullability.NonNullable))
                .AddVariables(_names.ReaderContextField.VariableDeclarator))
                .AddModifiers(Readonly)
                .AddModifiers(Public); // this line is the new one. Make the field public.

should do the job (disclaimer: it's a long time ago that I wrote the code, and I didn't try it myself...)

drew-512 commented 1 year ago

Would love to see this PRed into main.

Great work @augenix -- makes total sense to ensure that various pooling approaches are possible (and are some of the biggest drawbacks of C# protobufs / grpc).

augenix commented 1 year ago

@drew-512 thank you for the feedback. I will soon get to the PR, but need to catch up on other items in the critical path.

@c80k Perhaps the better approach to this is not to make the ctx field public, but to leave it private and add a getter method that is public. This leaves the current encapsulation intact.

The implementation of various object preallocation and recovery strategies is dependent on the needs of the application. What works for me may not work in other use cases. Therefore, I don't think it makes sense to PR my implementation. All that is needed is to PR the method that allows recovery of the Segments from the Reader and users can go from there. I am happy to create something in the documentation that shows an example based on my approach.

c80k / capnproto-dotnetcore

Simple modifications to enable zero-allocations for low-latency applications #73