Thoughts for SocketOutput fine-tuning

lodejard commented 9 years ago

Imagine the state of the SocketOutput had a "buffers pending" list and a "writes sending" count.

A buffer in this case would be an object that has an IntPtr to native memory, the actual length of that native memory, the offset of active data in the buffer, and the length of active data in the buffer.

The actual IntPtr and length of native memory would come from a pool. That pool would VirtualAlloc a larger contiguous space at once, and hand out references to buffers which are aligned and sized on page boundaries (4k).

When a Write comes into the SocketOutput, if there are "buffers pending" and the final buffer has trailing space between the end of active data and end of native memory, it would first copy as much of the outgoing data (possibly all) into the existing buffer. If there are more bytes to send which have not been copied, a new 4k buffer is acquired from the pool and the process is repeated until all bytes to send are copied to "buffers pending"

If at the end of this process the "writes sending" count is less than three, increment it, and Post a callback on the uv thread. In that callback, all buffers are removed from the "buffers pending" list and sent to uv. When the uv write callback is invoked then the buffers which were written are returned to the pool. Then, if there are "buffers pending" remove them all and send to uv. In either, if a "buffers pending" check reveals that list is empty, decrement the "writes sending" count.

The SocketOutput will need a simple lock object at both of these points to ensure the data structures aren't compromised.

Tragetaschen commented 9 years ago

The following is a quick brain dump. It's short, not rude 😊.

What's the advantage of pooling allocated memory in a garbage collected environment?
Why use native memory instead of the blittable byte[] and do a copy?
Isn't there already a lot of buffering going on in the upper layers? (buffer bloat)

benaadams commented 9 years ago

Some ideas...

Create the buffer pools per Kestrel thread and take memory from and give back to the pool on the thread associated with the socket; which should encourage CPU cache and NUMA node locality.

For something like, keep the code the same; but allocate these pooled buffers in chunks larger than 85k so they are allocated in the LoH and outside of regular GC work. Allocate buffers in a x^2-1 batch

For the pool itself you could slice the buffers with array segments and put them onto and remove from a ConcurrentQueue.

Or... you could use a ConcurrentQueue<int> which has an implied offset (count always being the same); and has no data copy or heap allocations. Use the integer to refer to both the offset and allocation batch and determine with a bitshift. So if you allocate 1024 buffers at a time and they are 4096 is size (4MB) then its:

public struct LeasedBuffer : IDisposable
{
    internal LeasedBuffer(int id, ArraySegment<byte> buffer, ConcurrentQueue<int> owningPool)
    {
        _id = id;
        _owningPool = owningPool;
        _disposed = false;
        _constructedCorrectly = true;
        Buffer = buffer;
    }
    private int _id { get; }
    private ConcurrentQueue<int> _owningPool;
    private bool _disposed;
    private bool _constructedCorrectly;
    public ArraySegment<byte> Buffer { get; }

    public void Dispose()
    {
        if (!_disposed && _constructedCorrectly)
        {
            _disposed = true;
            _owningPool.Enqueue(_id);
        }
        else
        {
            throw new WhoaCodeGoneWrongException();
        }
    }
}

ConcurrentQueue<int> _pool = new ConcurrentQueue<int>();
List<byte[]> _buffers = new List<byte[]>();
List<GCHandle> _pins = new List<GCHandle>();

public LeasedBuffer LeaseBuffer()
{
    int bufferId;
    if (_pool.TryDequeue(out bufferId))
    {
        return PrepareLeasedBuffer(bufferId);
    }
    else
    {
        int startAt;
        lock (_buffers)
        {
            if (_pool.TryDequeue(out bufferId))
            {
                return PrepareLeasedBuffer(bufferId);
            }
            else
            {
                var buffer = new byte[4096 * 1024]; // Large enough to be in LoH
                var gcHandle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
                _pins.Add(gcHandle);

                startAt = _buffers.Count * 1024;
                _buffers.Add(buffer);
            }
        }
        // go backwards else its annoying to debug inital allocs
        for (var i = startAt + 1023; i > startAt; i--)
        {
            _pool.Enqueue(i);
        }
        return PrepareLeasedBuffer(startAt);
    }
}

private LeasedBuffer PrepareLeasedBuffer(int bufferId)
{
    var poolId = bufferId >> 10; // 1024 chunks
    var offset = (bufferId & 0x3ff) << 12; // (id % 1024) * 4096
    var buffer = new ArraySegment<byte>(_buffers[poolId], offset, 4096);
    return new LeasedBuffer(bufferId, buffer, _pool);
}

So this bit:

 var copy = new byte[buffer.Count];
 Array.Copy(buffer.Array, buffer.Offset, copy, 0, buffer.Count);
 buffer = new ArraySegment<byte>(copy);

Would become

var leasedBuffer = _pool.LeaseBuffer(); // when current buffer full
var copy = leasedBuffer.Buffer;
Array.Copy(buffer.Array, buffer.Offset, copy.Array, copy.Offset, buffer.Count);
buffer = leasedBuffer.Buffer;

Though you'd need to keep track of last write, how much is left in buffer etc as you have stated above. When you are done with the buffer return it to pool with:

leasedBuffer.Dispose();

Which all it does is add an int back to the queue; rather than either copying a struct or having a bunch of pooled objects on the heap forever if they were classes.

Also when pooling, don't use the first 32 or last 32 bytes of the buffer; or always leave 64 bytes free at the end if you prefer. So in this example only 4032 of 4096 bytes would be in active use. This is to ensure you don't get false sharing/caching effects on the CPU and false data contention between threads.

Obviously you want to allocate an initial LoH buffer for each of the threads at startup; after that do the usual:

GC.Collect(2, GCCollectionMode.Forced, true);
GC.WaitForPendingFinalizers();
GC.Collect(2, GCCollectionMode.Forced, true);

Add the compact LoH option if you've got it; but only seems to be in 4.6 not core at this time. After that point pin the buffers; so it doesn't mess other stuff up before the GC.

Using managed memory means you can still do copies etc using BlockCopy rather than doing anything a bit weirder with Marshal.Copy or some unsafe pointers. Pinning the buffers in alloc both at start up; and in the above code means you don't have to pin on each native transition. Having the buffers in the LoH means the pinned buffers aren't getting in they way of regular GC; and allocating large buffers means there is only a single reference for the GC to briefly check to see if its still in use for very many buffers.

NUMA node allocation is a bit trickier; but may be helped by forcing the threads to have processor affinity; then doing the start up allocations on the running threads; then pausing and running the collect; but cross that bridge when you come to it - probably a bit of a rabbit hole...

Got a bit carried away; hope some of it helps...

lodejard commented 9 years ago

I love all of these ideas. The queue would probably need to be of some kind of object that would re-populate the slot if it was garbage collected, to avoid ending up with gaps if an item isn't returned for any reason.

Pre-allocating over 64k per thread on startup might be a bit heavy, though I agree with the idea of loop-memory affinity.

Do you have links to a reference that talks about avoiding the first and last 32 bytes in a physical page? I know you want to pin upfront and align the slices on physical boundaries, but I wasn't aware that the first and last bytes of a memory page were treated differently than the center.

benaadams commented 9 years ago

Pre-allocating over 64k per thread on startup might be a bit heavy, though I agree with the idea of loop-memory affinity.

If you have a thread per core, allocating 100k per thread on a 16 core machine is only 1.56MB? And allocation should be pretty much instant. Though that would only give you 25 buffers per thread at 4096 per buffer.

Do you have links to a reference that talks about avoiding the first and last 32 bytes in a physical page?

The x86/x64 CPU<->RAM link reads and writes in atomic 64 byte chunks, which it puts in its 64 byte cache lines. It also does pre-fetching to get the next 64 byte chunk which is why arrays are super fast vs linked list (though the .net GC does it best to optimise for this)

If you want to deep dive, this is a good video by Herb Sutter @hsutter-msft: https://www.youtube.com/watch?v=L7zSU9HI-6I (1hr 51min)

However; this 64 byte atomicity means if two threads on two different cores (or worse socket) operating on adjacent chunks of memory within the same cache line they have to invalidate each others cache and ping pong it back through RAM. As RAM is 200x slower than L1 cache and 20x slower than L2 cache this can cause CPU stalls; especially as the CPU runs at least x4 faster than even L1 cache.

If you are operating on a nice aligned 64 byte multiples (like 4096) then you are probably ok; however its still going to pre-fetch the start of another cache line; which it will then invalidate. So the 64 byte 'guard' buffer keeps down the hardware synchronisation the memory controller is doing. Also with aligned memory you are better adding a full 64 byte chunk at end.

You don't want to do this everywhere just where you may hit nearby write contention; which pooling will exacerbate; its often one of the main reasons why a simple implementation of an embarrassingly parallel algorithm won't scale with core count; even though there is no obvious code reason for it.

Lots of stuff on false sharing: https://en.wikipedia.org/wiki/False_sharing

http://mechanical-sympathy.blogspot.co.uk/2011/07/false-sharing.html http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206 https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads http://www.codeproject.com/Articles/51553/Concurrency-Hazards-False-Sharing

Generally false sharing isn't something to worry about in day-to-day programming; but if Kestrel is doing 164k rps the performance is already way beyond that; that's a request served every 6 microseconds!

benaadams commented 9 years ago

The RIO server (which is disgusting heavy on RAM usage) uses some of these techniques:

Buffer size: (MTU - (IPv4 + TCP header)) * 4 = 5840 bytes, 4 TCP packets worth but not divisible by 64. Real buffer size: Buffer size + Rounding + 64 byte guard = 5952 bytes (93 * 64 bytes)

Allocate GC Pin Slice Round robin Thread Accept Get buffers from thread Managed Copy Interop with managed Memory and no extra pinning

One of the reasons for the high RAM use is I don't recycle the buffers very fast as RIO has no OS level buffers so if you don't have them waiting you will drop packets, and you have to register the buffers up front; regular winsock doesn't have this issue.

Also the servers generally have many GBs of unused RAM so using 500MB for buffers isn't hugely problematic for my use case. However running on an IoT or in many containers it might be more important - though shouldn't be too hard to change.

The buffer pool in the comment above is better than the one in the RIO server; though I wrote it in the comments rather than in VS so it might not compile :-/

The other good technique with async is "one in the pipe" so kick off an async process; and await the previous one; which is probably completed by that point so does a fast path on the result.

I don't do a .ConfigureAwait(false) here because I can't as its a custom awaiter and I think that's solved by using the ICriticalNotifyCompletion interface; but otherwise they all should be.

The Receive tasks are pooled reset and resused as per #165 and it only allocates memory on a socket accept time; but not through using the socket.

Other than the IOCP stuff, and RIO; that's probably mostly a break down of what it does... Well there are other weirder things; but not sure how important they are.

Again HTH

benaadams commented 9 years ago

buffers which are aligned and sized on page boundaries

Looks like page alignment might be a bad thing:

How Misaligning Data Can Increase Performance 12x by Reducing Cache Misses

benaadams commented 9 years ago

Looks like .ConfigureAwait is no longer necessary as asp.net 5 drops AspNetSynchronizationContext which was used in previous versions - unless you are supporting asp.net < 5 & mvc < 6 on Kestrel, when it would still be important. ref: https://github.com/aspnet/Mvc/issues/3057

muratg commented 9 years ago

@lodejard @DamianEdwards is this something we'll be able to do in RC1?

muratg commented 9 years ago

@lodejard @DamianEdwards moving out for now. If you want it in RC1, let's prioritize against other RC1 kestrel work.

aspnet / KestrelHttpServer

Thoughts for SocketOutput fine-tuning #117