dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.47k stars 4.76k forks source link

[API Proposal] Add ReadOnlySequence<T>.TryGetChunk() extension method #27755

Open thargy opened 6 years ago

thargy commented 6 years ago

A common use case for a ReadOnlySequence<T> is to read data in packets of >1 element at a time. For example when reading an integer from a byte stream you need blocks of 4 bytes at a time, similar for a long (8 bytes). Hashing and cryptography algorithms often work in chunks follow by a potential tail, where chunks are often relatively small of 2, e.g. 32 bytes.

Searching I couldn't find any API for easily accessing a sequence in this way, instead the logical choice is to either enumerate into a buffer, or use a CopyTo to get data from the sequence into a Span, e.g.

int i = 0;
long blockSize = this.BlockSize;
long remainder = this.sequence.Length;
Span<byte> b = this.buffer;
while (remainder > 0)
{
    if (remainder < blockSize)
    {
        blockSize = remainder;
    }

    this.sequence.Slice(i, blockSize).CopyTo(b);
    i += this.BlockSize;
    remainder -= this.BlockSize;

    // Use b here
}

However, this always copies data, when in a great many cases the underlying sequence segment can be read without break, and copies only need to occur when traversing from one segment to the next. For example, consider the case of reading a long network stream in chunks of 32 bytes at a time, (e.g. to generate a hash), and the stream is presented in 4k segments.

The solution is to leverage the ReadOnlySequence<T>.TryGet() method, which returns the current sequence in a segment, allowing you to read a larger block, but this code is easy to get wrong. As this is such a common use case it would be nice to have an extension method in the framework that already implements the solution, which can allow for optimisations in future under the hood.

To that end I propose a new method ReadOnlySequence<T>.TryGetChunk() be added to BufferExtensions.cs. I have prototypes, tested, and performance analysed it at the following gist. In essence the method could like so:

public static bool TryGetChunk<T>(this in ReadOnlySequence<T> input, ref SequencePosition position, out ReadOnlySpan<T> output, int size, bool advance = true, in Span<T> buffer = default);

This takes a sequence, and position, and returns a chunk of length size as a ReadOnlySpan<T>. It only performs a copy when necessary, and instead returns the underlying Span when sufficient. Only when at the end of the sequence will it return a tail of less than size elements. Like TryGet() it has an advance parameter, though it doesn't move to the next segment, but respects the requested size. This makes processing blocks, even those as small as 1 byte, much easier.

For example the above code becomes:

SequencePosition position = this.sequence.Start;
Span<byte> b = this.buffer;
while (this.sequence.TryGetChunk(ref position, out ReadOnlySpan<byte> output, this.BlockSize, true, b))
{
    // Use output here, gauranteed to be BlockSize bytes, except on last segment
}

I haven't included the full test harness in the gist for compactness, but I tested the implementation with all kinds of segments (including ones with pads on either end of each segment), and scenarios, and it was robust. Although it does use SequencePosition.GetObject() in one position which may not be allowed in future according to dotnet/runtime#25837 (it would be easy to rewrite without though).

I've also included the a 3hr43m benchmark run, based on various sequence profiles and the results are stunning. On a dry benchmark the new method matched Slice-CopyTo on the worst cases (some single byte sequences, and some single segment sequences) and normally out performed it (about 40% faster in almost all scenarios). However, when running a thorough benchmark the performance improvements stand out massively (probably due to memory pressure of the massive copy operations underload). Although more detailed lab work would be needed, it seems to establish that there is merit in using such an approach.

Again, the prototype in the gist is heavily commented and represents a detail API proposal. You will note I've proposed passing an optional buffer into the TryGetChunk method, this allows the method to be supplied with a block from an ArrayPool<T> and, allows it to achieve zero-allocations. If this isn't supplied, the method will create a new T[size] array under the hood.

I plan on using this in a new, extremely fast xxHash64 implementation. Having sequences allows chaining blocks of data to pass into the hasher, without actually combining them into a contiguous block, or making repeated calls to the hash algorithm.

benaadams commented 6 years ago

Would this api help? https://github.com/dotnet/corefx/issues/32588

thargy commented 6 years ago

The proposed SequenceReader is extremely heavy weight in comparison and somewhat less flexible. I could see it being built atop this proposal which is a much lower level API, working directly on a ReadOnlySequence and returning a ReadOnlySpan. This proposal is really about making ReadOnlySequence much more readily usable as is, sitting alongside the existing TryGet method; without adding more APIs to allow consumption (making it more a first class citizen).

For example using my proposal reading 4 longs is trivial. Ask for an 32 byte chunk (incredibly quick - see the benchmarks) and then using MemoryMarshal.Cast to convert the returned Span<byte>:

// Get block of 4 longs
if (this.sequence.TryGetChunk(ref position, out ReadOnlySpan<byte> output, 32, true, buffer))
{
    ReadOnlySpan<long> longs = MemoryMarshal.Cast<byte, long>(output);
    long a = longs[0];
    long b = longs[1];
    long c = longs[2];
    long d = longs[3];
}

This will complete with no allocations (unless it crosses a segment boundary, and even then it can use the optional supplied buffer), and no copies, incredibly quickly.

thargy commented 6 years ago

Instead of taking an option buffer, it could also take an optional ArrayPool<byte>?

benaadams commented 6 years ago

Instead of taking an option buffer, it could also take an optional ArrayPool<byte>?

It would also need to then return the byte[] if rented, otherwise there would be no handle to return the rented array and it would be a more expensive version of new byte[]

thargy commented 6 years ago

Oops, good point, I knew there was a reason I did it that way! Serves me right for commenting on my mobile.

JeremyKuhne commented 6 years ago

The proposed SequenceReader is extremely heavy weight in comparison and somewhat less flexible.

How is it less flexible?

I could see it being built atop this proposal which is a much lower level API

While that seems reasonable at first blush, the thing I've found is that getting the span out of a sequence is incredibly expensive. There is no way you could get even close to the performance levels of the reader when retrieving more than one span.

While I'm not opposed to this extension, I'm concerned that it will encourage writing sub-optimal parsing code. I'd be interested in seeing numbers on 1+ read(s) of spans via the reader/this api.

thargy commented 6 years ago

Thanks for responding @JeremyKuhne.

How is it less flexible?

It was a poor choice of word, and I should have expanded, but at the time I was getting tired with writing long explanations for them to be dismissed without being read, so I apologise. The proposed extension is for retrieving fixed size blocks, not variable length blocks, and I believe it can be written in a way that is optimal. I meant it makes no constraint on why you want that particular size, or what you want it for. I gave the example of a hasher which works in fixed chunks sizes, but is just as performant for reading 1/4/8/16/32 blocks chunks as it is for large chunks. In that regard it has a single clear goal that has many uses (hence flexible). It can be used to read binary fixed length data, or it could be used for reading a variable length data with a length header. Basically anything, so long as the length required is known in advance. There are many scenarios where this is the case, as not all protocols rely on delimiters (e.g. 0 terminated strings, or CR/LF terminated lines, etc.). In my own use case (xxHash64 hashing) I ask for blocks of 32 bytes which I immediately cast to a Span of 4 longs, and it is blisterringly fast.

When I glanced at the proposed API in dotnet/runtime#27522 it seemed more focus on reading data until a delimiters (i.e. variable length); which is a very different scenario.

While that seems reasonable at first blush, the thing I've found is that getting the span out of a sequence is incredibly expensive. There is no way you could get even close to the performance levels of the reader when retrieving more than one span.

That was my precise reason for proposing the extension method in the first place, as writing optimal code for reading fixed size chunks is incredibly hard to get right. For example, @pakrym (who should rightly be considered an expert) suggested a way to make the baseline example 'naive' approach faster, that actually slowed it down (at least in the benchmarks I ran)!

It feels like this would be comparing apples and oranges, as my proposal asks for a fixed length, where yours asks for a variable one. I would be intrigued to know how you made your code so fast that my proposal "can't get close". Surely, some of those tricks could be leveraged here; as the key is to provide an optimal method for getting a fixed size chunk.

While I'm not opposed to this extension, I'm concerned that it will encourage writing sub-optimal parsing code. I'd be interested in seeing numbers on 1+ read(s) of spans via the reader/this api.

Did you manage to look at the gist I provided a full set of performance benchmarks as well as a full sample implementation, and sample benchmark code? You will note that it almost never needs to allocate/rent a new memory block. I would be interested to see how poor it is in comparison to using the reader for the same scenario.

Regardless, I think you're doing my proposal a disservice, it's primary goal is not parsing variable length data, but fixed length data. The example usages I've provided show how easy it is to consume for this scenario, compared to current methods. It is extremely useful for the many protocols that provide a length prior to a variable size chunk. For example, where strings have length byte/word prior to the string data.

I haven't seen your implementation, so I couldn't and won't comment on the relative performance of your code. I did note that you talk about using GetObject()==null, which is a trick I also leverage, but may not be allowed according to @pakrym as, apparently, the method's results are entirely implementation dependant and no gaurantees are made as to the values returned.

I also note that you propose an Allocator delegate, where my code takes an optional byte[] directly. As you're API is apparently approved, would it be possible to a standardise an approach for this, e.g. placing the delegate in a more central place (e.g. beside ArrayPool or similar), so it can be leveraged by more APIs? Have you measured the impact of the delegate call in comparison to having the array pre-available? I can certainly see how your code lends itself to the delegate more, and the performance hit might be worth it for the few times it's actually used (in most use cases) in my own proposal. In my own code there was no easy way to test if the optional buffer is used.

JeremyKuhne commented 6 years ago

It feels like this would be comparing apples and oranges, as my proposal asks for a fixed length, where yours asks for a variable one.

The SequenceReader does both. You can get a fixed size span using Peek. It is, in fact, the way we handle reading base types out and never allocating: https://github.com/dotnet/corefxlab/blob/master/src/System.Buffers.ReaderWriter/System/Buffers/Reader/BufferReader_binary.cs#L23-L51

Did you manage to look at the gist I provided

Yes, and that is what I based my comments on. You simply cannot call TryGetChunk multiple times and get anywhere close to the performance of the reader- you're effectively doing most of the work of creating the reader for every iteration. When allocations aren't mixed in to your perf, doing all of the analysis and casting needed to get to the Span eats up a significant amount of cycles. In Kestrel's header parser, nearly a third of the time in parsing the request line was simply getting the sequence's Span.

It is extremely useful for the many protocols that provide a length prior to a variable size chunk. For example, where strings have length byte/word prior to the string data.

I agree that this is an important scenario. Peek does what you want, albeit with the requirement that you manage the scratch space. I actually have a Peek implementation that was originally in the proposal that just allocates for you when you don't care.

        /// <summary>
        /// Peek forward up to the number of positions specified by <paramref name="maxCount"/>.
        /// </summary>
        /// <returns>Span over the peeked data.</returns>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public ReadOnlySpan<T> Peek(int maxCount)
        {
            ReadOnlySpan<T> firstSpan = UnreadSpan;
            if (firstSpan.Length >= maxCount)
            {
                return firstSpan.Slice(0, maxCount);
            }

            // Not enough contiguous Ts, allocate and copy what we can get
            return PeekSlow(new T[maxCount]);
        }

I did note that you talk about using GetObject()==null

Yeah, that isn't fundamental to the performance gains in the reader when working with spans. It is a minor tweak I made to try and generate the SequencePosition a little faster when returning ReadOnlySequence<T> slices.

I also note that you propose an Allocator delegate

The intent is to put it in the System.Buffers namespace. As you note, it is particularly useful at the lowest level and as such I plan to iterate on that idea a bit more before actually making it public. Controlled allocation isn't blocked by it as you can work with ReadOnlySequence APIs or use Peek.

thargy commented 6 years ago

The SequenceReader does both. You can get a fixed size span using Peek. It is, in fact, the way we handle reading base types out and never allocating:

Thanks for explaining, I wasn't sure from the issue where to find the code and wasn't exactly sure how Peek was supposed to work from the issue alone.

Is the plan for this line to use the Allocator:

 return PeekSlow(new T[maxCount]);

or do you intend consumers to use Peek(Span<T> copyBuffer) to prevent allocations?

Now that I can see your code and have read around a little, I can see that you effectively keep hold of the CurrentSpan between calls, saving the call to TryGet until necessary. You've also got the benefit of micro-optimising the fast path, by splitting code into seperate xxxSlow methods I guess it's making the inlining more likely for the fast path? The same trick could be used on my sample implementation.

Yeah, that isn't fundamental to the performance gains in the reader when working with spans. It is a minor tweak I made to try and generate the SequencePosition a little faster when returning ReadOnlySequence slices.

I also see that you leverage the magic GetInteger() flag bit along with GetObject() nullability. I got into quite a fight on dotnet/runtime#27751 because these are supposed to be implementation specific. In reality, I think this reinforces my contentino that there are legitimate use cases for understanding more about a SequencePointer; though I acknowledge you expect these to be moved into the ReadOnlySequence<T> helpers, it seems clear that the API has at least some guarantees it needs respecting to prevent bugs. The nullability check had a notable impact in my sample code performance, as it prevented a TryGet under certain scenarios.

There's no way for me to know how much faster using Peek is than a properly optimised version of TryChunk, though I conceed it will almost certainly be faster, perhaps except in various edge cases (like low number of reads where the allocation of the Reader struct itself may impact results). Ultimately, 15 years of .NET tuning have made me always check my assumptions about optimisations, as you can easily be surprised when you run the benchmarks. I would contend the TryGetChunk API has the benefit of discoverability (being immediately visible on a ReadOnlySequence), though discoverability seems less a concern on performant APIs. I think it would be interesting to see if TryChunk performs better/comparable for a reasonable number of reads which might justify it's existence on it's own, and it could act as a signpost (via intellisense/documentation) to the Reader for heavier use cases?

Also, it could be renamed Peek or TryPeek to make it equivalent to the Reader method?

Finally, though I suspect it wouldn't help due to the enumerator overhead, an IEnumerable<T> GetChunks(size) on a sequence would allow the current chunk to be held between yields, and would be of value where the size doesn't change between calls (e.g. reading a series of 8 byte longs, etc. That might outperform the reader when properly optimised, though the use is far more constrained?

JeremyKuhne commented 6 years ago

Is the plan for this line to use the Allocator:

That is part of what I'm struggling over. A little extra input would help me push this forward. :) I can see trying to put this in for the first preview with the Allocator.

by splitting code into seperate xxxSlow methods I guess it's making the inlining more likely for the fast path?

Yes, and/or makes the aggressive inlined code less impactful (less bloat).

I also see that you leverage the magic GetInteger() flag bit along with GetObject() nullability. I got into

The one for Position I'm removing in the CoreFX PR I'm currently composing. I removed it in my validation PR for Kestrel https://github.com/aspnet/KestrelHttpServer/pull/3068. The first span helper is moving into the type as an internal method for now. We can start a discussion about whether or not we should make it public as a follow-up.

I think it would be interesting to see if TryChunk performs better/comparable for a reasonable number of reads which might justify it's existence on it's own

I absolutely think the decision making here should be driven by perf measurements. I'm very interested in seeing actual numbers, but my gut feel from months of grinding on this code is that even for a single read you'll not be able to get appreciably faster than creating a reader. If you see any advantage past one read I'd be very surprised. When you look at doing common things like reading out multiple intrinsic types (such as int) it will be very hard to beat the zero allocation capability of the reader.

No matter the outcome here, it would be profoundly helpful to try and validate the usefulness of the reader APIs against your scenarios. The "real world" scenarios I have at this point to vet the API are JSON and Kestrel header parsing.

thargy commented 6 years ago

That is part of what I'm struggling over. A little extra input would help me push this forward. :) I can see trying to put this in for the first preview with the Allocator.

My vote would be a standardised way for supplying an Allocator throughout the framework that can also be used by external implementors. Having a standard delegate that is accepted wherever methods 'may' require allocations is a valid approach, but really the issue here is lifetime (I came unstuck above when I forgot this).

For example an implementation that effectively rented from an ArrayPool would have to remember to return the rented buffer once it is done with, but it may not always be requested, so this will lead to implementations that supply closures, etc.

An API that accepted a delegate that returned an IMemoryOwner<T> and returned an IMemoryOwnere<T> instead of a ReadOnlySpan<T> could return a reference-counting IMemoryOwner<T> when using the existing 'Memory' and return the supplied IMemoryOwner<T> when using the delegate to retrieve one. But this is supposed to be a performant API, and I really don't think that helps. As such, I think your existing delegate proposal is a reasonable compromise already, and is a step up from taking an optional T[] as my sample does. The question I have is, in the use case I provided, renting a buffer in advance and reusing it allows accepting an array instead of a delgate avoid the occasional delegate call. Traditionally, we could expose both methods (accept an array or a delegate) via an overload, but that will create bloat, so evaluating the actual performance impact might be a worthwhile exercise as I doubt it justifies having both approaches available.

If you do adopt your delegate approach having a way for ArrayPool<T> and MemoryPool supply one would be really nice, but what would that look like? Maybe something like -

// Factory interface
public interface IAllocatorFactory<T> : IDisposable
{
    // Matches Allocator delegate
     Span<T> Allocator(int length);
}

// To use we get an allocator factory using an extension method
// The factory tracks any memory allocated based on calls and releases
// on disposal.  A similar extension could be available on MemoryPool
using (IAllocatorFactory<T> factory = ArrayPool<T>.Shared.GetAllocatorFactory())
{
    sequence.Peek(...., factory.Allocator);
}

Clearly there are allocations going on here (though the underlying AllocatorFactory implementation could potentially be a struct), and there's interface boxing. The overhead might be negligible when making lots of Peek calls though?

The biggest problem I have with this, is that you're highly likely to see code like:

using (IAllocatorFactory<T> factory = ArrayPool<T>.Shared.GetAllocatorFactory())
{
    do
    {
        sequence.Peek(...., factory.Allocator);
    } while (...);
}

Which could needlessly hold onto multiple allocated buffers as it creates new buffers rather than reuse the buffer in each loop.

To do it properly without help you'd probably see code like

T[] buffer = null;
Allocator allocator = i => 
    { 
        if (!(buffer is null)) 
        {
            // Reuse existing buffer
            if (buffer.Length >= i) return buffer.Slice(0,i);

            // Return existing buffer
            ArrayPool<T>.Shared.Return(buffer);
        }

        // Rent a buffer
        buffer = ArrayPool<T>.Shared.Rent(i)
        return buffer.Slice(0, i);
    };

do {
        sequence.Peek(...., allocator);
} while (...);

// Clean-up any allocated buffer
if (!(buffer is null)) ArrayPool<T>.Shared.Return(buffer);

(Forgive all the inevitable typos, etc. as I'm writing this straight into Git as I don't have my dev environment accessible to test these sample snippets, these are just for illustration)

This would work fine were you're happy to re-use buffers between calls.

You could allow for that by making the behaviour of IAllocatorFactory always create a single, growable, buffer, or you could make it optional, e.g. GetAllocatorFactory(bool reuse) could return one of either ReusableAllocatorFactory when reuse was true or AllocatorFactory when false both of which implement IAllocatorFactory. ReusabelAllocatorFactory would only ever need to hold a single buffer, so could readily be a struct. AllocatorFactory, in contrast, would require a growable list/bag of buffers. You could also have two different extension methods, e.g. GetReusableAllocatorFacotry or similar.

The beauty of an interface if you can extend to any kind of strategy you like. And if you look at the kind of code people might write to use an Allocator delegate effectively in practice, there's a pretty good case to provide a factory for them which can be carefully tuned. Realistically it wouldn't be uncommon to see people calling Array<T>.Shared.Rent() and not calling Return correctly.

A simple default IAllocatorFactory, could effectively be an 'empty' struct (yeah I know it gets padded) with an empty Dispose() implementation, that just returns a new array of T. Maybe call it DefaultAllocatorFactory or HeapAllocatorFactory or GCAllocatorFactory? A DebugAllocatorFactory or TestAllocatorFactory could include a counter for every time an allocation was requested, a LoggingAllocatorFactory could log allocations, etc.

The other benefit of having Allocator factories, is that it would give a logical place to put the Allocator delegate!

All of this is a little similar to IMemoryOwner<T> and, if you knew in advance how big a buffer was required (which your code doesn't always so this is moot) then an implementation of that interface that only got the buffer when Memory was accessed would work.

All of this is a bit off topic for this case, but I'd be nervous about raising it as an API suggestion as I've pretty much struck out so far with contributions 🤷‍♂️

JeremyKuhne commented 6 years ago

@thargy Sorry about the late response, little pressed for time the past couple days.

Note that I yanked Peek() and introduced TryCopy() based on feedback from our architects. It makes no allocation straight reading a two step process, but that was what I was doing anyway in my usages of Peek() (check UnreadSpan, fall back to TryCopy()).

With an Allocator in place I can see adding a convenience TryRead(int count, Allocator allocator = null). I'd personally also like something like ReadUpTo(int maxCount, Allocator allocator = null).

The key feedback I got from the design review is that people would much rather have each of the methods that might allocate take the Allocator rather than the main constructor.

Beyond that the big things for me are that we don't make it super hard to avoid allocating and that we're somewhat useable/readable.

An interface is nice and it would be possible to avoid boxing, but not without some uncomfortable syntax and API/type bloat. You have to do something like:

public bool TryReadTo<TAllocator>(out ReadOnlySpan<T> span, T delimiter, TAllocator allocator, bool advancePastDelimiter = true)
    where TAllocator : IAllocator
{
    span = allocator.Allocate(42);
    return false;
}

All of this is a bit off topic for this case, but I'd be nervous about raising it as an API suggestion as I've pretty much struck out so far with contributions

I think the thing in this case is to try to just experiment in https://github.com/dotnet/corefxlab/. I plan to do just that around the allocator concepts with the BufferReader code- I'd be happy to have you throw in some prototypes as well. We can go through the various approaches and measure perf / allocation impacts to generate an official proposal.

Don't get too disheartened by "striking out". Helping push things forward requires a lot of abandoned code/ideas, but that isn't a loss or wasted time. It's all making .NET a better platform. Sometimes abandoned/rejected ideas are even more important than what gets shipped, imho. I personally consider everything that goes on in PR's / issues part of what we "ship".