API Proposal: Add SpanReader<T> that aligns with SequenceReader<T>

JeremyKuhne commented 3 years ago

Background and Motivation

SequenceReader<T> was fine tuned to give the best performance for ReadOnlySequence<T> and explicitly didn't support a second state constructed around a ReadOnlySpan<T> directly.

There is no practical way to turn a ReadOnlySpan<T> into a ReadOnlySequence<T>. Providing a reader that follows a similar pattern when reading ReadOnlySpan<T> would help people reading spans directly and would allow further sub-parsing of spans returned from SequenceReader<T> methods.

Proposed API

The proposed API follows along with SequenceReader<T> with the following differences:

No ReadOnlySequence<T> methods.
Lengths are int rather than long as spans are constrained to int.

namespace System.Buffers
{
    public ref struct SpanReader<T> where T: unmanaged, struct, IEquatable<T>
    {
        public SpanReader(ReadOnlySpan<T> span);
        public int Consumed { get; }
        public bool End { get; }
        public int Length { get; }
        public int Remaining { get; }
        public ReadOnlySpan<T> UnreadSpan { get; }
        public void Advance(int count);
        public int AdvancePast(T value);
        public int AdvancePastAny(T value0, T value1);
        public int AdvancePastAny(T value0, T value1, T value2);
        public int AdvancePastAny(T value0, T value1, T value2, T value3);
        public int AdvancePastAny(ReadOnlySpan<T> values);
        public void AdvanceToEnd();
        public bool IsNext(T next, bool advancePast = false);
        public bool IsNext(ReadOnlySpan<T> next, bool advancePast = false);
        public void Rewind(int count);
        public bool TryAdvanceTo(T delimiter, bool advancePastDelimiter = true);
        public bool TryAdvanceToAny(ReadOnlySpan<T> delimiters, bool advancePastDelimiter = true);
        public bool TryCopyTo(Span<T> destination);
        public bool TryPeek(out T value);
        public bool TryPeek(int offset, out T value);
        public bool TryRead(out T value);
        public bool TryReadTo(out ReadOnlySpan<T> span, ReadOnlySpan<T> delimiter, bool advancePastDelimiter = true);
        public bool TryReadTo(out ReadOnlySpan<T> span, T delimiter, bool advancePastDelimiter = true);
        public bool TryReadTo(out ReadOnlySpan<T> span, T delimiter, T delimiterEscape, bool advancePastDelimiter = true);
        public bool TryReadToAny(out ReadOnlySpan<T> span, ReadOnlySpan<T> delimiters, bool advancePastDelimiter = true);
    }

    public static class SpanReaderExtensions
    {
        TryReadBigEndian(this ref SpanReader<byte>, out short);
        TryReadBigEndian(this ref SpanReader<byte>, out int);
        TryReadBigEndian(this ref SpanReader<byte>, out long);
        TryReadLittleEndian(this ref SpanReader<byte>, out short);
        TryReadLittleEndian(this ref SpanReader<byte>, out int);
        TryReadLittleEndian(this ref SpanReader<byte>, out long);
    }
}

Usage Examples

SpanReader<byte> reader = new(data);
if (!reader.TryAdvanceTo(Identifiers.StartOfText, advancePastDelimiter: false))
{
    throw new InvalidDataException("No start text (STX) control character found.");
}

ushort checksum = CalculateFileChecksum(reader.UnreadSpan);
reader.Advance(1);

if (!reader.TryReadTo(out var designSpecification, Identifiers.FieldTerminator))
{
    throw new InvalidDataException("Did not find design specification.");
}

info.DesignSpecification = Encoding.ASCII.GetString(designSpecification);
reader.AdvancePastAny(Identifiers.LineSeparators);

Alternative Designs

Adding direct ReadOnlySpan<T> support to SequenceReader<T> is technically possible but would have a negative performance impact that is measurable in key web scenarios. It also would be hard for consumers to understand the implications of calling overloads that have ReadOnlySequence<T> outputs (do we copy to an array?).

ghost commented 3 years ago

Tagging subscribers to this area: @tannergooding, @pgovind, @GrabYourPitchForks See info in area-owners.md if you want to be subscribed.

Issue Details

## Background and Motivation `SequenceReader` was fine tuned to give best performance for `ReadOnlySequence` and explicitly didn't support a second state constructed around a `ReadOnlySpan` directly. There is no practical way to turn a `ReadOnlySpan` into a `ReadOnlySequence`. Providing a reader that follows a similar pattern when reading `ReadOnlySpan` would help people reading spans directly and would allow further sub-parsing of spans returned from `SequenceReader` methods. ## Proposed API The proposed API follows along with `SequenceReader` with the following differences: - No `ReadOnlySequence` methods. - Lengths are `int` rather than `long` as spans are constrained to `int`. ``` c# namespace System.Buffers { public ref struct SpanReader where T: unmanaged, struct, IEquatable { public SpanReader(ReadOnlySpan span); public int Consumed { get; } public bool End { get; } public int Length { get; } public int Remaining { get; } public ReadOnlySpan UnreadSpan { get; } public void Advance(int count); public int AdvancePast(T value); public int AdvancePastAny(T value0, T value1); public int AdvancePastAny(T value0, T value1, T value2); public int AdvancePastAny(T value0, T value1, T value2, T value3); public int AdvancePastAny(ReadOnlySpan values); public void AdvanceToEnd(); public bool IsNext(T next, bool advancePast = false); public bool IsNext(ReadOnlySpan next, bool advancePast = false); public void Rewind(int count); public bool TryAdvanceTo(T delimiter, bool advancePastDelimiter = true); public bool TryAdvanceToAny(ReadOnlySpan delimiters, bool advancePastDelimiter = true); public bool TryCopyTo(Span destination); public bool TryPeek(out T value); public bool TryPeek(int offset, out T value); public bool TryRead(out T value); public bool TryReadTo(out ReadOnlySpan span, ReadOnlySpan delimiter, bool advancePastDelimiter = true); public bool TryReadTo(out ReadOnlySpan span, T delimiter, bool advancePastDelimiter = true); public bool TryReadTo(out ReadOnlySpan span, T delimiter, T delimiterEscape, bool advancePastDelimiter = true); public bool TryReadToAny(out ReadOnlySpan span, ReadOnlySpan delimiters, bool advancePastDelimiter = true); } public static class SpanReaderExtensions { TryReadBigEndian(this ref SpanReader, out short); TryReadBigEndian(this ref SpanReader, out int); TryReadBigEndian(this ref SpanReader, out long); TryReadLittleEndian(this ref SpanReader, out short); TryReadLittleEndian(this ref SpanReader, out int); TryReadLittleEndian(this ref SpanReader, out long); } } ``` ## Usage Examples ``` C# SpanReader reader = new(data); if (!reader.TryAdvanceTo(Identifiers.StartOfText, advancePastDelimiter: false)) { throw new InvalidDataException("No start text (STX) control character found."); } ushort checksum = CalculateFileChecksum(reader.UnreadSpan); reader.Advance(1); if (!reader.TryReadTo(out var designSpecification, Identifiers.FieldTerminator)) { throw new InvalidDataException("Did not find design specification."); } info.DesignSpecification = Encoding.ASCII.GetString(designSpecification); reader.AdvancePastAny(Identifiers.LineSeparators); ``` ## Alternative Designs Adding direct `ReadOnlySpan` support to `SequenceReader` is technically possible but would have a negative performance impact that is measurable in key web scenarios. It also would be hard for consumers to understand the implications of calling overloads that have `ReadOnlySequence` outputs (do we copy to an array?).

Author:	JeremyKuhne
Assignees:	-
Labels:	`api-suggestion`, `area-System.Buffers`, `untriaged`
Milestone:	-

omariom commented 3 years ago

public int Length { get; }

It could be nint. It supports C# arithmetic operators.

teo-tsirpanis commented 3 years ago

It could be nint.

A span's length is a 32-bit integer. Consequently, a SpanReader's length will never exceed int.MaxValue.

scalablecory commented 3 years ago

Doesn't seem to be much value here?

JeremyKuhne commented 3 years ago

@scalablecory Can you elaborate? I found having something to track "position" in parsing a span extremely useful and manually doing so cumbersome and error-prone.

JeremyKuhne commented 3 years ago

Note that this is something that came up in development of the original SequenceReader (which I wrote) that I intended to follow up on. Shifting teams caused this to back-burner for awhile. I'm using my own implementation of this in things I'm working on which motivated me to finally get this proposal created.

scalablecory commented 3 years ago

I found having something to track "position" in parsing a span extremely useful and manually doing so cumbersome and error-prone ... I'm using my own implementation of this in things I'm working on which motivated me to finally get this proposal created.

Can you give some usage examples showing how this reduces code complexity?

SequenceReader is useful because it crosses segment boundaries, but this looks like it'll just be a trivial wrapper around existing APIs.

JeremyKuhne commented 3 years ago

Can you give some usage examples showing how this reduces code complexity?

The key thing is that to do the same thing I'm showing in the proposal is that you would have to keep slicing yourself (or trying to not mess up your current index when you're slicing). But beyond that the API patterns are very convenient for parsing and aren't just single method calls. Additionally you can't back up a step easily when slicing over and over.

int index = data.IndexOf(Identifiers.StartOfText);
if (index == -1)
{
    throw new InvalidDataException("No start text (STX) control character found.");
}

data = data.Slice(index);
ushort checksum = CalculateFileChecksum(data);
data = data.Slice(1);

index = data.IndexOf(Identifiers.FieldTerminator);

if (index == -1)
{
    throw new InvalidDataException("Did not find design specification.");
}

info.DesignSpecification = Encoding.ASCII.GetString(data.Slice(0, index));
data = data.Slice(index + 1);

// etc... ugh

Updated the sample because I messed it up the first time.

FiniteReality commented 3 years ago

The main advantage of an API like this is that it makes the "read and move" code a lot simpler. For example:

if (!reader.TryReadByte(out var protocolVersion))
    throw new InvalidOperationException("Could not read protocol version");

int size = 0;

if (protocolVersion == 1)
{
    if (!reader.TryReadInt16LittleEndian(out var sizeShort))
        throw new InvalidOperationException("Failed to read packet size");
    size = sizeShort;
}
else if (protocolVersion == 2)
{
    if (!reader.TryReadInt32LittleEndian(out size))
        throw new InvalidOperationException("Failed to read packet size");
}

if (!reader.TryGetSpan(size, out var span))
    throw new InvalidOperationException("Failed to get packet data");

instead of:

if (span.Length < 1)
    throw new InvalidOperationException("Could not read protocol version");

var protocolVersion = span[0];
int size = 0;

if (protocolVersion == 1)
{
    if (!BinaryPrimitives.TryReadInt16LittleEndian(span, out var size))
        throw new InvalidOperationException("Failed to read packet size");

    span = span.Slice(sizeof(short));
}
else if (protocolVersion == 2)
{
    if (!BinaryPrimitives.TryReadInt32LittleEndian(span, out var size))
        throw new InvalidOperationException("Failed to read packet size");

    span = span.Slice(sizeof(short)); // bug!
}

if (span.Length < size)
    throw new InvalidOperationException("Failed to get packet data");

In the second snippet, it's very easy to accidentally introduce bugs as you have to explicitly advance the span. I've done this many times in my own code, and have ended up writing at least one SpanReader type to minimise these bugs.

zlatanov commented 3 years ago

I've myself had to use such a tool multiple times, but it was extremely easy to add the read methods I needed as extension methods.

using System;

ReadOnlySpan<byte> buffer = new byte[1000];
Console.WriteLine(buffer.ReadInt32());

static class ReadOnlySpanExtensions
{
    public static int ReadInt32(ref this ReadOnlySpan<byte> buffer)
    {
        int number = BitConverter.ToInt32(buffer);
        buffer = buffer[4..];

        return number;
    }
}

omariom commented 3 years ago

A span's length is a 32-bit integer. Consequently, a SpanReader's length will never exceed int.MaxValue.

Should Span start supporting more than int.MaxValue elements, nothing would need to change in the API.

teo-tsirpanis commented 3 years ago

Span itself would never support more elements because it would have been a breaking change for all the developers that use 32-bit integers to store the index while looping one.

Supporting more elements would need a LargeSpan<T> type with a nint or long Length, ~~and consequentially a LargeSpanReader<T> type to match it~~.

SpanReader could have its length in a nint, and once and if LargeSpan got added, it would transparently support both. But that depends on whether the .NET team is willing to add a LargeSpan type in the future.

Joe4evr commented 3 years ago

What the shape/solution could be for arrays/spans larger than Int32.MaxValue elements is already being discussed at length (pun semi-intended) at #12221 and currently that still seems to be a mostly speculative thread.

On topic: Seems like the shape is good enough for most cases. It probably won't replace the one usecase where I need custom look-ahead logic, but other cases can clean up nicely.

JeremyKuhne commented 3 years ago

It probably won't replace the one usecase where I need custom look-ahead logic

@joe4evr Can you elaborate on this? The current surface area was very much scenario driven- we came up with the original surface area (with SequenceReader) by trying to develop the System.Text.Json reader on it. Would be nice to continue to evolve these structs based on real scenarios where possible. :)

As to the large span scenario I'd be hesitant to respond to design that isn't finished. Wouldn't want to use nint if things eventually settled on long, for example. I'm also presuming getting said feature is a non-trivial amount of time in the future. Having a LargeSpanReader isn't a terrible thing if it means we can get something we can use in the near term.

JeremyKuhne commented 3 years ago

Here is the implementation: https://github.com/JeremyKuhne/runtime/commit/86ff403820b887d0216678c70b5f8a96c0fc458e

Joe4evr commented 3 years ago

@Joe4evr Can you elaborate on this?

Yeah, so my case is in parsing a chunk of text when encountering an open-brace char, I have a function that looks ahead to find the matching close-brace, keeping track of any nested open/close-brace pairs that may exist within, and returns the slice inside of the braces.

From a glance at the proposed API, I'm not sure if that's doable straight away, or even common enough to have it baked into the implementation (since I presume this is intended to be high-perf/"low-fat").

JeremyKuhne commented 3 years ago

@Joe4evr thanks for the scenario. I'll give it some thought as this is a not uncommon parsing scenario. Things get a bit awkward if you also support escaping so it isn't immediately clear what the surface area would look like.

Joe4evr commented 3 years ago

I do have escaping included, yes:

for (int i = startIdx; i < span.Length; i++)
{
    char current = span[i];
    if (current == '\\')
    {
        i += 1;
        continue;
    }

    // ....
}

wstaelens commented 2 years ago

Is there already a spanreader available? Would be very useful instead of manually keeping track of the current slice position. Passing a spanreader would be a lot easier 👍

wstaelens commented 1 year ago

Bump Is there already a SpanReader available?

wstaelens commented 11 months ago

Bump Is there already a SpanReader available?

https://sakno.github.io/dotNext/versions/2.x/api/DotNext.Buffers.SpanReader-1.html :-)

Trolldemorted commented 2 weeks ago

@JeremyKuhne was there a particular reason for SequenceReaderExtensions to only contain methods for signed integer types, and could we get unsigned variants here (and ideally also there)?

dotnet / runtime