davidfowl commented 5 years ago

These text parsers are very convenient when parsing text based protocols (like HTTP/1.1 and RESP). It would be great to have overloads of the existing APIs that support ReadOnlySequence<byte>. I end up writing code like this:

private static bool TryParse(in ReadOnlySequence<byte> bytes, out int value, out int bytesConsumed)
{
    if (bytes.IsSingleSegment)
    {
        if (!Utf8Parser.TryParse(bytes.FirstSpan, out value, out bytesConsumed))
        {
            return false;
        }
    }
    else
    {
        Span<byte> textSpan= stackalloc byte[128];
        bytes.CopyTo(textSpan);

        if (!Utf8Parser.TryParse(textSpan, out value, out bytesConsumed))
        {
            return false;
        }
    }

    return true;
}

I would be great if this was built into the existing parser.

public partial static class Utf8Parser
{
    public static bool TryParse(in ReadOnlySequence<byte> source, out bool value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out byte value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out DateTime value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out DateTimeOffset value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out decimal value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out double value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out Guid value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out short value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out int value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out long value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out sbyte value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out float value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out TimeSpan value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out ushort value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out uint value, char standardFormat = '\0');
    public static bool TryParse(in ReadOnlySequence<byte> source, out ulong value, char standardFormat = '\0');
}

cc @ahsonkhan @GrabYourPitchforks

ahsonkhan commented 5 years ago

Should we consider adding these APIs on SequenceReader instead (or potentially in addition to)?

Something like these extension methods: https://github.com/dotnet/corefxlab/blob/94248fb2ea9881936d6338b0d2d62ca8c79cbd79/src/System.Buffers.ReaderWriter/System/Buffers/Reader/BufferReader_text.cs#L239-L275

The bytesConsumed parameter doesn't make too much sense for a SequencePosition. How about consumedPosition (or sequenceConsumed, consumedUntil, etc.)?

cc @JeremyKuhne

davidfowl commented 5 years ago

I originally had that but the problem is that with text based protocols are usually delimited. For example, say you wanted to parse an HTTP/1.1 response (e.g. HTTP/1.1 200 OK\r\n). You want to write code that looks like this:

// HTTP/1.1
if (!sequenceReader.TryReadTo(out ReadOnlySpan<byte> version, (byte)' '))
{
    return;
}

// 200
if (!sequenceReader.TryReadTo(out ReadOnlySpan<byte> statusCodeText, (byte)' '))
{
    return;
}
else if (!Utf8Parser.TryParse(statusCodeText, out int statusCode, out _))
{
    return;
}

// OK
if (!sequenceReader.TryReadTo(out ReadOnlySequence<byte> statusText, NewLine))
{
    return;
}

If we add these parser APIs, how do you avoid parsing the status code into an integer until you have the full payload?

// HTTP/1.1
if (!sequenceReader.TryReadTo(out ReadOnlySpan<byte> version, (byte)' '))
{
    return;
}

// 200
// This code is incorrect because it'll parse 2, 20, and 200 successfully. So we need to scan for a space before we start parsing to make sure we have the full payload.
if (!sequenceReader.TryParse(out int statusCode))
{
    return;
}

// OK
if (!sequenceReader.TryReadTo(out ReadOnlySequence<byte> statusText, NewLine))
{
    return;
}

GrabYourPitchforks commented 5 years ago

(Standard grumbling that this belongs on an AsciiParser class, not a Utf8Parser class, since presumably you want these APIs to fail if they encounter non-ASCII data.)

The APIs seem reasonable. We'd have to be a little careful with the implementation because each method would need to know the worst-case input length based on the out T and the char standardFormat parameters. It's likely that any introduced APIs will have bugs in edge case handling, especially for types like double where the input buffer could be arbitrarily long. That doesn't mean we shouldn't do them, but it does potentially mean we should consider carefully which overloads we want to ship with.

scalablecory commented 5 years ago

I'd love to find a more general way to tackle this. Adding these convenience overloads for everything parsing a Span will get tedious, and it would be tacit encouragement to parse ReadOnlySequence directly.

scalablecory commented 5 years ago

I don't know if I'd want it in corefx but an extension on ReadOnlySequence could make this simple:


// return FirstSpan if single element, otherwise copy to storage and return storage.
ReadOnlySpan<byte> ReadOnlySequence.MaybeCopyTo(Span<byte> storage);

Utf8Parser.TryParse(byteSequence.MaybeCopyTo(stackalloc byte[128]), out int statusCode, out int bytesConsumed);

davidfowl commented 5 years ago

I'd love to find a more general way to tackle this. Adding these convenience overloads for everything parsing a Span will get tedious, and it would be tacit encouragement to parse ReadOnlySequence directly.

Case by case seems fine but if we're looking at language features then I rather not block that to add these very simple overloads that I keep re-writing 😄 .

davidfowl commented 5 years ago

The APIs seem reasonable. We'd have to be a little careful with the implementation because each method would need to know the worst-case input length based on the out T and the char standardFormat parameters. It's likely that any introduced APIs will have bugs in edge case handling, especially for types like double where the input buffer could be arbitrarily long. That doesn't mean we shouldn't do them, but it does potentially mean we should consider carefully which overloads we want to ship with.

We're just delegating to the Span overloads for most of these (I would assume all check for overflow). Isn't double the only special case? Doesn't it already have to be handled by the span overload?

GrabYourPitchforks commented 5 years ago

Isn't double the only special case? Doesn't it already have to be handled by the span overload?

Not sure if it's the only special case. And the span overloads can assume that they have all the data, which means that edge case handling is simpler. As an example of something that might be error-prone: what happens if the input is a 100GB stream consisting of all zeroes and you pass such a stream to the double parser? We can't delegate to the span parser because it can't handle payloads that large.

Again, not an argument that we shouldn't do it, but more of an admission that the more we try to support the higher the odds of this being a bug farm. So we'd want to weigh carefully the tradeoffs of having an API that works directly with sequences vs. telling customers to find the delimiters themselves, slice, and use the span-based APIs.

davidfowl commented 5 years ago

Not sure if it's the only special case. And the span overloads can assume that they have all the data, which means that edge case handling is simpler.

Why can't these overloads assume the same thing?

As an example of something that might be error-prone: what happens if the input is a 100GB stream consisting of all zeroes and you pass such a stream to the double parser? We can't delegate to the span parser because it can't handle payloads that large.

What happens if you do that for the span overload?

GrabYourPitchforks commented 5 years ago

What happens if you do that for the span overload?

If you pass int.MaxValue worth of zeroes to the span overload? It should handle them just fine and return 0.0.

davidfowl commented 5 years ago

Constrain the input size to int.MaxValue for ReadOnlySequence

jkotas commented 5 years ago

For example, say you wanted to parse an HTTP/1.1 response (e.g. HTTP/1.1 200 OK\r\n). You want to write code that looks like this:

Is this example how you imagine writing production quality high-performance HTTP response parser, e.g. are you going to use these APIs in Kestrel if we add them?

davidfowl commented 5 years ago

That would be ideal if the performance drop wasn't terrible. We started to write an HTTP client for performance testing here https://github.com/sebastienros/httpsocket/blob/38f3ee6d4a357fa60fec4a8b6cf802ce7e921924/src/HttpSocket.cs#L146-L175 and it uses the above code.

jkotas commented 5 years ago

So what is this code fragment going to look like with this API?

davidfowl commented 5 years ago

I would use the ReadOnlySequence overload instead I get the data instead of ReadOnlySpan.

davidfowl commented 5 years ago

This isn’t the main scenario though, it’s when you have parsed some sequence of date then you need to further parse and slice to get the right section, then you want to turn it into some primitive

jkotas commented 5 years ago

And would you still use SequenceReader? I think we have way too many of these parsing helper APIs and it is getting very difficult to figure out what is the right combination to use for any given situation.

davidfowl commented 5 years ago

And would you still use SequenceReader? I think we have way too many of these parsing helper APIs and it is getting very difficult to figure out what is the right combination to use for any given situation.

I don't think it's complicated (having written a bunch of these parsers now). The main problem is that nothing takes a ReadOnlySequence so once you've parsed and narrowed your search, you're left with the complicated logic to materialize the bytes into something else. It's very much like trying to use Span<T> on .NET Standard 2.0.

e.g. https://github.com/aspnet/AspNetCore/blob/271ebe019823e8e6a751429cd350c2db76ebab05/src/Http/WebUtilities/src/FormPipeReader.cs#L316

(we just added these thanks @GrabYourPitchforks!)

e.g. https://github.com/sebastienros/httpsocket/blob/38f3ee6d4a357fa60fec4a8b6cf802ce7e921924/src/HttpSocket.cs#L254-L260

This is the perfect example, parsing a chunked prefix in HTTP/1.1. The payload looks like this:

a\r\n
Helloworld\r\n
0\r\n
\r\n

I want to parse the hex number (a for example) after grabbing all text up to the newline.

GrabYourPitchforks commented 5 years ago

So the code pattern is that the caller slices the input based on some delimiter (such as CRLF), then passes the sliced sequence to these parsing APIs? That is somewhat different than the behavior of the span-based APIs, which consume input until a delimiter is encountered.

davidfowl commented 5 years ago

So the code pattern is that the caller slices the input based on some delimiter (such as CRLF), then passes the sliced sequence to these parsing APIs?

In these particular examples that are handling streaming data it has to be done this way. Parsing partial integers would succeed so we need to get the entire "frame" first.

That is somewhat different than the behavior of the span-based APIs, which consume input until a delimiter is encountered.

Yes, I guess that's why the bytesConsumed parameter exists. Generally though this approach applies here as well when using these APIs in the same way. I now see your original concern about overflow and edge cases, but that can be mitigated by having a default limit on the input (even though it's a ReadOnlySequence).

GrabYourPitchforks commented 5 years ago

So what we're really talking about is a behavioral equivalent to int.TryParse, not Utf8Parser.TryParse(ROS<byte>, out int)?

davidfowl commented 5 years ago

@GrabYourPitchforks I think that would be reasonable yes. That way we can call safely delegate to the the Span overloads. I'd prefer if we kept these implementations super cheap to implement. So far we've scoped it:

Fully formed primitives (no partial data allowed)
Limiting the max input size (to something reasonable to prevent abuse or weird edge cases)

Do we also need to add these overloads for the ReadOnlySpan<byte> variant (maybe it could be a different request)? I've updated the issue to reflect this.

GrabYourPitchforks commented 5 years ago

Just to set baseline expectations, these are not going to be cheap to implement if you're looking for behavioral parity with existing TryParse methods. Any length restriction we impose would be artificial (and honestly could be done by the caller without us writing a helper method), so the likely result is that we'll end up duplicating the existing parsing logic into these methods.

davidfowl commented 5 years ago

Just to set baseline expectations, these are not going to be cheap to implement if you're looking for behavioral parity with existing TryParse methods. Any length restriction we impose would be artificial (and honestly could be done by the caller without us writing a helper method), so the likely result is that we'll end up duplicating the existing parsing logic into these methods.

OK, well that's unfortunate but I don't really have any push back. The only unfortunate thing here basically leads back to what @scalablecory mentioned. There's no way to have an efficient byte enumerator that would be more efficient with span and less efficient with ReadOnlySequence/SequenceReader. That might be worth looking into if we end up needing to duplicate all of these methods.

@AndyAyersMS @jaredpar @jkotas Here's a question for you both:

bool TryParseInt(ByteEnumerator enumerator, out int value)
{
   int result = 0;
   while (enumerator.TryGetNext(out byte b))
   {
      var digit = b - '0';
      result = result * 10 + digit;
   }
   return result;
}

We need a language and JIT and compiler feature that lets me write code like the above but can then turn it into the equivalent of:

bool TryParseInt(Span<byte> span, out int value)
{
   int result = 0;
   for (var i = 0; i < span.Length; i++)
   {
     var digit = span[i] - '0';
     result = result * 10 + digit;
   }
   return result;
}

If the ByteEnumerator is a Span<byte>

GrabYourPitchforks commented 5 years ago

Agree - having something like the ref interface proposal would definitely help here. Then we could potentially reuse the same logic for Span and ReadOnlySequence and rely on generic specialization to make things more efficient.

scalablecory commented 5 years ago

@davidfowl @GrabYourPitchforks C++ enables you to do this with templates and iterators and I've employed it very successfully in the past for high-perf parsing. I'd love it we could get better parity there.

When I tried to do this myself in C# I hit two roadblock features I needed:

enable ref struct to implement interfaces.
add a ref struct generic constraint that enforces "you must use this generic as if it were a ref struct". Then you can do e.g.

ref struct SpanEnumerator : IByteEnumerator
{
   // ...
}

struct SequenceEnumerator : IByteEnumerator
{
   // ...
}

bool TryParseHelper<ByteEnumerator>(ByteEnumerator enumerator, out int value) where ByteEnumerator : IByteEnumerator, ref struct
{
   // ...
}

bool TryParse(ReadOnlySequence<T> sequence, out int value)
{
   return TryParseHelper(CreateSpanEnumerator(sequence.FirstSpan), out value)
      || TryParseHelper(CreateSequencEnumerator(sequence), out value);
}

I briefly touched on this in https://github.com/dotnet/csharplang/issues/1479#issuecomment-463488171

C++ also makes it really easy via traits/ADL/SFINAE/etc. to automatically select e.g. the most efficient IndexOf for an iterator type -- I'd love that for this as well, but that is significantly more complex of a feature to ask for :)

jkotas commented 5 years ago

I agree ref interfaces or ref constrains would be useful.

I have doubts whether the specific helper methods discussed here are useful enough to be in the core framework. I believe that there are two mainstream categories of parsers to worry about:

High-performance production-grade parsers always work against buffers for number of difference reasons: look ahead, context to give good error messages, etc. Helper method like these are not useful for high-performance production-grade parsers.
Lower-performance handrolled to get the job done. The building block for these is typically a reader object (like TextReader) that is optimized for convenience, discovery and ease of use.

The method proposed here do not seem to be either category.

davidfowl commented 5 years ago

@jkotas I agree that ReadOnlySequence<T> isn't performant for parsing. It's job was to be an representation of a single or multi-segment buffer that would be as efficient as possible while in a single buffer, but pay the cost when you needed to span buffers. While that came with some tradeoffs, we have this type in the BCL now and I'd like more ways to interact with them in the BCL natively without having to write helpers to do the most basic parsing.

Helper method like these are not useful for high-performance production-grade parsers.

That's subjective. These helpers could be very fast in the case of parsing a single segment ROS and slower when we need to cross buffers. Once you've decided to represent a buffer with this type you're already in that world so I don't think this argument holds true.

No matter how you feel about that, users have to write the above code today and I want to remove that need in the next version.

davidfowl commented 5 years ago

Any length restriction we impose would be artificial (and honestly could be done by the caller without us writing a helper method)

@GrabYourPitchforks prior ART https://github.com/dotnet/corefxlab/blob/94248fb2ea9881936d6338b0d2d62ca8c79cbd79/src/System.Buffers.ReaderWriter/System/Buffers/Reader/BufferReader_text.cs#L15

We'll see if anybody complains 😄

EDIT: Wait, that was corefxlab 😄

jkotas commented 5 years ago

That's subjective. These helpers could be very fast

It is not just about performance. It is also about usability. What would help to convince me if you show me a delta how these methods will be used in Utf8JsonReader, Kestrel (or some other production grade parser). The motivating example was from a test code and it was not very convincing.

No matter how you feel about that, users have to write the above code today and I want to remove that need in the next version.

We should focus on designing scenario: Parsing data coming from ReadOnlySequence. We have a SequenceReader for that today. I find it confusing that we are not improving SequenceReader and instead trying to add an alternative way for parsing ReadOnlySequence that is not SequenceReader . How are the users of the platform supposed to pick the right way to parse the ReadOnlySequence ?

Compare this with the simple story for TextReader that does not have ambiguity like this.

davidfowl commented 5 years ago

It is not just about performance. It is also about usability.

The general pattern for using SequenceReader is to the methods to narrow down a Span or Sequence of content to further parse. We have many examples of production parsers already described in this thread.

Kestrel's header parser https://github.com/aspnet/AspNetCore/blob/dba39d60c89b743d45411ba6bc6b02b5a8019032/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#L180 (ignore the pointers, those will go away soon 😄)

The Form Parser https://github.com/aspnet/AspNetCore/blob/dba39d60c89b743d45411ba6bc6b02b5a8019032/src/Http/WebUtilities/src/FormPipeReader.cs#L219

The motivating example was from a test code and it was not very convincing.

I don't really understand why it wasn't, pretend that code was in HttpClient, that's what we're currently building. It happens to be a client used for performance testing but I'm not sure why that makes a difference.

We should focus on designing scenario: Parsing data coming from ReadOnlySequence. We have a SequenceReader for that today. I find it confusing that we are not improving SequenceReader and instead trying to add an alternative way for parsing ReadOnlySequence that is not SequenceReader . How are the users of the platform supposed to pick the right way to parse the ReadOnlySequence ?

I thought I clarified that but maybe I was unclear. First the specific example:

When parsing an HTTP/1.1 response start line, the tokens are split by spaces then a CRLF. The second token is an integer.

Can you tell me how that parsing would look if we added TryParseInt to SequenceReader?

// Find the space
if (sequenceReader.TryReadTo(out ReadOnlySequence<byte> statusCodeText, (byte)' '))
{
     // Parse the int
     var subReader = new SequenceReader<byte>(statusCodeText);
     if (subReader.TryParseInt(out statusCode))
     {
     }
}

That's syntactically noisy, expensive and unnecessary.

Compare this with the simple story for TextReader that does not have ambiguity like this.

TextReader can a read a line, the entire payload or blocks or chars. It really doesn't do anything like what we're describing here (other than SequenceReader and TextReader sharing the Reader suffix).

Alternative Consideration:

I want to summarize the patterns I've seen in the 3 years of using these APIs "in anger" writing and reviewing lots of parsers (both for play and production). There are only a handful of patterns that need supporting to work for a extremely LARGE range of scenarios:

SequenceReader IMO solves the problem of walking the ReadOnlySequence<byte> in sane way, once you have narrowed your way down to a single token, the next step is to materialize into some other form. If It's unclear what I mean, here are some examples:

Utf8JsonReader parsing the value "true" from a ReadOnlySequence (https://github.com/dotnet/corefx/blob/2730c1b14bc089468f37f7ef8adb730246787d73/src/System.Text.Json/src/System/Text/Json/Reader/Utf8JsonReader.MultiSegment.cs#L515) - This code inlines all of the logic for maximum performance.
The FormParser parsing a form name - https://github.com/aspnet/AspNetCore/blob/dba39d60c89b743d45411ba6bc6b02b5a8019032/src/Http/WebUtilities/src/FormPipeReader.cs#L316-L344
Materializing a Header - https://github.com/aspnet/AspNetCore/blob/dba39d60c89b743d45411ba6bc6b02b5a8019032/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#L293
Parsing an HTTP/2 header frame - https://github.com/aspnet/AspNetCore/blob/dba39d60c89b743d45411ba6bc6b02b5a8019032/src/Servers/Kestrel/Core/src/Internal/Http2/Http2FrameReader.cs#L43-L44

IMO I do think the ideal solution is to get to a span as quickly as possible, then parse (if the data isn't large),

The options are:

Parse byte by byte or span by span. Some parsers do this (like JSON) and it can be faster but results in re-implementing various primitive parsers. This is usually still required when parsing a custom primitive format we don't ship OOTB.
Add more overloads for things to take ReadOnlySequence, make those APIs do the fast path slow path thing - This is the most usable but requires adding alot more overloads.
Add more overloads to SequenceReader, but that means making a SequenceReader from a ReadOnlySequence when we want to create a primitive type (see my comments above about this being wasteful).
Copy the ReadOnlySequence to a Span, then use that span to parse. Incurs copies potentially on every read which defeats some of the purpose of this approach.
Copy only in the multi-segment case (this has hybrids like stackalloc if small enough and array pool otherwise, will will use the heap if it's too big).

Hybrids of the above approaches are also valid and I've seen them in the wild.

Maybe it's reasonable for primitives like this to incur the "maybe copy" by using overloads like this https://github.com/dotnet/corefx/issues/42252. If we combine this with a further API to do what @scalablecory suggests (but maybe on the reader as well), it might be the best of all worlds but I want to get your feedback.

Thoughts?

jkotas commented 5 years ago

I do think the ideal solution is to get to a span as quickly as possible, then parse

Agree.

Parse byte by byte or span by span. Some parsers do this (like JSON) and it can be faster but results in re-implementing various primitive parsers.

Yes, that is the way to implement high-performance production-grade parsers. I do not think any amount of helper APIs is going to change that.

Copy the ReadOnlySequence to a Span Copy only in the multi-segment case

Yes, I think APIs along these lines are the most powerful building blocks for lower-performance handrolled parsers that just need to get the job done.

For example, I think it would be useful to have ReadLine API for SequenceReader<byte/char> that returns Span and allocates the buffer to cover the splits as necessary. This Span can be then sliced and diced as necessary using regular Span APIs, including number parsing Span APIs. This is the same pattern one uses with TextReader today.

jaredpar commented 5 years ago

I agree ref interfaces or ref constrains would be useful.

In terms of language design ref interfaces / constraint is a fairly straight forward extension of our existing work. The ref constraint in particular was discussed at the time of ref struct and the behavior is mostly understood.

The tricky part is, as per usual, all features start at -100 so need to justify the cost is worth it. :smile:

I have a doc I'm already working on for improvements to ref structs that I hope to start sharing around in late November (just booked solid until then). I will add ref constraints and interfaces to the list.

jaredpar commented 5 years ago

@davidfowl

Instead of

bool TryParseInt(IByteEnumerator enumerator, out int value)

I think we'd effectively need

bool TryParseInt<T>(T enumerator, out int value)
  where T: ref struct, IByteEnumerator

Note: changed ByteEnumerator to IByteEnumerator to make it more clear this is meant to be an interface (at least in my sample)

The method can't be non-generic because that would imply that IByteEnumerator is a boxed value. Even if we supported ref interface instead of allowing ref struct to implement normal interfaces my expectation is a non-generic instance would always require some form of boxing and hence be illegal.

Is that aligned with everyone else's expectations here?

scalablecory commented 5 years ago

@jaredpar yes.

davidfowl commented 5 years ago

@jaredpar yes.

davidfowl commented 5 years ago

@jkotas

Yes, that is the way to implement high-performance production-grade parsers. I do not think any amount of helper APIs is going to change that.

I would beware about how we use the term "production-grade", I would swap that with high performance. But I agree. Arguably, byte by byte parsers aren't any more difficult with ReadOnlySequence/SequenceReader. The forward only nature does make it hard to jump around though.

See https://github.com/dotnet/corefx/blob/279292804fa4f6cb856577e59246ac934a2ef9e5/src/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnection.cs#L1262, which uses Array.IndexOf to scan for tokens.

Yes, I think APIs along these lines are the most powerful building blocks for lower-performance handrolled parsers that just need to get the job done.

I think the reader already exposes enough to do this.

For example, I think it would be useful to have ReadLine API for SequenceReader<byte/char> that returns Span and allocates the buffer to cover the splits as necessary.

What happens if the line is really long? Just allocate a long byte[] or return another ReadOnlySequence<byte/char>? That's the crux of the issue here. When you've found a single token, what form do you get it in and how to do materialize it. Your answers still haven't covered that scenario.

This Span can be then sliced and diced as necessary using regular Span APIs, including number parsing Span APIs. This is the same pattern one uses with TextReader today.

The difference being that SequenceReaeder doesn't have to allocate a buffer, that might be the difference. It's similar to the question above, which remains unanswered. If tokens are small enough, Span is fine, otherwise you can get another ReadOnlySequence and futher slice that up into spans. The problem with the latter approach is that it requires getting another SequenceReader.

As an example:

Key2=Value1\r\n
Key2=Value2\r\n
\r\n

Take this imaginary format, you can write the following code:

void ParsePair(ref SequenceReader<byte> reader, out string key, out string value)
{
    // This may allocate
    reader.TryReadTo(out ReadOnlySpan<byte> keySpan, '='); 

    // This may allocate
    reader.TryReadTo(out ReadOnlySpan<byte> valueSpan, NewLine);

    key = Encoding.UTF8.GetString(keySpan);
    value  = Encoding.UTF8.GetString(valueSpan);
}

OR you can get the entire line:

void ParsePair(ref SequenceReader<byte> reader, out string key, out string value)
{
    // After we add https://github.com/dotnet/corefx/issues/42252
    reader.TryReadTo(out ReadOnlySpan<byte> line, NewLine); // This might allocate

    int eq = line.IndexOf('=');

    ReadOnlySpan<byte> keySpan = s[0..eq];
    ReadOnlySpan<byte> valueSpan = s[(eq + 1)..^2];

    key = Encoding.UTF8.GetString(keySpan);
    value  = Encoding.UTF8.GetString(valueSpan);
}

I think either approach is fine but one may be more prone to allocations depending on the maximum size of the token and the segment size of the ReadOnlySequence.

davidfowl commented 5 years ago

Here's a summary of the 3 approaches I think we could take and the pros and cons of each:

1. ReadOnlySpan all the things

Pros:

No need to add overloads, ReadOnlySpan<byte/char> is enough
Code is simpler for the caller
Cons:
Potentially large allocations depending on
1. The maximum token size (imagine a single large token e.g. 32K bytes long)
2. The size of segments in the ReadOnlySequence

void Parse(ref SequenceReader<byte> reader)
{
    // This might allocate a large byte[]
    if (!reader.TryReadTo(out ReadOnlySpan<byte> line, NewLine))
    {
        return;
    }

    if (Utf8Parser.TryParse(line, out int value, out int bytesConsumed))
    {
        line = line.Slice(bytesConsumed);
    }
}

2. SequenceReader all the things

Pros:

Does not allocate

Cons:

Expensive to create another SequenceReader on the line (it is a big struct)
Requires adding lots of methods to SequenceReader to parse textual data
May not be as efficient as ReadOnlySpan<byte/char> equivalent

void Parse(ref SequenceReader<byte> reader)
{
    // This doesn't allocate
    if (!reader.TryReadTo(out ReadOnlySequence<byte> line, NewLine))
    {
        return;
    }

    var lineReader = new SequenceReader<byte>(line);

    if (lineReader.TryParse(line, out int value))
    {
        // 0 allocations
    }    
}

3. ReadOnlySequence all the things

Pros:

Does not allocate

Cons:

Requires adding overloads that take ReadOnlySequence<byte/char> to various parts of the BCL
May not be as efficient as ReadOnlySpan<byte/char> equivalent

void Parse(ref SequenceReader<byte> reader)
{
    // This doesn't allocate
    if (!reader.TryReadTo(out ReadOnlySequence<byte> line, NewLine))
    {
        return;
    }

    if (Utf8Parser.TryParse(line, out int value, out SequencePosition bytesConsumed))
    {
        line = line.Slice(bytesConsumed);
    }
}

jkotas commented 5 years ago

Requires adding lots of methods to SequenceReader to parse textual data Requires adding overloads that take ReadOnlySequence<byte/char> to various parts of the BCL

This is same "lots of methods" in both cases, right?

davidfowl commented 5 years ago

@jkotas but different because one is all on a single type VS being on various types in the BCL.

jkotas commented 5 years ago

You need a concrete T for these parsers that makes it impossible to add them as instance methods to SequenceReader<T> itself. So they would be extension methods, spread over as may types as you would like.

davidfowl commented 5 years ago

You need a concrete T for these parsers that makes it impossible to add them as instance methods to SequenceReader itself. So they would be extension methods, spread over as may types as you would like.

Logically they hang off SequenceReader<T>, which is different than changing lots of other places.

davidfowl commented 5 years ago

I spent a little time writing a something to parse a basic http/1.1 request start line, no error handling, best case scenario and got these results:

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-alpha1-015523
  [Host]     : .NET Core 3.1.0 (CoreCLR 4.700.19.52202, CoreFX 4.700.19.52317), X64 RyuJIT
  DefaultJob : .NET Core 3.1.0 (CoreCLR 4.700.19.52202, CoreFX 4.700.19.52317), X64 RyuJIT

|            Method |     Mean |    Error |   StdDev | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------ |---------:|---------:|---------:|------:|--------:|-------:|------:|------:|----------:|
|     SequenceParse | 161.1 ns |  3.22 ns |  3.16 ns |  1.25 |    0.04 | 0.0477 |     - |     - |     200 B |
| StreamReaderParse | 498.2 ns | 26.79 ns | 32.90 ns |  3.89 |    0.25 | 0.8678 |     - |     - |    3632 B |
|         SpanParse | 128.6 ns |  2.65 ns |  2.95 ns |  1.00 |    0.00 | 0.0477 |     - |     - |     200 B |
| SequenceFastParse | 148.1 ns |  3.01 ns |  4.22 ns |  1.16 |    0.04 | 0.0477 |     - |     - |     200 B |
|   ByteByByteParse | 205.9 ns |  3.57 ns |  3.34 ns |  1.60 |    0.05 | 0.0401 |     - |     - |     168 B |

https://gist.github.com/davidfowl/323a5a037700ad49692b97c0afff69f6

The StreamReader is obviously a bit of a troll but I wanted to show that even though the SequenceReaeder is slower than the direct Span API, it doesn't allocate it's possible to write pretty convenient APIs that are reasonably fast.

The SequenceFastParse also shows that with a little more effort, it can be brought a bit closer to Span performance at the cost of a bit more complex code (slow path, fast path).

GrabYourPitchforks commented 4 years ago

@davidfowl The benchmark sample you linked doesn't use the TryParse methods you were proposing. Was this intentional?

davidfowl commented 4 years ago

Yes (but I can add it if need be), the conversation is now about what "production ready" parsers look like. I stand by what i said earlier in this thread and I just wanted to have some concrete data to back it up.

I want to close on a strategy based on these options here https://github.com/dotnet/corefx/issues/42255#issuecomment-549705509 and then push forward with additions in that direction.

scalablecory commented 4 years ago

@davidfowl I think your byte-by-byte parser isn't realistic. It should have an implementation similar to SequenceReader.

davidfowl commented 4 years ago

@scalablecory Send me the codez and I'll update the benchmark.

scalablecory commented 4 years ago

@davidfowl Here's more what I had in mind -- faster than your version, but JIT unfortunately defeats much of the benefit... this is basically "free" in C++ but has considerable overhead here. Imagine the two helper methods are a single generic one (they have same code).

https://gist.github.com/scalablecory/4b1320044b911c5795531731d3d86f6a

davidfowl commented 4 years ago

@GrabYourPitchforks to your point, will add an example of the a basic Utf8Parser.TryParse with the various techniques to help answer @jkotas 's questions around performance. I'm coming back to just adding the originally proposed overloads based on the existing data but lets try some more things.

jkotas commented 4 years ago

@jkotas 's questions around performance

The question for me is whether the cost of these new APIs (confusion about what are the right parsing APIs to use, code-bloat associated with having yet another set of parsing APIs for everything, ...) is worth it to avoid some GC heap allocations in less common cases.

dotnet / runtime

API Proposal: Utf8Parser overloads for ReadOnlySequence<byte> #31357

1. ReadOnlySpan all the things

Pros:

Cons:

2. SequenceReader all the things

Pros:

Cons:

3. ReadOnlySequence all the things

Pros:

Cons: