datalust / superpower

A C# parser construction toolkit with high-quality error reporting
Apache License 2.0

Some way to check the next token #156

Closed by labsin 1 year ago

labsin commented 1 year ago

Motivation

I'm writing a parser for a filter that has expressions. There are more characters allowed in TextString than in Identifier.

Minimal example:

new TokenizerBuilder<FilterToken>()
    .Ignore(Span.WhiteSpace)
    .Match(Character.EqualTo('='), FilterToken.Exact)
    .Match(Character.Letter.AtLeastOnce(), FilterToken.Identifier)
    .Match(Character.LetterOrDigit.Or(Character.EqualTo('.')).AtLeastOnce(), FilterToken.TestString)
    .Build();

Now any TestString will get matched as Identifier unless it has a period. If I change the order, all Identifiers can get matched as TestString.

Is there a supported way to check the remainder without consuming it?

For now I've implemented it like this:

// Succeeds with the result of lhs only if rhs would also match right after it.
// The input matched by rhs is not consumed.
public static TextParser<T> Peek<T, U>(this TextParser<T> lhs, TextParser<U> rhs)
{
    return delegate (TextSpan input)
    {
        Result<T> result = lhs(input);
        if (!result.HasValue)
        {
            // lhs itself did not match.
            return Result.CastEmpty<T, T>(result);
        }

        // Try rhs on what follows lhs; its result is only inspected, never returned.
        Result<U> peek = rhs(result.Remainder);
        if (peek.HasValue)
            return result;

        // Lookahead failed, so report no match.
        return Result.CastEmpty<T, T>(result);
    };
}

private static readonly Tokenizer<FilterToken> Tokenizer =
    new TokenizerBuilder<FilterToken>()
        .Ignore(Span.WhiteSpace)
        .Match(Character.EqualTo('='), FilterToken.Exact)
        // Identifier only matches when it is immediately followed by '='.
        .Match(Character.Letter.AtLeastOnce().Peek(Character.EqualTo('=')), FilterToken.Identifier)
        .Match(Character.LetterOrDigit.Or(Character.EqualTo('.')).AtLeastOnce(), FilterToken.TestString)
        .Build();

Would a method like Peek be of use in the library?

nblumhardt commented 1 year ago

Hi! Thanks for dropping us a line. It sounds more like this one might be better addressed by rethinking the token types in your grammar; if TextString and Identifier are syntactically ambiguous, then collapsing those into a single token type might be the way to go? You can then distinguish between them later on in the parser, based on their contextual position in the input. Hope this helps!
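
For readers following along, here is a rough sketch of what collapsing the two token types could look like. It is not code from the thread: the single `FilterToken.Text` kind, the `FilterParser` class, the `Comparison` rule, and the "letters only" check for identifiers are all assumptions made for illustration, based on the minimal example above.

```csharp
using System.Linq;
using Superpower;
using Superpower.Parsers;
using Superpower.Tokenizers;

// Hypothetical token kinds: Identifier and TestString are merged into Text.
public enum FilterToken { Exact, Text }

public static class FilterParser
{
    // One lexical rule covers both cases, so the tokenizer has no ambiguity to resolve.
    private static readonly Tokenizer<FilterToken> Tokenizer =
        new TokenizerBuilder<FilterToken>()
            .Ignore(Span.WhiteSpace)
            .Match(Character.EqualTo('='), FilterToken.Exact)
            .Match(Character.LetterOrDigit.Or(Character.EqualTo('.')).AtLeastOnce(), FilterToken.Text)
            .Build();

    // Left of '=': a Text token is accepted as an identifier only if it is all letters
    // (assumed rule, mirroring Character.Letter.AtLeastOnce() in the original tokenizer).
    private static readonly TokenListParser<FilterToken, string> Identifier =
        Token.EqualTo(FilterToken.Text)
            .Select(t => t.ToStringValue())
            .Where(s => s.All(char.IsLetter));

    // Right of '=': any Text token is accepted as-is.
    private static readonly TokenListParser<FilterToken, string> Value =
        Token.EqualTo(FilterToken.Text).Select(t => t.ToStringValue());

    // The position relative to '=' decides how each Text token is interpreted.
    private static readonly TokenListParser<FilterToken, (string Name, string Value)> Comparison =
        from name in Identifier
        from eq in Token.EqualTo(FilterToken.Exact)
        from value in Value
        select (Name: name, Value: value);

    public static (string Name, string Value) Parse(string input) =>
        Comparison.Parse(Tokenizer.Tokenize(input));
}
```

With this shape, something like `FilterParser.Parse("name = foo.1")` would tokenize both sides as `Text` and let the parser decide which side is the identifier. Whether this fits depends on how far the real grammar differs from the minimal example.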

labsin commented 1 year ago

There ended up being more differences in the real code than in the example, and the solution I posted is the simplest and most reasonable one. So I'm now using the code I posted, and it seems to work for all my use cases.

If there is no feature for this right now and no interest in adding one to the library, feel free to close the issue.

nblumhardt commented 1 year ago

Thanks, Sam 👍