dlang-community / Pegged

A Parsing Expression Grammar (PEG) module, using the D programming language.

Recreating D's identifier-delimited string literals #342

Closed by ichordev 10 months ago

ichordev commented 10 months ago

D has delimited strings, which can start and end with an arbitrary identifier:

writeln(q"MESSAGE
The line-break before this message is
consumed, but the one after it is not.
MESSAGE");

Is it possible to parse something like this in Pegged? And if so, how?

Keep in mind that the string only ends if the identifier matches the one at the start of the string:

writeln(q"HELLO
Hey there!
GOODBYE"
The string is still going, but GitHub has
profoundly terrible syntax highlighting! :)
HELLO");

veelo commented 10 months ago

Not directly, unless you know the delimiter at compile time:

/+dub.sdl:
dependency "pegged" version="~>0.4.9"
+/
import pegged.grammar;
import std.stdio;

mixin(grammar(`
DelimitedString(Delimiter):
    String        <~ :"q" :doublequote :Delimiter :eol (!(eol Delimiter) .)* eol :Delimiter :doublequote
`));

void main()
{
    auto parseTree = DelimitedString!(literal!"HELLO")(`q"HELLO
This is a HELLO delimited string
HELLO"`);

    writeln(parseTree);
}

Prints:

DelimitedString[0, 47]["This is a HELLO delimited string\n"]
 +-DelimitedString.String[0, 47]["This is a HELLO delimited string\n"]

If you don't know the delimiter at compile time, you could use a semantic action to check that the opening and closing delimiters are equal. But since delimited strings may contain unmatched double quotes, the grammar would have no way of knowing when to stop consuming input.

But I think it is possible to write a parser for delimited strings by hand (or steal it from the DMD lexer) and use it in a grammar, the same way I use the predefined parsers doublequote and eol above. See also grammar composition. I have never tried that, though.
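
For illustration, the hand-written scanning step could look roughly like the sketch below. It uses only Phobos and is not tied to Pegged. The names Heredoc and scanHeredoc are hypothetical, and the sketch makes simplifying assumptions: it handles only identifier delimiters (not q"(...)" or q"/.../"), it expects "\n" line endings, and it treats the literal as closed only where the delimiter starts a line and is immediately followed by a double quote.

```d
import std.ascii : isAlpha, isAlphaNum;
import std.string : indexOf;

/// Result of scanning one identifier-delimited string (hypothetical type).
struct Heredoc
{
    bool ok;        // true if a complete literal was found
    string content; // text between the opening line break and the closing delimiter
    size_t length;  // characters consumed, including q" and the closing "
}

/// Scans a q"IDENT ... IDENT" literal at the start of `input` (hypothetical helper).
Heredoc scanHeredoc(string input)
{
    // The literal must start with q" followed by an identifier start character.
    if (input.length < 4 || input[0 .. 2] != `q"`) return Heredoc.init;
    if (!(isAlpha(input[2]) || input[2] == '_')) return Heredoc.init;

    // The delimiter is the identifier that runs up to the first line break.
    size_t i = 3;
    while (i < input.length && (isAlphaNum(input[i]) || input[i] == '_')) ++i;
    if (i >= input.length || input[i] != '\n') return Heredoc.init;
    const delim = input[2 .. i];

    // Assumption: the literal ends at the first line that starts with
    // the delimiter immediately followed by a double quote.
    const closer = "\n" ~ delim ~ `"`;
    const pos = input[i .. $].indexOf(closer);
    if (pos < 0) return Heredoc.init;

    const closeStart = i + pos + 1; // index of the closing delimiter
    return Heredoc(true, input[i + 1 .. closeStart], closeStart + delim.length + 1);
}
```

Because the delimiter is read from the input at run time, a line such as GOODBYE" inside a q"HELLO string is simply part of the content, and the scan continues until a line starting with HELLO" is found.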

veelo commented 10 months ago

It seems there is a Wiki page for that already: https://github.com/PhilippeSigaud/Pegged/wiki/User-Defined-Parsers

veelo commented 10 months ago

So the answer is YES!

A working example: [EDITED for offset error]

/+dub.sdl:
dependency "pegged" version="~>0.4.9"
+/
import pegged.grammar;
import std.stdio, std.algorithm;

void main()
{
    auto parseTree1 = Strings(`q"(foo(xxx))"`);
    writeln(parseTree1);

    auto parseTree2 = Strings(`q"/foo]/"d`);
    writeln(parseTree2);

    auto parseTree3 = Strings(`q"MESSAGE
The line-break before this message is
consumed, but the one after it is not.
MESSAGE"`);
    writeln(parseTree3);

    auto parseTree4 = Strings(`q"HELLO
This is a HELLO delimited string.
Double quotes " may be unbalanced!
HELLO"`);
    writeln(parseTree4);
}

mixin(grammar(`
Strings:
    String  <- ( DelimitedString('(', ')')
               / DelimitedString('[', ']')
               / DelimitedString('{', '}')
               / DelimitedString('<', '>')
               / delimitedString ) StringPostfix?

    DelimitedString(Delimiter, MatchingDelimiter) <~ :"q" :doublequote :Delimiter
              DelimitedCharacters(Delimiter, MatchingDelimiter)*
              :MatchingDelimiter :doublequote
    DelimitedCharacters(Delimiter, MatchingDelimiter) <- Delimiter DelimitedCharacters(Delimiter, MatchingDelimiter)* MatchingDelimiter
                                                       / WysiwygCharacter(MatchingDelimiter) DelimitedCharacters(Delimiter, MatchingDelimiter)
                                                       / WysiwygCharacter(MatchingDelimiter)
    WysiwygCharacter(MatchingDelimiter) <- !MatchingDelimiter .
    StringPostfix <- "c" / "w" / "d"
`));

// Our user defined parser handles run-time delimiters:
@safe ParseTree delimitedString(ParseTree p) pure nothrow
{
    if (p.end + 3 < p.input.length &&
        p.input[p.end..p.end+2] == `q"`)
    {
        try
        {
            if (!or!(charRange!('a','z'),charRange!('A','Z'))(p.input[p.end+2..p.end+3]).successful)
            { // q"/foo]/"
                const delim = p.input[p.end+2..p.end+3];
                auto end = p.input[p.end+3 .. $].countUntil(delim);
                if (end < 0) goto fail;
                end += p.end+3;
                return ParseTree("delimitedString",
                                 true,
                                 [p.input[p.end+3 .. end]],
                                 p.input,
                                 p.end,
                                 end + 2);
            }
            else
            { // heredoc
                auto delimEnd = p.input[p.end..$].countUntil('\n');
                if (delimEnd < 3) goto fail;
                delimEnd += p.end;
                const delim = p.input[p.end+2 .. delimEnd];
                auto matchingDelimStart = p.input[delimEnd..$].countUntil('\n' ~ delim);
                if (matchingDelimStart < 0) goto fail;
                matchingDelimStart += 1 + delimEnd;
                return ParseTree("delimitedString",
                                 true,
                                 [p.input[delimEnd+1..matchingDelimStart]],
                                 p.input,
                                 p.end,
                                 matchingDelimStart + delim.length + 1);
            }
        } catch (Exception) goto fail;
    }
fail:
    return ParseTree("delimitedString", false, ["delimited string"], p.input, p.end, p.end);
}
@safe ParseTree delimitedString(string input) pure nothrow
{
    return delimitedString(ParseTree("", false, [], input));
}
@safe string delimitedString(GetName g) pure nothrow
{
    return "delimitedString";
}

Prints:

Strings[0, 13]["foo(xxx)"]
 +-Strings.String[0, 13]["foo(xxx)"]
    +-Strings.DelimitedString!(literal!("("), literal!(")"))[0, 13]["foo(xxx)"]

Strings[0, 10]["foo]", "d"]
 +-Strings.String[0, 10]["foo]", "d"]
    +-Strings.StringPostfix[9, 10]["d"]

Strings[0, 96]["The line-break before this message is\nconsumed, but the one after it is not.\n"]
 +-Strings.String[0, 96]["The line-break before this message is\nconsumed, but the one after it is not.\n"]

Strings[0, 84]["This is a HELLO delimited string.\nDouble quotes \" may be unbalanced!\n"]
 +-Strings.String[0, 84]["This is a HELLO delimited string.\nDouble quotes \" may be unbalanced!\n"]

ichordev commented 10 months ago

> So the answer is YES!
>
> A working example: [EDITED for offset error] [...]

This is very helpful, thank you! I should be able to use your example there as a starting point for how to read the ParseTree with UDFs. :)