Migrate the parser to the language definition v2

Xanewok commented 11 months ago

At the moment we use our Rust macros 1.0 (aka DSL v1) to define the language for our parser. We already have:

YAML spec
DSL v1
DSL v2

So to cut down the different definitions, we should migrate to the newest definition (see https://github.com/NomicFoundation/slang/pull/617).

Migration PR:

[X] #650

Impacts the AST:

[X] https://github.com/NomicFoundation/slang/pull/693
[X] https://github.com/NomicFoundation/slang/pull/698
[X] https://github.com/NomicFoundation/slang/pull/699
- Clean up PrecedenceExpression#rule_name and adapt the codegen model to fit the v2 definition better
- Precedence items should also produce their own NonTerminalKind kind, wrapping the single child (PrecedenceExpression or PrimaryExpression).
- PrecedenceExpression should get their own parse_XXX() function.

Does not impact the AST:

~[ ] Re-introduce EOF/EOI (End of Input) virtual token/trivia parser~
~[ ] Change LeadingTrivia definition in v2 to allow being terminated by EOI as well (changed in #650)~ (tracked in #1020)
[x] Don't monkey-patch SourceUnit rule to always be followed by EndOfFileTrivia (copies v1; trivia model in v2 is a bit different)
[x] Always try to parse ZeroOrMore(AnyTriviaItem) at the end when parsing any production with pub fn parse
[x] Stop leaking identifiers by adapting either v1 or v2 models (#977)
[x] Keyword trie inclusion should be reworked to not require synthetic rules over all keywords (v1 model) (#1018)
[x] Revert https://github.com/NomicFoundation/slang/pull/646 (done in #800)
[x] Ordering of parsing statements variants (context).~ (separated into #1019)
[x] See if we can remove Scanner::not_followed_by completely and automatically calculate it instead (context).
[x] Remove redundant versioning checks in precedence expressions (context).
[x] Precedence individual operators need to be deduped across versions (context).
[X] Delete codegen_grammar crate. (https://github.com/NomicFoundation/slang/pull/972)

Xanewok commented 4 months ago

While implementing #1003 I started thinking whether using v2 types only makes sense for our existing parser generator.

The goal of #1003 and this issue in general was to reduce duplication between the very similar types and get rid of the bridging types whenever possible.

While it was enabled by the mechanical and lexer-oriented design of model::Scanner, I'm worried that it might not be possible to do the same with ParserDefinitionNode because of the higher level of abstraction of the v2 types for structs, choices and precedence operators.

The are a couple of points here:

The v1 model for the parser definition was very generic and dynamic in a good way - I feel that trying to shoehorn the v2 to fit into it will make it harder to reason about and make it less composable
Moreover, because of the mutually recursive nature of the parser (and how v2 inherently differs from v1: flat list of definitions vs recursively defined parsers via DSL macro) we need to resolve/lower the v2 definition anyway into a structured representation
The structured representation is required to perform some analysis for the purposes of codegen, like reachable terminals in a given lexical context etc.

Having the above in mind, while there are some unnecessary translation hacks still that I'd like to smooth out, I would probably reconsider dropping the v1 interface entirely.

cc @AntonyBlakey @OmarTawfik for visibility

Xanewok commented 3 months ago

@OmarTawfik from what I understand:

we don't need the end of input token (first item) and
the "Change LeadingTrivia definition in v2 to allow being terminated by EOI as well" (second item) is somewhat subsumed by us trying to always parse the trivia after the end of the production.

If not, then, I'd propose that we treat it as general improvement to the trivia handling rather than being a core component of the DSL v2 (we did already migrate the parser to the DSL v2 and we're happy with the current behavior) and we should probably track it separately if we're not already happy with what we have, as per above.

The listed scope for this issue almost feels like a quarterly KR and I'd love if we could finally close it, once #1018 is merged (and we revert the string rule) :sweat_smile:

OmarTawfik commented 3 months ago

we don't need the end of input token (first item)

sounds good.

is somewhat subsumed by us trying to always parse the trivia after the end of the production.

IIRC, the hack we added is to "reuse" leading trivia as EOF trivia, and the proper fix was to construct a specific EOF parser in codegen, with its definition being Choice([all trivia items in the grammar]) and run it after each successful parse. This allows LeadingTrivia definition to remain true/accurate to DSL v2, and remove the TODO(#638) comments in the definition.

Is that still the case? If so, it is definitely not blocking, and I would not prioritize it over other important tasks we have, but I think we should still do it. If you have a full plate, please feel free to put it in a separate issue and assign it to me, and we can close this one.

Xanewok commented 3 months ago

I remember that we did discuss what to do with trailing trivia and that it's not so obvious, as for example

// ...snip
}

/* some multi
line comment */

would mean that this conceptually would belong to the } token and this prompts the question of how to format it and adds inconsistency as our trailing trivia in general cannot span multiple lines.

The way we have this setup is that the trailing trivia has to be a conceptually a subset of the leading trivia and thus the leading trivia is the trivia parser, which means if we want to accept all of the trivia items at EOF, then the leading trivia definition is correct for that use case and not a stretch, from what I understand (but, to be clear, we do have a "hack" of implicitly parsing the leading trivia rather than it having a dedicated definition in the definition.rs).

If we want to use the DSLv2's definition of (as added by this PR)

        ZeroOrMore(Choice([
            Trivia(Whitespace),
            Trivia(SingleLineComment),
            Trivia(MultilineComment)
        ])),
        Choice([Trivia(EndOfLine), EndOfInput])

then for it to be correct, we'd have to change it to something like:

    trailing_trivia = Choice([
        // A single line comment
        Sequence([
            Optional(Trivia(Whitespace)),
            Optional(Choice([
                Trivia(SingleLineComment),
                Trivia(MultilineCommentThatDoesNotContainAnEndOfLine)
            ])),
            Trivia(EndOfLine),
        ])
        // De facto end-of-file trivia
        // (separated in order to disallow multi-line trivias as trailing in the general case)
        Sequence([
            ZeroOrMore(Choice([
                /* trivias listed in leading_trivia */
            ])),
            Trivia(EndOfFile),
        ])
    ]),

At which point we're de facto reintroducing end_of_file_trivia, now that I think about it. Given that it's different enough from the regular trailing trivia, should we just bite the bullet and be more explicit about it to not confuse the user and be clear about expectations?

Introducing a dedicated notion of end_of_file trivia would also mean we don't need to introduce this virtual EndOfFile trivia/token, which we all agreed at some point to not have magical/synthesized token in the definition that can't be directly parsed.

@AntonyBlakey @OmarTawfik I don't think we did write down our different conclusions and rationale while we were discussing this on multiple occasions. If you're fine with it, I'd propose that we separate an issue/thread dedicated to it as I feel we had already too many items bundled in this issue and the decision is not clear-cut at this moment.

Xanewok commented 3 months ago

It seems that "Revert https://github.com/NomicFoundation/slang/pull/646" was already done in https://github.com/NomicFoundation/slang/pull/800.

Xanewok commented 3 months ago

I've opened #1020 to track this.

OmarTawfik commented 3 months ago

If you're fine with it, I'd propose that we separate an issue/thread dedicated to it as I feel we had already too many items bundled in this issue and the decision is not clear-cut at this moment.

Sounds good. Let's follow up on #1020. Thanks!

NomicFoundation / slang

Migrate the parser to the language definition v2 #638