darklang / dark

Darklang main repo, including language, backend, and infra
https://darklang.com

Component: Parsing, Pretty-Printing #5259

Open StachuDotNet opened 8 months ago

StachuDotNet commented 8 months ago

This Issue exists to collect many items that relate to Dark's parser(s), pretty-printer(s), name resolution, etc.

Here's our current state:

These are tasks currently available to be worked on:

Once the tree-sitter grammar and parser have 'caught up' with our full language:

Once that is done, we can tackle the fun stuff:

All of these tasks are worth some discussion, either here or in Discord, before starting.

StachuDotNet commented 6 months ago

Copying this from some thoughts I posted on Discord recently:

tl;dr: is tree-sitter really the best tool for our parser, or should we reconsider writing a parser combinator thing in Darklang?

The way we're currently set up for the new/tree-sitter parser is:

  A. write Darklang source code
  B. use tree-sitter and tree-sitter-darklang to parse to tree-sitter's internal representation of the syntax tree
  C. map that to a Dark type, "ParsedNode", via a built-in function (the type: https://github.com/darklang/dark/blob/a68b808eb35d671e3921ce30ca357a67e166a995/packages/darklang/languageTools/parser.dark#L11-L27; the builtin fn: https://github.com/darklang/dark/blob/a68b808eb35d671e3921ce30ca357a67e166a995/backend/src/BuiltinExecution/Libs/Parser.fs#L37)
  D. map ParsedNode to WrittenTypes

Those WrittenTypes are then used:

  • to map to ProgramTypes, where relevant
  • to map to semantic tokens, for VS Code syntax highlighting
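
To make the C -> D step concrete, here is a minimal sketch in Python rather than Dark. Everything in it is an assumption made for illustration: the `ParsedNode` fields, the node kind names (`int_literal`, `infix_operation`, `operator`), and the `EInt`/`EInfix` types are hypothetical stand-ins, not the real definitions in parser.dark or WrittenTypes.

```python
from dataclasses import dataclass, field
from typing import List, Union


# Hypothetical stand-in for the generic ParsedNode produced in step C.
@dataclass
class ParsedNode:
    typ: str   # grammar node kind, e.g. "int_literal" (assumed name)
    text: str  # source text covered by this node
    children: List["ParsedNode"] = field(default_factory=list)


# Hypothetical, heavily reduced stand-ins for WrittenTypes expressions.
@dataclass
class EInt:
    value: int


@dataclass
class EInfix:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[EInt, EInfix]


def to_written_expr(node: ParsedNode) -> Expr:
    """Step D: turn an untyped syntax node into a typed expression."""
    if node.typ == "int_literal":
        return EInt(int(node.text))
    if node.typ == "infix_operation":
        # Assumes the grammar yields exactly [left, operator, right].
        left, op, right = node.children
        return EInfix(op.text, to_written_expr(left), to_written_expr(right))
    raise ValueError(f"unsupported node type: {node.typ}")


# A hand-built tree standing in for what steps A-C would produce for "1 + 2".
tree = ParsedNode("infix_operation", "1 + 2", [
    ParsedNode("int_literal", "1"),
    ParsedNode("operator", "+"),
    ParsedNode("int_literal", "2"),
])
result = to_written_expr(tree)
```

Every node kind the grammar emits needs a branch like these, which is where the maintenance cost discussed below comes from.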

I've been questioning whether depending on tree-sitter for all of our parsing is a good idea.

An alternative would be to write the parser in Darklang instead, potentially as a wrapper around (or equivalent of) Farkle or FParsec, via minimal Builtins. (relevant links:

Here are some potential trade-offs to consider:

  • the current A->B step:
    • requires us to build tree-sitter as an .so, as well as our grammar's .so. This is all set up now, but it adds a few seconds to each build, especially in CI.
    • requires our cli app to be ~1MB larger, to package those .sos along with our exe
    • requires a fancy extract-and-load setup to use both of those at run-time ()
  • the current C->D step:
  • we're broadly missing out on immediate feedback throughout the process. We wait for the parser to be built, and then have to follow each grammar change with matching changes to the ParsedNode -> WrittenTypes functions. And every grammar upgrade depends on a full build/release cycle (waiting for CI, etc.) to get things to users
  • I've no clear path forward on versioning the parser with our language in a reasonably seamless way, whereas an in-Dark solution would allow us to properly version the parser fns like anything else in the package manager.
  • our current setup provides only one big parser for a 'file', but what if we want to allow/disallow different parseable things when parsing a Canvas vs. a Script, etc.? I've been hoping we'd figure out a proper solution for that eventually, but everything I've come up with so far feels like a hack (i.e. passing a 'header' to the tree-sitter grammar where we). I think the composability of a parser combinator would prepare us for these scenarios much better.
  • broadly, it feels like we're doing (more than) double-work: we're writing the grammar.js, which builds into a parser, and writing a bunch of "parser.dark code" to map that back to WrittenTypes.
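
As a rough illustration of the composability point above, here is a toy parser-combinator sketch in Python (not Dark, and not Farkle or FParsec). The one-line `<kind> <name>` declaration syntax, and the idea that a Script rejects `handler` declarations while a Canvas accepts them, are assumptions made up for the example.

```python
from typing import Any, Callable, Optional, Tuple

# A parser takes (text, position) and returns (value, new position),
# or None if it does not match at that position.
Parser = Callable[[str, int], Optional[Tuple[Any, int]]]


def declaration(kind: str) -> Parser:
    """Match one line of the form '<kind> <name>' and yield (kind, name)."""
    def parse(text: str, pos: int):
        prefix = kind + " "
        if not text.startswith(prefix, pos):
            return None
        end = text.find("\n", pos)
        if end == -1:
            end = len(text)
        name = text[pos + len(prefix):end].strip()
        # Skip past the newline, but never beyond the end of the input.
        return (kind, name), min(end + 1, len(text))
    return parse


def choice(*parsers: Parser) -> Parser:
    """Try each parser in order and return the first success."""
    def parse(text: str, pos: int):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse


def whole_file(item: Parser) -> Parser:
    """Apply `item` repeatedly until the input is consumed; fail otherwise."""
    def parse(text: str, pos: int):
        items = []
        while pos < len(text):
            result = item(text, pos)
            if result is None:
                return None
            value, pos = result
            items.append(value)
        return items, pos
    return parse


fn_decl = declaration("fn")
type_decl = declaration("type")
handler_decl = declaration("handler")

# Two top-level parsers assembled from the same pieces.
script_parser = whole_file(choice(fn_decl, type_decl))
canvas_parser = whole_file(choice(fn_decl, type_decl, handler_decl))

source = "type User\nhandler /hello\nfn greet"
canvas_result = canvas_parser(source, 0)  # accepts all three declarations
script_result = script_parser(source, 0)  # None: handlers are not allowed
```

The Script and Canvas parsers differ only in which alternatives are passed to `choice`, so no 'header' or mode flag has to be threaded through a single monolithic grammar.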

I suspect we'd still need a tree-sitter parser around, for highlighting and such in contexts outside of our VS Code plugin.

Am I forgetting a big reason why we chose tree-sitter rather than exploring writing a parser in Dark/F#? Or maybe we've just learned more since and it makes sense to reconsider? Maybe we're making ParsedNode -> WrittenTypes more complicated than it needs to be?

Paul's response:

As I recall, the reasons to use tree sitter:

  • performance
  • ability to adapt to use in existing syntax highlighting frameworks and therefore reuse the definition

I would add that parser combinator frameworks are, afaik, possibly not powerful enough for real programming languages. But I could be wrong on that note.

I don't think there's anything to do here right now, and we're close enough to a successful use of tree-sitter that we'll be able to abandon our old F#-based parser. Still, I think it's worth reflecting more on whether we're doing the right thing fundamentally.