Component: Parsing, Pretty-Printing

This Issue exists to collect many items that relate to Dark's parser(s), pretty-printer(s), name resolution, etc.

Here's our current state:

- in dark-classic, we didn't have a parser used for user code
- that said, we did have a hacky parser used internally, for running many tests stored in .dark test files
- that parser was a simple wrapper around F#'s parser, and so our syntax was limited somewhat by what the 'upper' parser could handle

These are tasks currently available to be worked on:

[ ] plug tree-sitter-tests-formatter into our repository, to (auto-)format the test files at tree-sitter-darklang/test/corpus, and fail in CI upon seeing unformatted tree-sitter test files (note: this task is probably the lowest-hanging fruit here, with no blockers)
[ ] generally, expand the tree-sitter grammar to match our language
- [ ] each expansion requires companion work in the Dark code that consumes the resultant tree-sitter nodes (at time of writing, parser.dark)
[ ] restrict usage of Builtins, so that only specific stdlib package functions may call upon them
[ ] support aliases to unambiguously refer to package items while also presenting succinct code
[ ] lots of formatting improvements
[ ] try building tree-sitter and tree-sitter-darklang together. we could consolidate some code, etc
[ ] get our parser to a point where it's easily usable by folks other than ourselves
[ ] revisit https://github.com/darklang/dark/pull/5381#issuecomment-2147575917

Once the tree-sitter grammar and parser have 'caught up' with our full language:

[ ] throw away the F#-wrapper parser entirely

Once that is done, we can tackle the fun stuff:

[ ] add ! and ? to the language, to assist with ergonomic error-handling
[ ] refer to package items with a @paul.module1.module2-like syntax, rather than PACKAGE.Paul.Module1.Module2
[ ] prevent conflicts of type names
- e.g. users shouldn't be allowed to define a List type
- in addition to preventing conflicts with existing types, prevent conflicts with keywords and other reserved words as well (e.g. Set)
- potentially something in the name resolver
- or maybe we allow users to use whatever type names they want, and deal with things closer to how Unison does
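The name-conflict check above could live in (or near) the name resolver. Here is a minimal sketch in Python, purely for discussion: the function name and both sets are hypothetical, and the only entries come from the examples in this issue (List and Set). Nothing here reflects existing Dark code.

```python
from typing import Optional

# Assumption: only the two examples named in this issue are listed here.
EXISTING_TYPE_NAMES = {"List"}
RESERVED_WORDS = {"Set"}


def check_type_name(name: str) -> Optional[str]:
    """Return an error message if `name` can't be used for a user-defined type, else None."""
    if name in EXISTING_TYPE_NAMES:
        return f"'{name}' conflicts with an existing type"
    if name in RESERVED_WORDS:
        return f"'{name}' is a reserved word"
    return None
```

The Unison-style alternative in the last bullet would drop this check entirely and disambiguate names some other way, so this sketch only covers the "reject at definition time" option.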

All of these tasks are worth some discussion, either here or in Discord, before starting.

Copying this from some thoughts I posted on Discord recently:

tl;dr: is tree-sitter really the best tool for our parser, or should we reconsider writing a parser combinator thing in Darklang?

The way we're currently set up for the new/tree-sitter parser is:

A. write Darklang source code

B. use tree-sitter and tree-sitter-darklang to parse to tree-sitter's internal representation of the syntax tree

C. map that to a Dark type, "ParsedNode", via a built-in function (the type: https://github.com/darklang/dark/blob/a68b808eb35d671e3921ce30ca357a67e166a995/packages/darklang/languageTools/parser.dark#L11-L27; the builtin fn: https://github.com/darklang/dark/blob/a68b808eb35d671e3921ce30ca357a67e166a995/backend/src/BuiltinExecution/Libs/Parser.fs#L37)

D. map ParsedNode to WrittenTypes

those WrittenTypes are used:

to map to ProgramTypes, where relevant

to map to semantic tokens, for VS Code syntax highlighting
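To make the C -> D step and the two uses of WrittenTypes concrete, here's a rough sketch in Python (not Dark). Every type and function name below is a stand-in I made up for illustration; the real ParsedNode and WrittenTypes definitions live in parser.dark and are richer than this. It only handles integer literals and one infix expression.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Range = Tuple[int, int]


@dataclass
class ParsedNode:
    # Hypothetical shape: a generic tree node handed back by the builtin
    kind: str
    text: str
    start: int
    end: int
    children: List["ParsedNode"] = field(default_factory=list)


@dataclass
class WrittenInt:
    range: Range
    value: int


@dataclass
class WrittenInfix:
    range: Range
    left: "Written"
    op: str
    op_range: Range
    right: "Written"


Written = Union[WrittenInt, WrittenInfix]


def to_written(node: ParsedNode) -> Written:
    """Step D: turn the generic tree into typed nodes that keep source ranges."""
    if node.kind == "int":
        return WrittenInt((node.start, node.end), int(node.text))
    if node.kind == "infix":
        left, op, right = node.children
        return WrittenInfix((node.start, node.end), to_written(left),
                            op.text, (op.start, op.end), to_written(right))
    raise ValueError(f"unsupported node kind: {node.kind}")


def to_program(w: Written):
    """Use 1: drop source ranges to get a ProgramTypes-like value."""
    if isinstance(w, WrittenInt):
        return ("Int", w.value)
    return ("Infix", w.op, to_program(w.left), to_program(w.right))


def to_tokens(w: Written) -> List[Tuple[Range, str]]:
    """Use 2: keep only ranges plus a token type, for syntax highlighting."""
    if isinstance(w, WrittenInt):
        return [(w.range, "number")]
    return to_tokens(w.left) + [(w.op_range, "operator")] + to_tokens(w.right)


# The tree a parser might return for the source text "1 + 2"
tree = ParsedNode("infix", "1 + 2", 0, 5, [
    ParsedNode("int", "1", 0, 1),
    ParsedNode("operator", "+", 2, 3),
    ParsedNode("int", "2", 4, 5),
])
written = to_written(tree)
```

The point of the intermediate layer is that `to_program` and `to_tokens` both read from the same typed tree, so ranges only have to be threaded through once.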

I've been questioning whether depending on tree-sitter for all of our parsing is a good idea.

An alternative would be to write the parser in Darklang instead, potentially as a wrapper around (or equivalent of) Farkle or FParsec, via minimal Builtins. (relevant links:

https://github.com/stephan-tolksdorf/fparsec

https://teo-tsirpanis.github.io/Farkle (seems to be better for us than FParsec, per https://teo-tsirpanis.github.io/Farkle/choosing-a-parser.html)

https://www.youtube.com/watch?v=RDalzi7mhdY not expecting you to watch this, but a good talk on the subject.)

Here are some potential trade-offs to consider:

the current A->B step:

requires us to build tree-sitter as a .so, as well as our grammar's .so. This is all set up now, but takes a few seconds, especially in CI.

requires our cli app to be ~1MB larger, to package those .sos along with our exe

requires a fancy extract-and-load setup to use both of those at run-time

the current C->D step:

is pretty complicated, and involves some fragile code. there might be abstractions available here we haven't yet discovered, but it's a bit rough.

see https://github.com/darklang/dark/blob/main/packages/darklang/languageTools/parser.dark

we're broadly missing out on immediate feedback throughout the process. We wait for the parser to be built, and have to follow each grammar change with matching ParsedNode -> WrittenTypes functions. And every grammar upgrade depends on a full build/release cycle, waiting for CI etc, to get things to users

I've no clear path forward on versioning the parser with our language in a reasonably seamless way, whereas an in-Dark solution would allow us to properly version the parser fns like anything else in the package manager.

our current setup provides only one big parser for a 'file', but what if we want to allow/disallow different parseable things depending on whether we're parsing a Canvas, a Script, etc.? I've been hoping we'd figure out a proper solution for that eventually, but everything I've come up with so far feels like a hack (e.g. passing a 'header' to the tree-sitter grammar where we). I think the composability of a parser combinator would prepare us for these scenarios much better.
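Roughly what I mean by composability, sketched in Python rather than Dark. All names and the one-line "syntax" below are invented for the example (this is not real Dark syntax, and not the API of Farkle or FParsec). Each parser is a function from `(text, pos)` to `(value, new_pos)`, or `None` on failure, and the Script and Canvas parsers are assembled from the same pieces.

```python
import re
from typing import Any, Callable, Optional, Tuple

Parser = Callable[[str, int], Optional[Tuple[Any, int]]]


def line(keyword: str, tag: str) -> Parser:
    """Match one line starting with `keyword `; yields (tag, line text)."""
    compiled = re.compile(re.escape(keyword) + r" [^\n]*\n?")

    def parse(text: str, pos: int):
        m = compiled.match(text, pos)
        if m is None:
            return None
        return ((tag, m.group(0).strip()), m.end())

    return parse


def choice(*parsers: Parser) -> Parser:
    """Try each parser in order; the first success wins."""
    def parse(text: str, pos: int):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None

    return parse


def many_until_eof(item: Parser) -> Parser:
    """Apply `item` repeatedly; fail unless the whole input is consumed."""
    def parse(text: str, pos: int):
        items = []
        while pos < len(text):
            result = item(text, pos)
            if result is None:
                return None
            value, pos = result
            items.append(value)
        return (items, pos)

    return parse


# Shared building blocks (made-up surface syntax)
type_decl = line("type", "type")
fn_decl = line("let", "fn")
http_handler = line("handler", "handler")

# Different 'file' kinds allow different top-level items
script_file = many_until_eof(choice(type_decl, fn_decl))
canvas_file = many_until_eof(choice(type_decl, fn_decl, http_handler))

source = "type Id = Int64\nhandler GET /hello\n"
```

With this, `canvas_file(source, 0)` succeeds while `script_file(source, 0)` returns `None`, because handlers aren't in the Script parser's list of allowed items. No 'header' trick needed; the difference is just which pieces get combined.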

broadly, it feels like we're doing (more than) double-work: we're writing the grammar.js, which builds into a parser, and writing a bunch of "parser.dark code" to map that back to WrittenTypes.

I suspect we'd still need a tree-sitter parser around, for highlighting and such in contexts outside of our VS Code plugin.

Am I forgetting a big reason why we chose tree-sitter rather than exploring writing a parser in Dark/F#? Or maybe we've just learned more since and it makes sense to reconsider? Maybe we're making ParsedNode -> WrittenTypes more complicated than it needs to be?

Paul's response:

As I recall, the reasons to use tree sitter:

performance

ability to adapt to use in existing syntax highlighting frameworks and therefore reuse the definition

I would add that parser combinator frameworks are, afaik, possibly not powerful enough for real programming languages. But I could be wrong on that note

I don't think there's anything to do here, and we're close to a successful use of tree-sitter such that we'll be able to abandon our old F#-based parser, but I think it's worth reflecting more on whether we're doing the right thing fundamentally.

Component: Parsing, Pretty-Printing #5259