Explore error-correcting parser (incomplete source files)

matthewhammer commented 3 years ago

Perhaps this exploration can begin with a list of examples? (What concrete partial programs today would be rejected, but accepted by this parser?)

matthewhammer commented 3 years ago

Related idea? https://hazel.org/

kritzcreek commented 3 years ago

I'd like to have a parser I could use for all of our tooling needs and ideally in the compiler itself.

The tooling I/We'd like to see:

Syntax Highlighting
IDE refactorings/queries
Code formatter
Documentation generator
Parser errors in the compiler

In order to support these use-cases a parser should match the following criterias:

Functional requirements:

Incremental (Syntax highlighting, IDE)
Error-recovering (Syntax highlighting, IDE)
Lossless (keeps Whitespace & Comments) (Formatter, Documentation Generator, IDE Refactorings)
Parent-Pointers (IDE)

Technical requirements:

"Generic" tree datastructure underneath a well typed AST view onto it
Speed, reasonable amount of allocation
Well tested/easily testable
Matches our existing Merlin grammar

We're not the first language to have these requirements/wishes, so there's lots of previous/current work to take inspiration from. In my opinion the most promising approach seems to be using a Red-Green Tree, as in Roslyn/Swift/rust-analyzer. I'm happy to go into more detail on these but I want to keep this comment focused on the high-level plan.

I don't know how to get good incrementality and error recovery without using a handwritten parser yet. None of the parser generators I've used so far hold up, tree-sitter is great in incrementality but error recovery is weaker than I'd like.

I also only know of libraries for these things in either Rust (rowan) or C++ (tree-sitter). So we'd need to either write libraries for these things in OCaml, marshal the ASTs around, which might be expensive, or write a second parser for Motoko. Not that that's necessarily a bad thing.

Current status:

I'm working on building a parser for a very small language using the rowan library on my own time. If this turns out to be a viable path I'll pitch it to the team with more details and an actual implementation plan and estimate. At that point I'd also love to find one or two additional folks to join me in the effort (happy to experiment together with others right now, but I don't really have a plan yet 😸 ).

References:

Red-Green-Trees:
- High-level overview: https://ericlippert.com/2012/06/08/red-green-trees/
- Implementation/Performance: notes https://stackoverflow.com/questions/10417169/are-roslyn-syntaxnodes-reused/10417510
- Swift-Syntax design: https://github.com/apple/swift/blob/5e2c815edfd758f9b1309ce07bfc01c4bc20ec23/lib/Syntax/README.md
- rust-analyzer design: https://github.com/rust-analyzer/rust-analyzer/blob/master/docs/dev/syntax.md
rowan - https://github.com/rust-analyzer/rowan
tree-sitter - https://tree-sitter.github.io/tree-sitter/
- Problems with error recovery in tree-sitter: https://rust-analyzer.github.io/blog/2020/09/16/challeging-LR-parsing.html

chenyan-dfinity commented 3 years ago

Some random thoughts on incrementalization:

Lark: https://github.com/lark-exploration/lark/blob/master/README.md

Lark, both as a tool and as a language, was designed to treat both the base-line compiler and interactive code editing as first class concerns.

It seems to be a very similar goal as you outlined here. For the incremental part, they are using the salsa framework: https://crates.io/crates/salsa (and you will see a familiar name gets mentioned on their page :) )

Can we formulate the parsing/type checking problem in datalog? Incremental evaluation is much easier there (with tools like Soufflé or DDLog), but I guess error recovery is much worse.

kritzcreek commented 3 years ago

Some random thoughts on incrementalization:
* Lark: https://github.com/lark-exploration/lark/blob/master/README.md
Lark, both as a tool and as a language, was designed to treat both the base-line compiler and interactive code editing as first class concerns.

It seems to be a very similar goal as you outlined here. For the incremental part, they are using the salsa framework: https://crates.io/crates/salsa (and you will see a familiar name gets mentioned on their page :) )
* Can we formulate the parsing/type checking problem in datalog? Incremental evaluation is much easier there (with tools like Soufflé or DDLog), but I guess error recovery is much worse.

Oh yeah I'm aware of all of those ;) But thanks! My primary goal here isn't incrementality. Parsing is fast enough that we can get away with coarse grained heuristics, and writing a parser/CST/AST like this is a big enough task on its own. Just doing the parser gives us the ability to work on lots of tooling and it's a somewhat self contained project, so I'd like to keep it focussed.

osa1 commented 3 years ago

I'm curious about the current status of this? Were you able to make progress @kritzcreek?

I also only know of libraries for these things in either Rust (rowan) or C++ (tree-sitter).

tree-sitter has a Rust API too and it works great.

However, as you say, error recovery of tree-sitter is not great. A while ago I asked how tree-sitter compares with rust-analyzer's hand-written parser to the author or rust-analyzer, he said tree-sitter's correction is worse than their hand-written one. He also showed one or two examples. I don't remember where the discussion was though so can't link it. Probably on rust-lang.zulipchat.com.

I think their position is that you can't beat hand-written parsers for error recovery and reporting, though implementing a parser from scratch is often too painful/requires a lot of work.

Re: incrementality: I think you mean an API like fn insert(current_ast: AST, position, text) -> AST, right? (similarly for deleting) I have no idea if rust-analyzer has an incremental parser, but this problem seems very difficult to me. If anyone here know any parsers that do this I'd like to take a look.

Also, did anyone try to do parsing (any language, doesn't need to be Motoko) with incremental computation frameworks/libraries like Adapton or salsa? Shouldn't be too difficult to try it out on a simple language, I've been meaning to do it for a while now. Perhaps I can give it a try now.

osa1 commented 3 years ago

For incremental parsing I guess we'll also need incremental lexing, right? Any thought on that? Or do you plan on lexing from scratch on every change, taking the diff from previous token list, and pass the changes to the parser somehow? (maybe I completely misunderstood what you mean by incrementality)

kritzcreek commented 3 years ago

For the purpose of an IDE relexing the full file is fast enough, so I don't care about that yet, but if we ever do care it should be possible to add later on.

My main priority for this task right now is to get a lossless, error recovering parser that produces the equivalent of a Red-Green-Tree (and to then use that to get semantic syntax highlighting, and context when triggering autocompletions).

kritzcreek commented 3 years ago

@osa1 Sorry, I was on my phone and completely missed your first comment.

I'm curious about the current status of this? Were you able to make progress @kritzcreek?

I did get started on a parser using the rowan library as I said before. It parses the grammar for types, and I've created the "Red-tree" nodes, so I have an API to access and traverse the generated tree. Next I'd like to try a few different approaches to error-recovery. Claudio showed me the Hack parser (handwritten in Rust), which keeps a context stack around and uses it for recovery and I'd like to give that a try.

It's in a private repo of mine, but it doesn't have to be. It's not even good enough for a PoC yet so I didn't want to move it into the DFINITY org, but I'm happy to add you as a contributor, or just make it a public repo of mine.

I think their position is that you can't beat hand-written parsers for error recovery and reporting, though implementing a parser from scratch is often too painful/requires a lot of work.

I generally agree with matklad, and having worked with IntelliJ's PSI (Program Structure Interface, their version of Red-Green-Trees) I can see where lots of his ideas originate.

Re: incrementality: I think you mean an API like fn insert(current_ast: AST, position, text) -> AST, right? (similarly for deleting) I have no idea if rust-analyzer has an incremental parser, but this problem seems very difficult to me. If anyone here know any parsers that do this I'd like to take a look.

rust-analyzer has something like this, where it reparses everything inside the closest matching { } that enclose the change. The module comment in here describes it pretty well: https://github.com/rust-analyzer/rust-analyzer/blob/7f12a1f225c7d3397f27964ce039b55d680772d3/crates/syntax/src/parsing/reparsing.rs

Their primary use-case is actually to reparse a block after inserting a dummy identifier, for auto-completions. It's a trick from IntelliJ, where you turn Foo. into Foo.intelliJRulezz which repairs the expression and then lets you search for intelliJRulezz in the AST to get the users cursor and the relevant context. They figured out that that's easier than to make the parser recover from missing identifiers like that.

osa1 commented 3 years ago

Thanks @kritzcreek

rust-analyzer has something like this, where it reparses everything inside the closest matching { } that enclose the change. The module comment in here describes it pretty well: https://github.com/rust-analyzer/rust-analyzer/blob/7f12a1f225c7d3397f27964ce039b55d680772d3/crates/syntax/src/parsing/reparsing.rs

I didn't read line by line but it looks simple enough. Do you know if they also do incremental re-type-checking (or borrow checking) after e.g. inserting an identifier or function call?

It's in a private repo of mine, but it doesn't have to be. It's not even good enough for a PoC yet so I didn't want to move it into the DFINITY org, but I'm happy to add you as a contributor, or just make it a public repo of mine.

I'm afraid I won't have time (RTS stuff + may have to work on canisters next year, not sure), but I'd like to learn more about IDE support. Would be great if I could spend maybe a day or two a week on this project (assuming we want to continue with this, provide good IDE support).

Just to make sure I understand, you're implementing a hand-written parser using rowan for the AST, right? What kind of edits you want to support, and what kind of feature do you want to provide with that parser? Do you want to provide auto-completion? Or fast type checking? Or something else?

kritzcreek commented 3 years ago

Oh sorry, I completely missed the last paragraph:

Just to make sure I understand, you're implementing a hand-written parser using rowan for the AST, right?

Yes

What kind of edits you want to support, and what kind of feature do you want to provide with that parser? Do you want to provide auto-completion? Or fast type checking? Or something else?

The first features I'll provide don't need a typechecker for partial ASTs, as that's a whole project of its own... I'll start with things like:

Syntax Highlighting
Source code formatting
Querying for context at a given source position The current language server could really use the ability to ask for the namespace (type/value) under cursor to give better contextual completions).
Maybe smaller non-type aware edits like flipping if's around, or sorting imports etc.

dfinity / motoko

Explore error-correcting parser (incomplete source files) #1935