Error tolerant Parsing and Type Checking to improve editor support

calebh / Juniper

https://www.juniper-lang.org/

MIT License

79 stars 9 forks source link

Error tolerant Parsing and Type Checking to improve editor support #11

Open atadi96 opened 1 year ago

atadi96 commented 1 year ago

Hello!

The language looks awesome! I just started using it but I didn't find editor support except for syntax highlighting, so I started to create a Language Server Protocol implementation and a VS Code extension which you can check out in my fork of this repo. I think it already gives a pretty nice basic programming experience :)

So far I've tried to make as little modification to the original compiler as possible but I think I've more or less exhausted the LSP features that can be implemented this way. If you like these changes I made to the compiler, I can create a PR.

In order to support more LSP features for the language (completion suggestions in lots of places, semantic highlighting, types on hover, go to definition, etc...) there's a need for a more error tolerant parser and a type checker that can report multiple errors at once or give incomplete (but useful) information on a file that has syntax or type errors. Stretch goal could be some kind of incremental parsing/type checking.

I'm happy to already start working on extending the parser and the type checker this way. I'm planning to make changes including but probably not limited to:

Parser that records all text and positions in the AST (even keywords, comments, etc, useful for semantic highlighting)
Parser that recovers from syntax errors that happen typically when you type new code.
Type checker that can work with an AST that contains parse errors.
Type checker that never or rarely throws an exception (therefore records type errors in some other way)
Type checker that can output multiple constraint errors instead of throwing on the first one.
Type checker that keeps track of type variables' and capacity variables' original names (for more readable type strings)
(?) Type checker that can use the results of the last type check, and only recheck the AST that has changed
General refactoring, e.g. to reduce warnings in the code, or adding strong aliases e.g. type ModuleName = ModuleName of string
... I know I must be already forgetting something

Any change/progress I make I would be happy to push back to this repo, but I'm not confident that it's useful for you to accept the changes. Ideally, nothing I do will change e.g. the typing rules (*for well typed programs anyway) but the details of the actual implementation could differ greatly. Maybe it's important that you're closely familiar with the codebase in case you need to work on a new version or a new paper, so a big change would cause difficulties in these areas. On the other hand, if you accept the changes and implement any future modifications on top of them, then the improvements would be instantly available to the Language Server implementation and any editor extensions. I'd ask you to share your viewpoint on this.

So in summary, I'd like to extend the parser and the type checker in order to implement more Language Server Protocol features for an enhanced editing experience of Juniper source code. Or that's my plan anyway, which I'm not sure exactly how far will get in reality.

I'd like to hear your thoughts on the topic generally, and on the possible joint future of the original compiler and the LSP version. Thanks in advance!

calebh commented 1 year ago

Yes, contributions by the community are very welcome! Just a heads up that I'm in the middle of moving to a new apartment, and will likely be getting set up over the next few days (I just got Internet access today). I did look into language server implementation a few weeks ago, but like you say the main issue is with the parser. I looked into using Tree-sitter for parsing, but the problem is that there are no .NET bindings for that library. Do you have an alternative parsing solution in mind to replace the current parser?

Once I get some free time in the next few days I'll take a look at your code. One thing you'll want to check is the jun4 branch in the main repo, it contains all the most recent commits. I have been dissatisfied with the syntax for quite some time, namely the use of the "end" keyword to close certain expressions and use of parentheses for sequence expressions. Check out the junstd on the jun4 branch to get up to speed on some of the syntax changes. The other big changes are allowing type variables and capacity variables to be arbitrarily ordered in type expressions, and the removal of explicit template declaration for functions. Also type variables no longer require ticks in the front. For full details of all the changes see the commit messages.

Patching the type checker to be compatible with LSP will not be too hard, currently entire strongly connected components are typechecked individually. The hardest part I think will be the parser.

Another thing to consider is possibly bootstrapping the compiler, abandoning the F# compiler and rewriting it in Juniper. Juniper can call C++ code, but there aren't even Tree-sitter bindings for C++! Not to mention that the type inference engine uses a C# CAS (computer algebra system) library for the capacities. We would also have to allow recursive types for the abstract syntaxes trees. That means dealing with garbage collection or making cycles via references impossible to construct. Basically Juniper would cease to be a DSL for Arduino/embedded programming and would become a fully featured functional programming language. I'm not sure if there's really space in the community for yet another functional programming language, although it may broaden the audience from just Arduino enthusiasts.

Regarding losing the names of type variables and capacity variables: this happens when the compiler needs to typecheck things like mutually recursive functions, each of which are generic and use type variables of the same name. To distinguish between the variables of the same name in the different functions you need to rename them. After the typechecking is done, I think it would be okay to change the names back. Right now there is a silly workaround where the user given names are set in the body of the functions via C++ using statements.

I have also been working on a new website, which I don't think I've posted yet. The new design looks better than the old one, but it's not really a masterpiece.

Looking forward, I should have more time to work on Juniper now that I'm finished with my Master's degree. The only wildcard is how much of my attention my new job will take.

calebh commented 1 year ago

I see now that Tree-sitter generates a parser written in C, so this seems slightly more feasible. I'm going to talk to someone I know from undergrad who did his PhD thesis on garbage collection to see if there's any feasible niche that Juniper could fill by becoming a fully fledged functional programming language.

atadi96 commented 1 year ago

Thanks for taking the time to reply even in the middle of moving!

Also thanks for the heads-up about the jun4 branch. I haven't looked at it yet and I'm continuing to work based on the main branch. Based on what you wrote I hear that you're mostly working on syntax changes so I hope that an LSP for version 3 can be updated to version 4 without fundamental changes.

My idea for the error tolerant parser is heavily based on this repo: https://github.com/terrajobst/minsk . Basically it's a hand-written recursive descent parser that has two recovery tactics: it skips over lexical errors, and inserts expected but missing tokens at the syntax level. My plan is to write a similar parser for Juniper's syntax. I just did a tiny proof-of-concept for parsing the following syntax so far:

<module>            ::= "module" <id> [<open>];
<open>              ::= "open" "(" [<id> {"," <id>}] ")";

You can find all of the relevant code in the Parsing folder.

I wanted to rely on FParsec as much as possible but in the end I could only use it for lexing... Seems overpowered but at least it's concise and performant for that stage, I guess. I did not find a way to reliably integrate the token inserting behavior into FParsec parsers directly, so the parser must be handwritten like in the example repo. My issue was that the parser is in many cases looking at the current token without consuming it, e.g. it checks the token kind to decide which sub-parser should then actually consume the token. I haven't find a way to do this without constant backtracking on the stream, but it might be out there somewhere.

Did you get any info about the garbage collection topic? Btw, if embedded systems are still in focus, what do you think about some kind of linear typing, or uniqueness typing like in e.g. Clean? I'm not too familiar with those type systems but as far as I know they help with the performance and memory footprint of the compiled code, which might be useful for the Arduino.

atadi96 commented 1 year ago

... Based on what you wrote I hear that you're mostly working on syntax changes

So now that I checked the branch I'm kinda embarrassed that it didn't click that optional template declarations are not just syntactic sugar... definitely looking forward to the new stuff!

Here's another little material from MS on tolerant parsers: https://github.com/microsoft/tolerant-php-parser/blob/main/docs/HowItWorks.md The minsk repo uses similar techniques but somewhat simplified.

I've implemented the following subset of the Juniper 3 syntax so far:

<module>            ::= "module" <id> [<declaration> {, <declaration>}];
<declaration>       ::= <open>
                     |  <let>;
<open>              ::= "open" "(" [<id> {"," <id>}] ")";
<let>               ::= "let" <id> [":" <ty-expr>] "=" <expr>;
<declaration-ref>   ::= <id> | <module-qualifier>;
<ty-expr>           ::= <declaration-ref>;
<module-qualifier>  ::= <id> ":" <id>;
<expr>              ::= <integer>
                     |  <expr> <binary-op> <expr>
                     |  "(" <expr> ")";
<binary-op>         ::= "+" | "-" | "*" | "/";

and I think the error recovery is looking awesome!

Besides continuing to extend the grammar step-by-step, one of the next steps will be that I write a function to validate that the old and new ASTs are equivalent for syntactically correct code.

The parser code follows minsk's naming conventions in order to easily map between the two and see what's missing but that can change once all the relevant functions from minsk exist in F#, too.

calebh commented 1 year ago

I'm finally settled into my new place and have my desk set up. I have not had a chance to ask my friend from Microsoft Research about the garbage collection yet. I did update the EBNF grammar for Juniper 4 here: https://github.com/calebh/Juniper/blob/jun4/grammar.bnf

Note the primary changes are removal of starting/ending keywords in expressions. The parser is now whitespace sensitive, especially around expression blocks (which are now enclosed using curly braces instead of parens). The example I was thinking of is with function calling/tuples combined with expression blocks:

// This is an expression sequence, where the first expression
// is "foo" and the second is a tuple containing two integers
{
     foo
     (1u8, 2u8)
}

Compare this to:

// This is an expression sequence of one expression. The expression
// is calling a function named "foo" with two input parameters
{
     foo(1u8, 2u8)
}

However some expressions need to be able to expand multiple lines. For example, the following is a function call to foo:

{
     foo(1u8, 
            2u8)
}

Or the pipe expression:

{
     foo |>
     bar()
}

My primary parsing experience is using lex/yacc and parsing combinators. I had a very negative experience with lex/yacc - I found the errors reported by the parser generator to be confusing and difficult to understand. I've had a much better experience with parsing combinators, but the primary issue is that there is no guarantee that your parser handles ambiguity correctly. Parsing combinators do not have a separate lexing step, and do not use tokens. FParsec was never designed to gracefully handle partially completed programs/text, and is the wrong tool for this job. Having used different parsing technology for a while, I'm actually in favor of a lexing (tokenization) step before parsing, it seems like it would more gracefully handle things like whitespace sensitive programs.

atadi96 commented 1 year ago

I just now finished the Trivia part of the lexer and parser! Trivia is basically whitespace, newlines and comments, and there's a list of leading and trailing trivia for each token. Here's the SyntaxTrivia type in the code. Here's more info on Trivia from the Minsk repo.

So this helps to keep track of all whitespace-like stuff around tokens.

For example, if I understand the 3.0 parser correctly, we have the following situation: If we expect an expression, then a:bcd is parsed as <module-qualifier>, while a: bcd, a :bcd and a : bcd are all parsed as <expr> ":" <ty-expr>. This can be encoded with FParsec, but if we have a lexer then all these look like <identifier> ":" <identifier>. However, using Trivia, we can check that a and : don't have any trailing trivia, and then we can apply the <module-qualifier> rule.

I think this is very relevant to the cases that you specify. Just going by the examples, I could describe it this way: inside the { } block, whenever we could apply a left recursive rule on the expression at the current position, but the trailing trivia contains a newline, then it's taken as a separator. So yeah, as you're saying, the separate tokenization step definitely helps with whitespace sensitivity!

So I'm hopeful that the way I'm currently building the parser it will be able to handle this as well. With the latest commit I think all of the general parts of the parser are in place, and the rest is just a matter of adding all the syntax rules.

I'm relieved that you're in favor of a separate lexing and parsing step because that's what I've been working on (because that's what I had source material for). When you have time to check out the new stuff, please share your thoughts on it!

calebh commented 1 year ago

Yes, a module qualifier is something like List:map, with no spaces surrounding the colon. A type must have whitespace as you noted. Note that in the current parser, type constraints are left recursive, ie

42u8 : uint8 : uint8 : uint8

is a perfectly valid expression.

I've started to look through the LSP documentation, and here is a high level overview of how I would implement the LSP.

Keep the untyped AST stored for every .jun file the user has open, or alternatively what is present in the current project.
Keep a dependency digraph for all declarations in all modules this project uses, along with a type signature for each of these declarations.
When the user makes an edit to a declaration, re-parse either the entire module, or if incremental parsing is implemented, re-parse the declaration that changed. Update the dependency graph, keeping in mind that there may be cycles of functions that may have changed and may need to be re-type checked.
For missing expressions, I will add a special "hole" expression, which is used as a placeholder for an expression to be added. The type checker will attempt to infer the type of the hole, which should enhance the usability of some parts of the language server. Thoughts on this? Is it a necessary feature?
When the user makes an edit to a declaration, re-run the type checker on the declaration's SCC to determine if the type of the declaration has changed and then recursively type check all declarations that depend on the changed declaration.

Things like goto, rename, and hovering should be fairly easy to implement and only requires additional book-keeping.

One thing that I would like to maintain is a "functional core, imperative shell" design. The server part of this system will obviously require imperative type programming, but I would like the core parts of the compiler (for example the constraint solver and the type checker) to remain functional.

I am planning on merging the jun4 branch into the main branch sometime soon. We should make sure that you're working with the current jun4 codebase.

calebh commented 1 year ago

@atadi96 Just a heads up that Juniper 4.0.0 has been released. There is some LSP functionality in the codebase, but nothing has been exposed to the outside. Most of what I've added has been related to tracking cross references in the code. The types are defined at https://github.com/calebh/Juniper/blob/master/Juniper/Lsp.fs and the typeof function (which does the typechecking) returns a CrossRefInfo list: https://github.com/calebh/Juniper/blob/master/Juniper/TypeChecker.fs