Redesign nix-parser crate

ebkalderon commented 4 years ago

Since commit 2f2d8cf, efforts have begun to completely redesign the existing lexer and parser, with the new code currently residing in the nix-parser2 directory.

Existing pain points

Lexer

Lexer is not fully lazy when it comes to string interpolations, as it collects all of the interpolated tokens eagerly into a nested Vec. This is a big design kludge also a performance footgun. Ideally, the lexer should be a dumb iterator yielding a flat, lazy sequence of tokens which map directly to spans in the source text.
Each variant of the Token enum carries its own duplicate Span information. This should be factored out into a Token struct which contains both a Span and a TokenKind, which eliminates the need for implementing ToSpan on Token.
Lexer does some work which the parser should be doing, such as unescaping strings, trimming the leading whitespace from '' strings, converting paths into PathBuf values, converting booleans into bool values, etc. There should be a stronger separation of concerns, where the produced tokens only carry span information and little else, and the parser is free to slice and retrieve the source text based on those spans, if it so desires.
Lexer should be 100% lossless. Currently, if a path token contains a trailing slash or a string literal is left unterminated, the lexer yields a Token::Error. This confounds the parser at later stages, causing it to emit perplexing errors such as expected path, found `/foo/bar/` and unexpected token `'`, respectively. Instead, we should still emit Token::Path and Token::String so parsing can continue but mark them as invalid. This matches what rustc_lexer does.

Parser

Error management with nom is very hard to get right. It leads to very convoluted code as well as performance footguns, if you're not careful. Producing high-quality diagnostics from nom combinators is tedious as well.
Writing an incremental parser with nom is very difficult when compared to a hand-written parser. nom is hard-wired to return Incomplete(n) when receiving incomplete input, which is decidedly not what we want when processing a programming language.
Threading state through a nom parser, e.g. an arena allocator or string interner, is very cumbersome due to the restrictive Fn bounds on combinators.
Parser should be 100% lossless. It should not produce an AST, but rather a CST (concrete syntax tree) which preserves all whitespace, comments, and invalid tokens. This would be usable in a code formatter or autocompletion engine, for example. The CST can simplified into an AST at a later stage. See the Roslyn architecture for more details.
Parser currently can't handle spaces inside string interpolations correctly nor escaped ' characters inside indented '' strings.
Implementing overflow arguments, trailing commas, and ellipses in formal function arguments is tricky. Implementing a strategy which yields high-quality diagnostics is even trickier.

Current state

The new lexer present in the master branch still relies on nom because it works very well for making small and efficient lexers. Since the new lexer is lossless and will never produce an error, there's no need to struggle with custom nom error management, and the code for it looks much simpler as a result. The redesigned lexer is currently in an MVP state and is essentially feature-complete, though a few pieces are still missing, most notably a function for checking delimiter balancing.

The new parser will ditch nom and will most likely be hand-written for the sake of better error recovery and higher quality diagnostics. The precise design hasn't been fully fleshed out yet. Instead of producing an AST, the new parser will produce a CST (possibly based on rowan, but could also be hand-written) whose nodes cannot be constructed by hand. However, the CST can be simplified down into an AST, whose nodes can be constructed by hand.

CC @dingxiangfei2009

ebkalderon commented 4 years ago

This should resolve #5, #7, #9, #10.

ebkalderon commented 4 years ago

The new lexer is now considered feature-complete and has been extracted from nix-parser2::lexer into a separate nix-lexer crate.

ebkalderon commented 4 years ago

There is some potential for nom to be usable for the parser after all, given practices outlined in "Syntax error recovery in parser expression grammars" (Medeiros, S. and Fabio Mascarenhas, 2018). I've created a write-up on my blog here.

ebkalderon / nix-language-server