Open ebkalderon opened 4 years ago
This should resolve #5, #7, #9, #10.
The new lexer is now considered feature-complete and has been extracted from nix-parser2::lexer
into a separate nix-lexer
crate.
There is some potential for nom
to be usable for the parser after all, given practices outlined in "Syntax error recovery in parser expression grammars" (Medeiros, S. and Fabio Mascarenhas, 2018). I've created a write-up on my blog here.
Since commit 2f2d8cf, efforts have begun to completely redesign the existing lexer and parser, with the new code currently residing in the
nix-parser2
directory.Existing pain points
Lexer
Vec
. This is a big design kludge also a performance footgun. Ideally, the lexer should be a dumb iterator yielding a flat, lazy sequence of tokens which map directly to spans in the source text.Token
enum carries its own duplicateSpan
information. This should be factored out into aToken
struct which contains both aSpan
and aTokenKind
, which eliminates the need for implementingToSpan
onToken
.''
strings, converting paths intoPathBuf
values, converting booleans intobool
values, etc. There should be a stronger separation of concerns, where the produced tokens only carry span information and little else, and the parser is free to slice and retrieve the source text based on those spans, if it so desires.Token::Error
. This confounds the parser at later stages, causing it to emit perplexing errors such asexpected path, found `/foo/bar/`
andunexpected token `'`
, respectively. Instead, we should still emitToken::Path
andToken::String
so parsing can continue but mark them as invalid. This matches whatrustc_lexer
does.Parser
nom
is very hard to get right. It leads to very convoluted code as well as performance footguns, if you're not careful. Producing high-quality diagnostics fromnom
combinators is tedious as well.nom
is very difficult when compared to a hand-written parser.nom
is hard-wired to returnIncomplete(n)
when receiving incomplete input, which is decidedly not what we want when processing a programming language.nom
parser, e.g. an arena allocator or string interner, is very cumbersome due to the restrictiveFn
bounds on combinators.'
characters inside indented''
strings.Current state
The new lexer present in the
master
branch still relies onnom
because it works very well for making small and efficient lexers. Since the new lexer is lossless and will never produce an error, there's no need to struggle with customnom
error management, and the code for it looks much simpler as a result. The redesigned lexer is currently in an MVP state and is essentially feature-complete, though a few pieces are still missing, most notably a function for checking delimiter balancing.The new parser will ditch
nom
and will most likely be hand-written for the sake of better error recovery and higher quality diagnostics. The precise design hasn't been fully fleshed out yet. Instead of producing an AST, the new parser will produce a CST (possibly based onrowan
, but could also be hand-written) whose nodes cannot be constructed by hand. However, the CST can be simplified down into an AST, whose nodes can be constructed by hand.CC @dingxiangfei2009