taktoa opened this issue 8 years ago
Looks like the BNFC source requires GADTs, DataKinds, etc. (full list here), so JHC is almost certainly not workable.
I think builds using -fvia-C are not as portable as I thought, since the GHC runtime contains assembly code. Not sure about the portability of GHC-generated LLVM bitcode... probably has the same issue as -fvia-C.
GHCJS + Node should still definitely be workable, assuming we are okay with adding Node to our build closure (AFAIK it's available on pretty much every platform, considering that it is based on V8, so bootstrapping shouldn't be a big deal, and its runtime closure is only 30 MB).
cc @Gabriel439
How much time is spent in the Nix parser versus other parts of evaluation?
nix-instantiate has a --parse flag; you could measure that time on the nixpkgs source tree.
Can BNFC track state for our string escaping logic? See https://github.com/Gabriel439/nixfmt/issues/3
@domenkozar Having now sufficiently played around with BNFC, this is an issue. BNFC supports declaration of comment syntax, so I find it reasonably likely that they would be willing to consider adding support for custom string syntax and antiquotation. The main issue is figuring out how to compile it away while still supporting as many backends as possible. I'll take a look at emailing their mailing list.
2: Performance. Sometimes we do parse quite large files that are mainly JSON-like data describing even thousands of packages. I expect these are relatively easy on the evaluator, and it wouldn't be nice if they slowed down significantly. The all-packages.nix case is slightly different: the file is huge (to parse), but typically only small parts of it are actually interpreted, thanks to laziness.
3: Making nix depend on Haskell stuff could be problematic, but IMHO it wouldn't be a significant obstacle if we could lift that dependency in released tarballs (by distributing the generated stuff within; few enough people will want to patch the parser).
x: Some small corners of the syntax do feel unintuitive to me, but I expect most users won't appreciate a BNF-like description. True parser "bugs" seem very rare to me.
How fancy a grammar do we need? I suppose we could write a grammar that can be compiled to C++ by C++ and compiled to Haskell by Haskell.
Since I now realize that Nix does not have a context-free grammar (due to string antiquotation), I think there isn't much merit to having a BNF describing the syntax, unless BNFC gets explicit support for string antiquotation.
That said, I do think there is room for improvement in the Nix expression parser. There are some less difficult changes we could make to it that may still be worthwhile (one of them involves atof). I'm less confident about this suggestion.

> Since I now realize that Nix does not have a context-free grammar (due to string antiquotation)

Hmm? I am quite surprised by this.
At the very least, it requires fancy lexing to be parsed remotely efficiently.
The meaning of " depends on the depth of previously parsed antiquotations, so you need a stack to lex it.
Sure, that makes sense, but that is rather different from being context-sensitive :).
Fancier lexing might work. I expect "somestring${ could be lexed (almost) as "somestring" + followed by a special kind of opening parenthesis. IIRC it shouldn't matter that the } token would be used both for closing this one and for standard attribute sets/maps.
Ericson2314: yeah, you're definitely right; it was one of those situations where my inability to come up with a working BNFC grammar led me to rationalize.
vcunat: I had that thought, but what of ${ used outside strings? e.g.: let bar = "baz"; in foo.${bar}
Hmm, yes, that's also a correct antiquotation :-) If the lexer does something like "somestring${ -> "somestring" + ${, it shouldn't matter that antiquoting (i.e. ${ EXPR }) is also valid in some other cases. In the worst case, some bad usages could be caught during semantic analysis.
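The mode-stack idea can be made concrete with a toy sketch. The following Python tokenizer is invented for illustration (it is not Nix's real lexer, and the token names are made up): a stack records whether we are inside a string or inside an antiquotation brace, which is exactly what decides what " and } mean at each point.

```python
# Toy lexer sketch: NOT Nix's real lexer.  Token names are invented.
# A stack of modes ("string" vs. "brace") tells the lexer whether the
# next '"' closes a string and whether '}' closes an antiquotation.

def lex(src):
    tokens, stack, buf, i = [], [], "", 0
    while i < len(src):
        c = src[i]
        if stack and stack[-1] == "string":
            if src.startswith("${", i):          # antiquotation opens
                if buf:
                    tokens.append(("STR", buf)); buf = ""
                tokens.append(("ANTIQUOTE_OPEN", "${"))
                stack.append("brace")
                i += 2
            elif c == '"':                       # this '"' closes the string
                if buf:
                    tokens.append(("STR", buf)); buf = ""
                tokens.append(("STRING_END", '"'))
                stack.pop()
                i += 1
            else:
                buf += c
                i += 1
        else:
            if c == '"':                         # this '"' opens a string
                tokens.append(("STRING_START", '"'))
                stack.append("string")
            elif c == "}" and stack and stack[-1] == "brace":
                tokens.append(("ANTIQUOTE_CLOSE", "}"))
                stack.pop()                      # back to string mode
            elif not c.isspace():
                tokens.append(("CHAR", c))       # everything else, crudely
            i += 1
    return tokens
```

Note how the same } character is emitted as ANTIQUOTE_CLOSE only when the stack top says we are inside an antiquotation, which matches the observation that a stack suffices for lexing.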
Yeah, we just need context-free, not regular, lexing. This isn't so bad: you can think of s-expression -> Lisp and token tree -> Rust as an analogous composition of context-free grammars.
A "token tree" language for Nix, as @vcunat hints at, could look something like:
token-tree ::= '(' token-tree* ')'
| '[' token-tree* ']'
| '{' token-tree* '}'
| '"' string '"'
| '\'\'' multiline-string '\'\''
| # atomic tokens
string ::= # empty nonterminal
| non-$-or-"-char string
| '$'
| '\$' string
| '$' non-{-or-"-char string
| '${' token-tree* '}' string
multiline-string ::= # empty nonterminal
| non-$-or-\'-char multiline-string
| '\'' non-$-or-\'-char multiline-string
| '$'
| '\'\'${' multiline-string
| '$' non-{-or-\'-char multiline-string
| '${' token-tree* '}' multiline-string
The string stuff is really nothing but escape-sequence handling; until you actually parse an antiquoted token tree, it's regular. I did a fairly ugly and probably not-quite-correct job; somebody with more experience in this area could clean it up.
In practice, I'd hope we could use some sort of macro / parameterized rule to deduplicate string and multi-line-string.
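To make the token-tree shape above concrete, here is a rough, hypothetical recognizer in Python. It handles only the single-line-string fragment (multiline '' strings are omitted), and the function names are invented for this sketch:

```python
# Rough, hypothetical recognizer for the token-tree shape sketched above.
# Only single-line strings are handled; multiline strings ('' ... '') are
# omitted.  Function names are invented for illustration.

def token_trees(src, i=0, closer=None):
    """Consume token trees until `closer` (or end of input); return the
    index just past `closer`."""
    pairs = {"(": ")", "[": "]", "{": "}"}
    while i < len(src):
        c = src[i]
        if c == closer:
            return i + 1
        if c in pairs:                      # nested bracketed token tree
            i = token_trees(src, i + 1, pairs[c])
        elif c in ")]}":
            raise SyntaxError(f"unexpected {c!r} at {i}")
        elif c == '"':                      # string token tree
            i = string_body(src, i + 1)
        else:
            i += 1                          # part of an atomic token
    if closer is not None:
        raise SyntaxError(f"missing {closer!r}")
    return i

def string_body(src, i):
    """Consume a string body up to its closing quote; recurse into
    `${ ... }` antiquotations, which contain ordinary token trees."""
    while i < len(src):
        if src.startswith("${", i):
            i = token_trees(src, i + 2, "}")
        elif src[i] == "\\":
            i += 2                          # skip an escape like \$ or \"
        elif src[i] == '"':
            return i + 1
        else:
            i += 1
    raise SyntaxError("unterminated string")
```

The mutual recursion between token_trees and string_body mirrors the mutual recursion between the token-tree and string nonterminals in the grammar.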
I think this should be a commit hook that generates C++ code to check in, to minimize the closure size.
This discussion also won't get very far until we actually implement a proof-of-concept parser using BNFC, to get actual numbers on how fast it is compared to the current parser.
One other approach is to factor out Nix's lexer logic into a small C library that can be linked into other languages since the lexer contains most of the complexity. This C library would just take an input string and output a stream of tokens. Once you have something that can generate the token stream it's much easier to write a parser in each downstream language.
@Gabriel439 one problem with that is that the optimal way to stream is in fact quite language-specific. There are great things Haskell and Rust do that are just too much of a pain to write in C (not sure about C++).
C can "stream" well enough e.g. by invoking a callback with each token separately. If you inline these calls, it could be rather efficient. EDIT: but we're digressing, I think.
FWIW I've made an attempt at the BNF and LEX in NixIDEA using Grammar-Kit (extendedPin false). They're not perfect though; the lexer keeps a state and I had to manually add yy_*_state, but it might help someone giving BNFC (which I've never used) a go.
I would also really like it if there was one canonical BNF/LEX definition to consume/work with.
BNFC's LBNF grammar targets LALR(1) parser generators, which are not powerful enough to parse Nix. The one-token lookahead cannot decide whether { ... } ought to be an attribute set or a function argument pattern. At the time the parser sees the { character, it needs to know, because both constructs have different syntax, but there's usually no way to know without looking at the token after the closing }. A simple NixExpr.cf grammar that demonstrates the problem is:
-- A simple (incomplete) grammar for the Nix language.
comment "#" ;
comment "/*" "*/" ;
token Symbol letter (letter | digit | '_' | '-')* ;
-- token QString {"''"} char* {"''"} ;
-- token URL letter+ {"://"} (char - [" \n\t;[]{}"])+ ;
-- token Path {"./"} (char - [" \n\t;[]{}"])+ ;
coercions Expr 10 ;
Atomic . Expr10 ::= Atom ;
ConcatList . Expr9 ::= Expr10 "++" Expr9 ; -- right associative
ConcatString . Expr8 ::= Expr8 "+" Expr9 ; -- left associative
Not . Expr7 ::= "!" Expr7 ;
Union . Expr6 ::= Expr7 "//" Expr6 ; -- right associative
Equal . Expr5 ::= Expr5 "==" Expr5 ; -- no associativity
NotEqual . Expr4 ::= Expr4 "!=" Expr4 ; -- no associativity
And . Expr3 ::= Expr3 "&&" Expr4 ; -- left associative
Or . Expr2 ::= Expr2 "||" Expr3 ; -- left associative
Imply . Expr1 ::= Expr1 "->" Expr1 ; -- no associativity
Fun . Expr ::= Pattern ":" Expr;
Record . Expr ::= "{" [Assignment] "}" ;
Atomic . Expr ::= Atom ;
Lit . Atom ::= String ;
Ref . Atom ::= Reference ;
Assign . Assignment ::= Reference "=" Expr ;
terminator Assignment ";" ;
Segment . Reference ::= [Symbol] ;
separator nonempty Symbol "." ;
SimplePat . Pattern ::= Symbol ;
RecordPat . Pattern ::= "{" [Argument] "}" ;
Variable . Argument ::= Symbol ;
VariableDef . Argument ::= Symbol "?" Expr ;
separator Argument "," ;
Record and RecordPat conflict.
It might be interesting to see if we can describe Nix as a parsing expression grammar (PEG), and, if so, evaluate the performance of a parser generated from such a PEG. In discussion with @peti and others, we came to the conclusion that we might be able to get away with finite lookahead if we restrict the maximum length of a Symbol.
Basically, when we encounter a { token (assuming we have tokenized in such a way that ${ is a single token), we face the following possibilities:
- The next token is @. In this case, we are definitely parsing a function parameter list.
- The next token is " (i.e. a string follows). In this case, we are parsing an attribute set.
- The next token is inherit. In this case, we are also parsing an attribute set.
- The next token is a Symbol followed by . or =; then it's an attribute set.
- The next token is a Symbol followed by ? or ,; then it's a function parameter list.

I can't think of any other cases, but I am not an expert on the Nix grammar.
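That decision procedure can be sketched as a tiny classifier. The function and token representation below are invented for illustration; it encodes only the cases listed above, not the real Nix grammar:

```python
# Hypothetical lookahead classifier for the token(s) after an opening '{'.
# It encodes only the cases listed above; it is NOT the real Nix grammar.

def classify_brace(lookahead):
    """`lookahead` is a list of the next few token strings after '{'.
    Returns 'attrset', 'pattern', or 'unknown'."""
    if not lookahead:
        return "unknown"
    first = lookahead[0]
    if first == "@":
        return "pattern"                    # definitely a parameter list
    if first == '"':
        return "attrset"                    # a string key follows
    if first == "inherit":
        return "attrset"
    if len(lookahead) > 1:                  # a Symbol: look one token further
        nxt = lookahead[1]
        if nxt in (".", "="):
            return "attrset"
        if nxt in ("?", ","):
            return "pattern"
    return "unknown"
```

Bounding the Symbol length, as suggested above, is what keeps this lookahead finite.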
I've started implementing a PEG parser for nix based on PEGTL a while ago.
The grammar parses nixpkgs (nixpkgs/**/*.nix); I'm currently working on the AST generation. But I think its performance will be worse than the current flex/bison parser's.
The ambiguity of attribute set and argument set causes backtracking and is of course a performance penalty.
The other situation where backtracking happens rather often is caused by native URL parsing (xxx: could be a function declaration with xxx as parameter name, or the start of a URI, based on the following characters). Ref #1017
I can share the work if someone wants to have a look, but the current state is nowhere near readable :laughing:
@groxxda Please do share, I'd like to see it.
I think the URL parsing can be handled by special-casing the scenario where the URI begins with http, https, or git (not in such a way that it compromises the correctness of the grammar; we can take advantage of the asymmetric choice in PEG to do such optimizations). There's almost always whitespace after the : of a lambda, so that could also serve as a hint to the parser.
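The whitespace hint can be written down as a tiny predicate. This is only the idea from the discussion, not the real grammar: real Nix has to consider the full set of URI characters, and the function name here is invented.

```python
# Hypothetical disambiguation hint for `xxx:`: whitespace (or end of input)
# right after the ':' suggests a lambda like `x: x`; a non-space character
# suggests the colon belongs to a URI like http://example.org.

def colon_hint(src, colon_pos):
    nxt = src[colon_pos + 1 : colon_pos + 2]
    if nxt == "" or nxt.isspace():
        return "lambda"
    return "uri"
```

In a PEG, trying the URI alternative first via asymmetric choice would make this hint an optimization rather than a correctness requirement.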
@taktoa Grammar-Kit parsers are PEGs
Historical note: Once upon a time Nix used SGLR instead of Bison/Yacc. SGLR is a scannerless GLR parser, so lexical syntax is defined in the same formalism as the context-free syntax. I switched to Bison/Yacc because parsing speed wasn't quite good enough (you don't want a command like nix-env -qa to spend a lot of time in parsing). This is too bad, because things like antiquotation are supported very naturally in SDF (the syntax definition formalism used by SGLR).
Here is the last version of the SDF grammar: https://github.com/NixOS/nix/blob/c41a3ec3a9ebf0bfde5fe23fa10d800afe2d9bd9/src/libexpr/nix.sdf
Nix used libsglr in C++, but you can also use it from the command line to get an AST in ATerm format (e.g. OpEq(Var("x"), Int(123))).
@edolstra thanks for the insight :+1: Do you have any (old) benchmarks lying around that show the difference between SGLR vs Flex/Bison?
My current state on the parser implementation with PEGTL is the following:
I think I'm finished with AST generation. For benchmarking, I simply let my implementation parse nixpkgs/**/*.nix.
It would be great to have a reference for how long the current parser takes on the same files. Since the current parser does a few more things than parsing (prefix stripping for multiline strings, variable substitution, attrset building), it's hard to compare performance.
For a comparison I tried nix-instantiate --parse -E nixpkgs/**/*.nix > /dev/null, but that seems to do way more than only parse, because it takes more than a minute on my machine.
I'd be very thankful if you have an idea how I could get more meaningful numbers for the current parser.
I've created a tool that parses Nix files provided on the command line using the current Bison-based parser, but does not print the output expression. Running this command with all of nixpkgs as input gives roughly 2 seconds of runtime, so presumably that's our target for performance.
Currently, nixexpr (the parser @groxxda is working on) seems to parse all of nixpkgs in about 2.3 seconds, but it does not yet desugar its AST; in the current Flex/Bison parser, we do a lot of desugaring and correctness-checking (e.g. for unbound variables) before we leave the parser. I think we should split these phases; it may cost some performance when we evaluate everything, but it will save some performance when we only want to evaluate a small portion of nixpkgs (and this is usually the case).
Also, by splitting parsing into separate parsing and checking phases, we can use the parsed output for things like editor integration, which need to handle partially incorrect input and need an AST that can be pretty printed back into the text it was parsed from without data loss. Plus, having the AST parser and pretty printer be exact inverses makes them easy to test with things like autocheck.
Of course, when you are developing and want these checks to occur at "compile-time", rather than at evaluation time, you can just call nix-instantiate with an appropriate flag, so there's really not much loss.
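The lossless round-trip property is easy to state as a test. Below is a toy illustration (the tokenizer is invented and far simpler than Nix): because every character of the input lands in exactly one token, printing the tokens back is an exact inverse of reading them, which is the shape a real AST parser / pretty-printer pair would be checked against.

```python
import re

# Toy lossless "parser": whitespace runs, single brackets, and runs of
# other characters.  Every input character lands in exactly one token,
# so joining the tokens reproduces the input byte-for-byte.
TOKEN = re.compile(r"\s+|[(){}\[\]]|[^\s(){}\[\]]+")

def parse(src):
    return TOKEN.findall(src)

def pretty(tokens):
    return "".join(tokens)

# The round-trip property: pretty is an exact inverse of parse.
example = "let x = { a = 1; }; in  x"
assert pretty(parse(example)) == example
```

A property-testing library (like the autocheck mentioned above, or Python's hypothesis) would generate many such inputs instead of one fixed example.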
BTW, @groxxda, I'm currently writing a little parser generator in Haskell that takes a PEG grammar and targets PEGTL (and also it tries to approximate a BNFC-like syntax). So far I've written an ADT for a PEG and I'm now mostly done writing a parser for that type (using trifecta).
If you could come up with BNFC-style rules (augmented with asymmetric choice, &, and !) corresponding to your current parser, I think I could generate a PEGTL parser roughly equivalent to the one you've written.
@taktoa Note there is a ABNF -> PEGTL converter written in PEGTL: https://github.com/ColinH/PEGTL/blob/master/examples/abnf2pegtl.cc
@groxxda According to c5baaafae69394082817ede9e6eb3910c4601a72, there was a 80x speedup in switching to Bison/Yacc.
I was about to bring up Earley but apparently #1491 .
That issue is actually about something different: the ease of writing parsers in Nix.
Can we close this issue?
Sure, though at some point it would probably still be a good idea to rewrite Nix's parser.
I completely agree. I think this ticket should be closed solely because I don't believe that keeping it open serves any purpose. That's not to say that the subject matter discussed in this ticket wouldn't be important -- it is!
Intro
I think we probably all agree that it would be nice to have a proper BNF for the Nix language, and that ideally the "official" parser would be compiled from that BNF. Here's why:
- … (e.g. a nix-parse executable that either pretty-prints the AST or outputs it in JSON/XML/s-expression format).

There is a tool that would allow for such a thing, called BNFC. It has a very BNF-like syntax, has many backends (C/C++ via Bison/Flex, Haskell via Happy/Alex, OCaml, LaTeX for documentation), and can even generate pretty-printers.
Obstacles
The obstacles I see to using BNFC for the Nix parser are:
Obstacle 1
Regarding obstacle 1, this is unavoidable but I might be willing to do a large amount of the work here (though it would be nice to have a point of contact on IRC for questions about the Nix internals).
Obstacle 2
Regarding obstacle 2: I think we mainly need an answer to the question "How much time is currently spent in the Nix parser versus other parts of evaluation?". If the answer is "not very much", then the speed of the generated parser will probably not be much of an issue. Otherwise, I would be willing to write a BNFC grammar and compare its speed to the current parser, assuming we agree on a solution for obstacle 3.
Obstacle 3
Regarding obstacle 3, I have thought of a few different options for mitigating the issue (I've put the "cons" associated with each option in bold, and any notable "pros" in italics):
Mitigations
- … nativeBuildInput.
- … -fvia-C, check that into a repo, and have Nix depend on the built version of that code. (-fvia-C may not support the required extensions.)
- … patchelfs in the libraries needed for GHC's runtime (GMP, libc, and libpthread, mainly).

Personally, I would say that:
- … -fvia-C or JHC.

Summary
I might be willing to make the BNF for Nix and do the preliminary work to compare efficiency with the existing parser, iff people think this is a good idea.
The main open questions are:
- … -fvia-C or JHC?