NixOS / nix

Nix, the purely functional package manager
https://nixos.org/
GNU Lesser General Public License v2.1
11.48k stars 1.44k forks source link

Declarative Grammar for the Nix Language #1102

Open taktoa opened 7 years ago

taktoa commented 7 years ago

Intro

I think we probably all agree that it would be nice to have a proper BNF for the Nix language, and that ideally the "official" parser would be compiled from that BNF. Here's why:

There is a tool that would allow for such a thing, called BNFC. It has a very BNF-like syntax and has many backends (C/C++ via Bison/Flex, Haskell via Happy/Alex, OCaml, LaTeX for documentation), and can even generate pretty-printers.

Obstacles

The obstacles I see to using BNFC for the Nix parser are:

  1. The BNF would have to be written. This doesn't seem too hard but someone has to do it. There are a few existing Haskell parsers that one could look at for inspiration, and the Nix expression syntax isn't so complicated anyway. The Nix source code will probably have to refactored to deal with the new intermediate representation, as well.
  2. The parser thus generated could be too slow for nixpkgs – we're already dealing with some major evaluation time problems.
  3. Naively, this would introduce GHC into the compile-time closure of Nix.

    Obstacle 1

Regarding obstacle 1, this is unavoidable but I might be willing to do a large amount of the work here (though it would be nice to have a point of contact on IRC for questions about the Nix internals).

Obstacle 2

Regarding obstacle 2: I think we mainly need an answer to the question "How much time is currently spent in the Nix parser versus other parts of evaluation?". If the answer is "not very much", then the speed of the generated parser will probably not be much of an issue. Otherwise, I would be willing to write a BNFC grammar and compare its speed to the current parser, assuming we agree on a solution for obstacle 3.

Obstacle 3

Regarding obstacle 3, I have thought of a few different options for mitigating the issue (I've put the "cons" associated with each option in bold, and any notable "pros" in italics):

Mitigations

  1. We could have the BNFC-based parser in addition to an in-tree parser that is used as a fallback in case BNFC is not available.
    • Could lead to the BNFC parser falling out of sync with the in-tree parser.
    • We now need to maintain two syntax definitions.
    • Could have test code that uses QuickCheck or similar to generate definitions with the BNFC parser and tests if both parsers output the same AST, though this would not cover cases where syntax was added to the non-BNFC parser but not to the BNFC-based parser.
  2. We could add a commit hook and Travis CI test that forces pre-compilation of the BNFC syntax, and check in the generated parser C++ source as well as the BNF file.
    • Adds nothing to the build/runtime closure of Nix.
    • Could be annoying for developers of the Nix package manager.
  3. We could say "screw it" and have the compiled syntax as a nativeBuildInput.
    • Breaks native builds of Nix on platforms where GHC doesn't build.
  4. We could compile BNFC to C via JHC or GHC -fvia-C, check that into a repo, and have Nix depend on the built version of that code.
    • Adds nothing to the build/runtime closure of Nix.
    • JHC seems unmaintained (last commit is 1 year old).
    • -fvia-C may not support the required extensions.
    • There isn't yet an unregistered build of GHC in nixpkgs, AFAICT.
    • Adds minor maintenance hassle.
  5. We could compile BNFC to JavaScript using GHCJS, check that into a repo, and have Nix depend on that code plus Node.
    • Adds node.js to the build-time closure of Nix.
    • Adds minor maintenance hassle.
  6. We could compile BNFC to LLVM assembly using GHC's LLVM backend, check that into a repo, and have Nix depend on that plus LLVM. This would probably be a good opportunity to switch Nix's build over to Clang, since we are depending on LLVM anyway. Preferably, we'd at least disassemble the LLVM bitcode before checking into a git repo, so that we can diff changes.
    • Adds LLVM and potentially Clang to the build closure of Nix.
    • Adds minor maintenance hassle.
    • Could remove GCC from Nix build closure if we switch to Clang.
    • Clang is a native cross-compiler, so this might speed up cross-compilation of Nix (though I don't think anyone does this?).
  7. We could precompile BNFC (with static linking) and then make a derivation that fetches the binary and patchelfs in the libraries needed for GHC's runtime (GMP, libc, and libpthread, mainly).
    • Adds a non-source package to the build closure of Nix.
    • Adds maintenance hassle.
    • Potentially annoying to make portable.

Personally, I would say that:

I might be willing to make the BNF for Nix and do the preliminary work to compare efficiency with the existing parser, iff people think this is a good idea.

The main open questions are:

  1. Is this a good idea?
  2. How much time is spent in the Nix parser versus other parts of evaluation?
  3. Which of the above mitigations is best for addressing the Nix build closure issue?
    • Can BNFC compile under -fvia-C or JHC?
    • How hard is it to compile Node for non-x86_64 platforms, and how big is its closure?
    • How hard is it to compile LLVM for non-x86_64 platforms, and how big is its closure?
    • Would the developers of Nix be willing to deal with installing BNFC and a commit hook?
taktoa commented 7 years ago

Looks like the BNFC source requires GADTs, DataKinds, etc. (full list here), so JHC is almost certainly not workable.

I think compiles using -fvia-C are not as portable as I thought, since the GHC runtime contains assembly code.

Not sure about the portability of GHC-generated LLVM bitcode... probably has the same issue as -fvia-C.

GHCJS + Node should still definitely be workable, assuming we are okay with adding Node to our build closure (AFAIK it's available on pretty much every platform, considering that it is based on V8, so bootstrapping shouldn't be a big deal, and its runtime closure is only 30 MB).

domenkozar commented 7 years ago

cc @Gabriel439

domenkozar commented 7 years ago

How much time is spent in the Nix parser versus other parts of evaluation?

nix-instantiate has a --parse flag, you could measure that time on nixpkgs source tree.

Can BNFC track state for our string escaping logic? See https://github.com/Gabriel439/nixfmt/issues/3

taktoa commented 7 years ago

@domenkozar Having now sufficiently played around with BNFC, this is an issue. BNFC supports declaration of comment syntax, so I find it reasonably likely that they would be willing to consider adding support for custom string syntax and antiquotation. The main issue is figuring out how to compile it away while still supporting as many backends as possible. I'll take a look at emailing their mailing list.

vcunat commented 7 years ago

2: Performance. Sometimes we do parse quite large files that are mainly JSON-like data describing even thousands of packages. I expect these are relatively easy on the evaluator and it wouldn't be nice if it should slow down significantly. In as slightly different all-packages.nix case, the file is huge (to parse) but typically only small parts of it are actually interpreted, thanks to laziness.

3: Making nix depend on Haskell stuff could be problematic, but IMHO it wouldn't be a significant obstacle if we could lift that dependency in released tarballs (by distributing the generated stuff within; few enough people will want to patch the parser).

x: Some small corners of the syntax do feel unintuitive to me, but I expect most users won't appreciate a BNF-like description. True parser "bugs" seem very rare to me.

Ericson2314 commented 7 years ago

How fancy a grammar do we need? I suppose we could write a grammar that can be compiled to C++ by C++ and compiled to Haskell by Haskell.

taktoa commented 7 years ago

Since I now realize that Nix does not have a context-free grammar (due to string antiquotation), I think there isn't much merit to having a BNF describing the syntax, unless BNFC gets explicit support for string antiquotation.

That said, I do think there is room for improvement in the Nix expression parser. Particularly:

There are some less difficult changes we could make to the Nix expression parser that may still be worthwhile:

  1. I think it would be very useful to have an AST representing the expression exactly as parsed, which would allow for better editor support etc. Ideally we could even have a C library/wrapper that exports our parser and its output AST, thus allowing other languages to bind to our API. We would then have a function that transforms this AST into a simpler intermediate representation, which could then be evaluated.
  2. Since we'd be mostly rewriting the parser anyway, we could switch to something like Boost Spirit and get more intuitive (i.e.: not LALR, which IME breaks most peoples' brains) syntax and maybe faster parse speeds as well (apparently their numeric parsers beat hand-written parsers like atof). I'm less confident about this suggestion.
Ericson2314 commented 7 years ago

Since I now realize that Nix does not have a context-free grammar (due to string antiquotation)

Hmm? I am quite surprised by this.

taktoa commented 7 years ago

At the very least, it requires fancy lexing to be parsed remotely efficiently. The meaning of " depends on the depth of previously parsed antiquotations, so you need a stack to lex it.

Ericson2314 commented 7 years ago

Sure that makes sense, but is a rather different from being context-sensitive :).

vcunat commented 7 years ago

Fancier lexing might work. I expect "somestring${ could be lexed (almost) as "somestring" + followed by a special kind of opening parentheses.

vcunat commented 7 years ago

IIRC it shouldn't matter that the } token would be used both for closing this one and for standard attribute sets/maps.

taktoa commented 7 years ago

Ericson2314: yeah you're definitely right; it was one of those situations where my inability to come up with a working BNFC led me to rationalize

vcunat: I had that thought, but what of ${ used outside strings? e.g.: let bar = "baz"; in foo.${bar}

vcunat commented 7 years ago

Hmm, yes that's also a correct antiquation :-) If the lexer does something like: "somestring${ -> "somestring" + ${, it shouldn't matter that antiquoting (i.e. ${ EXPR }) is also valid in some other cases. In the worst case, some bad usages could be caught during semantic analysis.

Ericson2314 commented 7 years ago

Yeah we just need context-free, not regular, lexing. This isn't so bad: you can think of s-expresion -> lisp and token tree -> Rust as an analogous composition of context-free grammars.

A "token tree" language for Nix, as @vcunat hints at, could look something like:

token-tree       ::= '(' token-tree* ')'
                  |  '[' token-tree* ']'
                  |  '{' token-tree* '}'
                  |  '"' string '"'
                  |  '\'\'' multiline-string '\'\''
                  |  # atomic tokens

string           ::= # empty nonterminal
                  |  non-$-or-"-char string
                  |  '$'
                  |  '\$' string
                  |  '$' non-{-or-"-char string
                  |  '${' token-tree* '}' string

multiline-string ::= # empty nonterminal
                  |  non-$-or-\'-char multiline-string
                  |  '\'' non-$-or-\'-char multiline-string
                  |  '$'
                  |  '\'\'${' multiline-string
                  |  '$' non-{-or-\'-char multiline-string
                  |  '${' token-tree* '}' multiline-string

the string stuff is really nothing but escape sequence handling---until you actually parse an antiquoted token tree its regular. I did a fairly ugly and probably not-quite-correct job---somebody with more experience in this area could clean it up.

In practice, I'd hope we could use some sort of macro / parameterized rule to deduplicate string and multi-line-string.

Gabriella439 commented 7 years ago

I think this should be a commit hook that generates C++ code to check in to minimize the closure size

This discussion also won't get very far until we actually implement a proof-of-concept parser using BNFC to get actual numbers on how fast it is compared to the current lparser

One other approach is to factor out Nix's lexer logic into a small C library that can be linked into other languages since the lexer contains most of the complexity. This C library would just take an input string and output a stream of tokens. Once you have something that can generate the token stream it's much easier to write a parser in each downstream language.

Ericson2314 commented 7 years ago

@Gabriel439 one problem with that is the optimal way to stream is in fact quiet language-specific. There's great things Haskell and Rust do that are just too much of a pain to write in C (not sure about C++).

vcunat commented 7 years ago

C can "stream" well enough e.g. by invoking a callback with each token separately. If you inline these calls, it could be rather efficient. EDIT: but we're digressing, I think.

edwtjo commented 7 years ago

FWIW I've made an attempt at the BNF and LEX in NixIDEA using Grammar-Kit(extendedPin false). They're not perfect though, the lexer keeps a state and I had to manually add yy_*_state, but might help someone giving BNFC(which I've never used) a go.

I would also really like it if there was one canonical BNF/LEX definition to consume/work with.

peti commented 7 years ago

BNFC's LBNF grammar targets LALR(1) parsers generators which are not powerful enough to parse Nix. The one-token look-ahead cannot decide whether { ... } ought to be an attribute set or a function argument pattern. At the time the parsers sees the { character, it needs to know because both constructs have different syntax, but there's usually no way to know without looking at the token after the closing }. A simple NixExpr.cf grammar that demonstrates the problem is:

-- A simple (incomplete) grammar for the Nix language.

comment "#" ;
comment "/*" "*/" ;

token Symbol letter (letter | digit | '_' | '-')* ;
-- token QString {"''"} char* {"''"} ;
-- token URL letter+ {"://"} (char - [" \n\t;[]{}"])+ ;
-- token Path {"./"} (char - [" \n\t;[]{}"])+ ;

coercions Expr 10 ;

Atomic          . Expr10        ::= Atom ;
ConcatList      . Expr9         ::= Expr10 "++" Expr9 ;         -- right associative
ConcatString    . Expr8         ::= Expr8 "+" Expr9 ;           -- left associative
Not             . Expr7         ::= "!" Expr7 ;
Union           . Expr6         ::= Expr7 "//" Expr6 ;          -- right associative
Equal           . Expr5         ::= Expr5 "==" Expr5 ;          -- no associativity
NotEqual        . Expr4         ::= Expr4 "!=" Expr4 ;          -- no associativity
And             . Expr3         ::= Expr3 "&&" Expr4 ;          -- left associative
Or              . Expr2         ::= Expr2 "||" Expr3 ;          -- left associative
Imply           . Expr1         ::= Expr1 "->" Expr1 ;          -- no associativity
Fun             . Expr          ::= Pattern ":" Expr;
Record          . Expr          ::= "{" [Assignment] "}" ;
Atomic          . Expr          ::= Atom ;

Lit             . Atom          ::= String ;
Ref             . Atom          ::= Reference ;

Assign          . Assignment    ::= Reference "=" Expr ;
terminator Assignment ";" ;

Segment         . Reference     ::= [Symbol] ;
separator nonempty Symbol "." ;

SimplePat       . Pattern       ::= Symbol ;
RecordPat       . Pattern       ::= "{" [Argument] "}" ;

Variable        . Argument      ::= Symbol ;
VariableDef     . Argument      ::= Symbol "?" Expr ;
separator Argument "," ;

Record and RecordPat conflict.

taktoa commented 7 years ago

It might be interesting to see if we can describe Nix as a parsing expression grammar (PEG), and, if so, evaluate the performance of a parser generated from such a PEG. In discussion with @peti and others, we came to the conclusion that we might be able to get away with finite lookahead if we restrict the maximum length of a Symbol.

Basically, when we encounter a { token (assuming we have tokenized in such a way that ${ is a single token), we face the following possibilities:

  1. The previous token was @. In this case, we are definitely parsing a function parameter list.
  2. The next token is " (i.e.: a string follows). In this case, we are parsing an attribute set.
  3. The next token is inherit. In this case, we are also parsing an attribute set.
  4. A symbol follows, in which case we have to look past the symbol (and, unless symbols are limited in length, this involves unlimited lookahead).
    • If the next token is . or =, it's an attribute set.
    • If the next token is ? or ,, it's a function parameter list

I can't think of any other cases, but I am not an expert on the Nix grammar.

groxxda commented 7 years ago

I've started implementing a PEG parser for nix based on PEGTL a while ago. The grammar parses nixpkgs (nixpkgs/**/*.nix), I'm currently working on the ast generation. But I think it's performance will be worse than the current flex/bison parser.

The ambiguity of attribute set and argument set causes backtracking and is of course a performance penalty.

The other situation where backtracking happens rather often is caused by native url parsing (xxx: could be a function declaration with xxx as parameter name, or the start of an uri, based on the following characters). Ref #1017

I can share the work if someone want's to have a look, but the current state is nowhere near readable :laughing:

taktoa commented 7 years ago

@groxxda Please do share, I'd like to see it.

I think the URL parsing can be handled by special-casing (not in such a way that it compromises the correctness of the grammar; we can use take advantage of the assymmetric choice in PEG to do such optimizations) the scenario where the URI begins with http or https or git. There's almost always whitespace after the : of a lambda, so that could also serve as a hint to the parser.

edwtjo commented 7 years ago

@taktoa Grammar-Kit parsers are PEGs

edolstra commented 7 years ago

Historical note: Once upon a time Nix used SGLR instead of Bison/Yacc. SGLR is a scannerless GLR parser, so lexical syntax is defined in the same formalism as the context-free syntax. I switched to Bison/Yacc because parsing speed wasn't quite good enough (you don't want a command like nix-env -qa to spend a lot of time in parsing). This is too bad, because things like anti-quotation are supported very naturally in SDF (the syntax definition formalism used by SGLR).

Here is the last version of the SDF grammar: https://github.com/NixOS/nix/blob/c41a3ec3a9ebf0bfde5fe23fa10d800afe2d9bd9/src/libexpr/nix.sdf

Nix used libsglr in C++, but you can also use it from to command line to get an AST in ATerm format (e.g. OpEq(Var("x"), Int(123))).

groxxda commented 7 years ago

@edolstra thanks for the insight :+1: Do you have any (old) benchmarks lying around that show the difference between SGLR vs Flex/Bison?

My current state on the parser implementation with PEGTL is the following: I think I'm finished with AST generation. For benchmarking, I simply let my implementation parse nixpkgs/**/*.nix. It would be great to have a reference, how long the current parser takes for the same files. Since the current parser does a few more things than parsing (prefix stripping for multiline strings, variable substitution, attrset building), it's hard to compare performance. For a comparison I tried nix-instantiate --parse -E nixpkgs/**/*.nix > /dev/null but that seems to do way more than only parse because it takes more than a minute on my machine.

I'd be very thankful if you have an idea how I could get more meaningful numbers for the current parser.

taktoa commented 7 years ago

I've created a tool that parses Nix files provided on the command line using the current Bison-based parser, but does not print the output expression. Running this command with all of nixpkgs as input gives roughly 2 seconds of runtime, so presumably that's our target for performance.

Currently, nixexpr (the parser @groxxda is working on) seems to parse all of nixpkgs in about 2.3 seconds, but currently does not desugar its AST; in the current Flex/Bison parser, we do a lot of desugaring and correctness-checking (e.g.: for unbound variables) before we escape the parser. I think we should split these phases; it may cost some performance when we evaluate everything, but it will save some performance when we only want to evaluate a small portion of nixpkgs (and this is usually the case).

Also, by splitting parsing into separate parsing and checking phases, we can use the parsed output for things like editor integration, which need to handle partially incorrect input and need an AST that can be pretty printed back into the text it was parsed from without data loss. Plus, having the AST parser and pretty printer be exact inverses makes them easy to test with things like autocheck.

Of course, when you are developing and want these checks to occur at "compile-time", rather than at evaluation time, you can just call nix-instantiate with an appropriate flag, so there's really not much loss.

taktoa commented 7 years ago

BTW, @groxxda, I'm currently writing a little parser generator in Haskell that takes a PEG grammar and targets PEGTL (and also it tries to approximate a BNFC-like syntax). So far I've written an ADT for a PEG and I'm now mostly done writing a parser for that type (using trifecta).

If you could come up with BNFC-style rules (augmented with asymmetric choice, &, and !) corresponding to your current parser, I think I could generate a PEGTL parser roughly equivalent to the one you've written.

groxxda commented 7 years ago

@taktoa Note there is a ABNF -> PEGTL converter written in PEGTL: https://github.com/ColinH/PEGTL/blob/master/examples/abnf2pegtl.cc

edolstra commented 7 years ago

@groxxda According to c5baaafae69394082817ede9e6eb3910c4601a72, there was a 80x speedup in switching to Bison/Yacc.

ghost commented 6 years ago

I was about to bring up Earley but apparently #1491 .

taktoa commented 6 years ago

That issue is actually about something different: the ease of writing parsers in Nix.

peti commented 6 years ago

Can we close this issue?

taktoa commented 6 years ago

Sure, though at some point it would probably still be a good idea to rewrite Nix's parser.

peti commented 6 years ago

I completely agree. I think this ticket should be closed solely because I don't believe that keeping it open serves any purpose. That's not to say that the subject matter discussed in this ticket wouldn't be important -- it is!