NixOS / nix

Nix, the purely functional package manager
https://nixos.org/

Add an Earley parser builtin #1491

Open taktoa opened 6 years ago

taktoa commented 6 years ago

Currently we have builtins.match for parsing strings using regular expressions. Perhaps it would be appropriate to also have a builtins.earley function that takes an ABNF (or EBNF) grammar and a string and returns a parsed AST. One major use case would be validating extraConfig-style NixOS module options, since it's often much easier to write a BNF for a configuration language than it is to write separate options for everything.

The Earley algorithm runs in cubic time in the worst case, but in most cases (e.g. nearly all left-recursive LR(k) grammars) it will run in linear time, and the strings we are parsing are usually short anyway, so the asymptotics don't mean much. This is a C++ Earley library that claims to parse 300 kLOC of C per second while allocating only 5 MB to parse 10 kLOC of C.

At the very least, even if we don't make this a builtin, I think it would be a good function to add to nixpkgs: using import-from-derivation when we care about the parsed AST, and using a derivation that simply fails to build if the parse fails when we only want to validate a string. It would be neat to implement a NixOS module option type lib.types.bnf using this.
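
For the validation-only case, something along these lines already works today with a plain derivation. This is just a rough sketch, using jq as a stand-in for a grammar-driven checker (the BNF machinery and lib.types.bnf themselves remain hypothetical):

{ pkgs ? import <nixpkgs> {} }:
let
  # Fails to build (and therefore fails anything depending on it)
  # whenever the given string is not valid JSON.
  validateJSON = name: text:
    pkgs.runCommand "validate-${name}" { inherit text; passAsFile = [ "text" ]; } ''
      ${pkgs.jq}/bin/jq . "$textPath" > /dev/null
      touch $out
    '';
in
  validateJSON "example" ''{ "listen": "127.0.0.1", "port": 8080 }''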

@grahamc rightly noted that it may be a UX problem if we do this validation by default (e.g.: "why can't I add this setting to my X11.conf! nix sucks!"), so perhaps it would be better to make separate options that are validated. In other words, instead of replacing an option like extraConfig with its BNF-validated counterpart, simply add extraConfigValidated as an extra option that sets extraConfig if the validation passes.

I'd be willing to write the patch if it would be merged, though I think it's not clear-cut whether this is the right way to do this feature.

some relevant IRC discussion

copumpkin commented 6 years ago

Any thoughts on what interface builtins.earley would have? It would need to take some sort of Nix representation of the grammar (as a string? a structured value might be cooler) and spit out a representation of the AST. If done properly this could make builtins.fromJSON a lot less magical and give us a few other things like it.
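
For what it's worth, a structured grammar value could be as simple as an ordinary attribute set. The shape below is purely illustrative, not a proposal for the builtin's actual interface:

{
  start = "expr";
  rules = {
    # each nonterminal maps to a list of alternatives, each alternative
    # being a sequence of terminals and nonterminal references
    expr   = [ [ "term" { lit = "+"; } "expr" ] [ "term" ] ];
    term   = [ [ "factor" { lit = "*"; } "term" ] [ "factor" ] ];
    factor = [ [ { regex = "[0-9]+"; } ] [ { lit = "("; } "expr" { lit = ")"; } ] ];
  };
}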

My fear is that this is a bit of a can of worms, and would have to be a Nix builtin to have remotely decent performance in practice. Then we're basically committed to keeping a fairly complicated dependency/lump of code in Nix for a long time.

taktoa commented 6 years ago

Yes, I think the BNF should perhaps be a structured value (then we can write an ABNF parser in terms of that structured syntax, plus a wrapper that takes a string).

copumpkin commented 6 years ago

Or combinators that produce that structured syntax somehow, so we can pretend to be cough certain other languages 😄
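
Something like the following could build such a structured value; the combinator names are made up, just to show the shape:

let
  lit = s: { type = "lit"; value = s; };
  ref = name: { type = "ref"; inherit name; };
  seq = parts: { type = "seq"; inherit parts; };
  alt = choices: { type = "alt"; inherit choices; };
in
{
  digit  = alt (map lit [ "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" ]);
  number = alt [ (seq [ (ref "digit") (ref "number") ]) (ref "digit") ];
  expr   = alt [ (seq [ (ref "number") (lit "+") (ref "expr") ]) (ref "number") ];
}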

taktoa commented 6 years ago

😆 Also, I don't think an Earley parser is particularly complicated. According to sloccount, that library I linked before only has about 2000 lines of actual code in it (400 are yacc and 1600 are C++). The algorithm is pretty simple.

taktoa commented 6 years ago

If implementation complexity is the main concern, derivative parsing is quite simple (at least, when it is implemented in a lazy functional language) and has many of the advantages of Earley. The Racket implementation of derp (the parser library in the paper I just linked) is only 600 lines of code.
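
To give a sense of how small the core is, here is a minimal sketch of the derivative idea in pure Nix, restricted to regular expressions (full context-free derivative parsing as in derp additionally needs laziness, memoization, and fixpoints):

let
  # regex constructors
  empty = { tag = "empty"; };              # matches nothing
  eps   = { tag = "eps"; };                # matches only the empty string
  chr   = c: { tag = "chr"; inherit c; };  # matches the single character c
  alt   = a: b: { tag = "alt"; inherit a b; };
  seq   = a: b: { tag = "seq"; inherit a b; };

  # does r match the empty string?
  nullable = r:
    if r.tag == "eps" then true
    else if r.tag == "alt" then nullable r.a || nullable r.b
    else if r.tag == "seq" then nullable r.a && nullable r.b
    else false;

  # Brzozowski derivative of r with respect to the character c
  deriv = c: r:
    if r.tag == "empty" || r.tag == "eps" then empty
    else if r.tag == "chr" then (if r.c == c then eps else empty)
    else if r.tag == "alt" then alt (deriv c r.a) (deriv c r.b)
    else # seq
      let d = seq (deriv c r.a) r.b;
      in if nullable r.a then alt d (deriv c r.b) else d;

  # match by repeatedly taking derivatives, one character at a time
  matches = r: s:
    let
      len = builtins.stringLength s;
      go = i: r':
        if i >= len then nullable r'
        else go (i + 1) (deriv (builtins.substring i 1 s) r');
    in go 0 r;
in
  matches (seq (chr "a") (alt (chr "b") (chr "c"))) "ac"   # => true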

aszlig commented 6 years ago

This would be quite nice to have, and I've had several use cases for this in the past.

One example would be one of my deployments, where I turn configuration values into Erlang terms. So far I've written this in Nix, but it would be way cleaner and less error-prone done the way you're suggesting.

In this example I'm using it to convert IPv4/IPv6 addresses given in option definitions into the internal Erlang format; doing this via import-from-derivation (which I did before) would involve running a bunch of builds (around 200 in this deployment) during evaluation.

infinisil commented 6 years ago

The one and only thing that can't be done with normal derivations is to have the parser actually produce a Nix expression to be used by the evaluation phase. However, during our discussion on IRC it turned out to be very hard to find a good use case for that. Validating strings, or transforming strings into other strings, can be done with a normal derivation.

As a simple example, I implemented @aszlig's function for parsing (IPv4-only) addresses here:

with import <nixpkgs> {};
let
  runHaskell = name: args: deps: code: pkgs.runCommand name ({
    buildInputs = [ (pkgs.haskellPackages.ghcWithPackages deps) ];
  } // args) ''
    runhaskell ${pkgs.writeText "${name}.hs" code}
  '';

  parseErlIpAddr = addr: builtins.readFile (runHaskell "hello" { inherit addr; } (p: [ p.iproute ]) ''
    import Data.IP
    import Data.List (intercalate)
    import System.Environment (getEnv)

    main = do
      inputIP <- getEnv "addr"
      let ip = fromIPv4 (read inputIP)
      let outputIP = "{" ++ (intercalate ", " . map show $ ip) ++ "}"
      out <- getEnv "out"
      writeFile out outputIP
  '');
in
  parseErlIpAddr "1.1.1.1"

$ nix-instantiate --eval
"{1, 1, 1, 1}"

As you can see, it's totally possible to do this kind of thing during the build phase. This even adds the benefit that the results are cached: unless the input changes, it won't have to be run again, making it much faster than having to run a parser 200 times on every build. (Sorry @aszlig, I'm still very impressed by your Nix parser though 👍)

I feel like pkgs/build-support could really use some functions for this kind of stuff, e.g. validate :: BNFString -> String -> Bool, transform :: HaskellLambdaString -> InputString -> String, etc.

And of course I just took Haskell as a simple example, you could use any package you want.

aszlig commented 6 years ago

@Infinisil: I'm using something similar for the tests, where I'm comparing the reference implementation against the Nix implementation.

But that's not the point here: the reason it needed to be run 200 times (initially, of course) was that there were around 200 different addresses. And running that during the evaluation phase every time there is an update in the dependency graph of Erlang, or in your case Haskell (which happens even more frequently), is quite a pain, because evaluation without IFD already takes around 2-3 minutes, and IFD adds more than 10 minutes on top of that.

edolstra commented 6 years ago

This sounds good to me in principle. It would be nice to have an RFC addressing in particular:

infinisil commented 6 years ago

@aszlig My point is that what you were doing is entirely possible at build time; you could even write your own package that just contains a parser emitting the output format you need, in whatever language. Haskell was admittedly a bad example for this due to its large closure.

If you need a Nix expression as output, then you need a builtin, because no derivation could do that. My question is therefore: what's the use case for being able to transform strings into Nix expressions that isn't possible with derivations?

copumpkin commented 6 years ago

@Infinisil we've generally tried to avoid running builds during evaluation though. Calling readFile on the output of a derivation has basically the same downsides as IFD, which is why we don't do that sort of thing much except sometimes "at the root" of a build.

edolstra commented 6 years ago

Import-from-derivation is generally considered harmful for a number of reasons:

A controlled parser builtin would have none of these problems.

infinisil commented 6 years ago

@copumpkin @edolstra Didn't know about this. In that case I'm all in favor of this! I was just going to write some functions that did this using readFile, but now I realize that this would never be merged into nixpkgs, so I won't bother.

copumpkin commented 6 years ago

To be clear, I do want a nice, principled way of doing more general things during evaluation. But I think we need some deeper thought about fundamental building blocks (like #520) to really enable that in a good way, so for now I'll take smaller things that increase the number of cool things we can do in Nix without IFD 😄

aszlig commented 6 years ago

@Infinisil: That's exactly what I was talking about: those IPv4/IPv6 addresses are parsed from a Nix expression back into a Nix expression, so without a builtin like the one proposed, the options are IFD or writing it in Nix itself.

As a side note, I guess it would also be nice to have a counterpart to builtins.earley (or whatever it may be called) for pretty-printing.

taktoa commented 6 years ago

@aszlig I don't think you'd need a builtin for that; Nix is already pretty good at string interpolation, so you could just write it as a function.
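
For example, rendering a parsed IPv4 address back into an Erlang tuple needs nothing beyond interpolation (a throwaway sketch, not an existing nixpkgs helper):

let
  # render a list of integers as an Erlang tuple literal
  toErlTuple = parts: "{${builtins.concatStringsSep ", " (map toString parts)}}";
in
  toErlTuple [ 1 1 1 1 ]   # => "{1, 1, 1, 1}"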


Here are some (mostly out of scope, but potentially interesting) thoughts I just had about Nix, parsing, and builtins:

The main thing preventing someone from easily writing an Earley parser in pure Nix is the lack of a character datatype, which can create a big memory overhead for anything that wants to work over characters. For example:

$ du -h ./big.txt
85K ./big.txt
$ nix-instantiate --eval --strict -E 'with builtins; with rec { inherit (import <nixpkgs> {}) lib; x = readFile ./big.txt; }; deepSeq x (valueSize x)'
246998
$ nix-instantiate --eval --strict -E 'with builtins; with rec { inherit (import <nixpkgs> {}) lib; x = lib.stringToCharacters (readFile ./big.txt); }; deepSeq x (valueSize x)'
8397106

Not to mention the fact that the second command takes 5 seconds whereas the first takes so little time it's in the noise floor of my timer (GNU time).

It does occur to me however that you could probably hack something together using substring (assuming that substring works by cheaply slicing) that does a streaming fold over a Nix string in constant memory.
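
A rough sketch of that idea (whether this actually stays in constant memory depends on how substring and the evaluator behave, which is an assumption here; without tail calls the accumulator can still pile up thunks on large inputs):

let
  # left fold over the characters of a string, one substring slice at a time
  foldChars = f: acc0: s:
    let
      len = builtins.stringLength s;
      go = i: acc:
        if i >= len then acc
        else go (i + 1) (f acc (builtins.substring i 1 s));
    in go 0 acc0;

  countChar = c: s: foldChars (acc: x: if x == c then acc + 1 else acc) 0 s;
in
  countChar "a" "banana"   # => 3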

It seems to me that the only real reason to worry about adding more builtins is the possibility of other Nix evaluators existing (which seems reasonably likely to me: see hnix). Supporting all the builtins is difficult enough on its own, let alone getting exactly the same behavior. I think the real solution here may be to have Nix compile down to a bytecode (which could ideally then be JITed). In the meantime, adding builtins is absolutely a useful hack.


A major open question related to this idea is what to do about unicode. AFAIK currently there is no way to put a unicode character in a string in Nix other than copy-and-pasting; this is already a big issue for writing correct builtins.match regexes. You can't even get around this with fromJSON:

nix-repl> builtins.fromJSON "\"\\u1234\""
error: \u characters in JSON strings are currently not supported

Solving this requires, at the very least, a builtins.setByteAtIndex :: Byte -> Int -> Text -> Text function or better support for unicode escapes in string literals.

taktoa commented 6 years ago

Similarly to the lib.types.bnf idea, perhaps it would be a good idea to write a Nix function that takes a JSON Schema as input (Nix is practically a superset of JSON, so a Nix value would work equally well) and returns a NixOS module option type that validates against that schema. This wouldn't require any new builtins.
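
A rough sketch of what that mapping could look like, handling only a tiny subset of JSON Schema (just an illustration, not a worked-out design):

{ lib }:
let
  # map a (very small) subset of JSON Schema onto NixOS module option types
  schemaToType = schema:
    {
      string  = lib.types.str;
      integer = lib.types.int;
      boolean = lib.types.bool;
      array   = lib.types.listOf (schemaToType schema.items);
      object  = lib.types.attrsOf (schemaToType (schema.additionalProperties or { type = "string"; }));
    }.${schema.type} or (throw "unsupported schema type: ${schema.type}");
    # e.g. schemaToType { type = "array"; items.type = "string"; }
    #   => lib.types.listOf lib.types.str
in
  schemaToType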

taktoa commented 6 years ago

Also, I've started work on an RFC here.

nbp commented 6 years ago

Before thinking of integrating a parser (and wasting multiple years), can we first think of making a tokenizer? From what I understand the next release of Nix is going to break the rust-overlay, because the builtins.match implementation regressed by adding a dummy limitation.

If we have a tokenizer, then we can iterate on building such a generic parser in Nix and see whether there is a use case for it.

Also note that a tokenizer would be faster, as it helps avoid creating temporary attribute sets for the grammar input and for the parser's result, which would otherwise have to be converted into an attribute set that is easier to manipulate.

I already proposed a simple-to-implement way of doing so in #1479, and it would solve a problem that is actually a regression from the previous Nix version.
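
For reference, builtins.split alone already gets surprisingly far as a tokenizer; a minimal sketch (illustrative only, not the interface proposed in #1479):

let
  # matches come back from builtins.split as singleton lists (the capture group);
  # the unmatched text in between (whitespace here) comes back as plain strings
  tokenize = s:
    map builtins.head
      (builtins.filter builtins.isList
        (builtins.split "([0-9]+|[+*/()-])" s));
in
  tokenize "12 + 3 * (40 - 5)"
  # => [ "12" "+" "3" "*" "(" "40" "-" "5" ")" ]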

taktoa commented 6 years ago

@nbp I find it unlikely that a tokenizer would be considerably easier than a parser. In my experience people tend to overestimate the difficulty of using Earley/GLR/... algorithm implementations.

nbp commented 6 years ago

@taktoa I honestly do not think that a parser would be that hard to write in Nix; string manipulation is the most costly part, but once that is done correctly everything should be OK.

Speaking about performance, the parseTOML.nix file can parse a 224KB TOML file in 1.05s and construct an attribute set representing the same data (0.92s when using the builtins.split function).

So having a prototype written in Nix is a way to find issues before it gets implemented in C++.

stale[bot] commented 3 years ago

I marked this as stale due to inactivity.

milahu commented 1 year ago

parseTOML.nix file can parse 224KB of TOML file in 1.05s

more "pure nix" parsers

based on https://github.com/search?l=Nix&type=Repositories&q=parse

edit: added to https://nixos.wiki/wiki/String-parsing_in_Nix