bitwalker / combine

A parser combinator library for Elixir projects
MIT License

Recursive grammars #12

Open obrok opened 8 years ago

obrok commented 8 years ago

So I was thinking about Combine a bit recently, and it seems to me that, unlike its Haskell older brother, it has a problem parsing recursive grammars. Take this example of a parenthesis-matching parser that's "translated" from Parsec's documentation:

defmodule Parens do
  use Combine

  def parser, do: many(parens())
  # parens/0 calls itself while *building* the parser, before any input is seen
  defp parens, do: sequence([char("("), many(parens()), char(")")])
end

Calling Parens.parser of course results in an infinite loop. I think this works in Parsec because everything in Haskell is a thunk, so the individual parsers are only evaluated as needed. Here, however, we're trying to build the whole (infinite) parser at once. Or maybe there's a way to achieve this currently?
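
To see the evaluation-order problem without Combine in the picture, here's a toy illustration (my own sketch, not library code):

# Elixir evaluates arguments before the call, so grammar/0 recurses forever
# while constructing the value, before any input is ever seen.
defmodule EagerGrammar do
  def grammar, do: {:seq, ["(", grammar(), ")"]}           # infinite loop
end

# Wrapping the self-reference in a zero-arity fun defers evaluation until
# someone calls the thunk - the deferral Haskell gives you for free.
defmodule DeferredGrammar do
  def grammar, do: {:seq, ["(", fn -> grammar() end, ")"]} # returns immediately
end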

sasa1977 commented 8 years ago

Not sure if it's the best approach, but here's how I quickly solved it:

# Defers building the parser: the zero-arity generator isn't invoked until
# this parser is actually applied to an input state.
defparser lazy(%ParserState{status: :ok} = state, generator) do
  (generator.()).(state)
end
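
Conceptually, lazy is just a parser that calls the zero-arity generator at parse time and applies the result, so the recursive reference isn't evaluated until input actually reaches it. A hand-rolled sketch of that mechanism (not what defparser actually expands to):

lazy_sketch = fn generator ->
  fn state -> (generator.()).(state) end
end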

Then, with this module:

defmodule Test do
  use Combine

  def parse(expr) do
    Combine.parse(expr, parens())
  end

  defp parens do
    # wrapping the recursive reference in lazy/1 means parens/0 isn't
    # re-entered until this parser actually runs against input
    sequence([char("("), many(lazy(fn -> parens() end)), char(")")])
  end
end

iex(1)> Test.parse("()")
[["(", [], ")"]]

iex(2)> Test.parse("(())")
[["(", [["(", [], ")"]], ")"]]

iex(3)> Test.parse("((((()))))")
[["(", [["(", [["(", [["(", [["(", [], ")"]], ")"]], ")"]], ")"]], ")"]]

iex(4)> Test.parse("(()")
{:error, "Expected `)`, but hit end of input."}

iex(5)> Test.parse(")")
{:error, "Expected `(`, but found `)` at line 1, column 1."}

iex(6)> Test.parse("((")
{:error, "Expected `)`, but found `(` at line 1, column 2."}

I wonder, though, if parsers should be lazy by default, so grammar devs don't need to worry about these internals. The major downside I see is that it might incur a performance penalty. For the case we're working on that's not relevant, but for other cases (e.g. parsing dates in many simultaneous HTTP requests) it might be a problem.
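
For a sense of where that penalty comes from, here's a rough iex sketch (my own construction, not Combine internals) comparing a function applied directly against one routed through a fresh thunk on every call, which is what lazy-by-default would do for every combinator:

direct  = fn state -> state end
thunked = fn state -> (fn -> direct end).().(state) end

# one extra closure allocation + one extra call per application
{t_direct, _}  = :timer.tc(fn -> Enum.each(1..1_000_000, fn _ -> direct.(:ok) end) end)
{t_thunked, _} = :timer.tc(fn -> Enum.each(1..1_000_000, fn _ -> thunked.(:ok) end) end)
IO.puts("direct: #{t_direct}us, thunked: #{t_thunked}us")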

bitwalker commented 8 years ago

Yeah, it's a hard problem to solve. I've taken a couple of cracks at the performance problem, but I always end up hitting something I don't like. I did a lot of work reducing the number of function calls each parser makes, because when parsing large numbers of items, each additional function call has a noticeable impact - at least with things like datetimes/HTTP requests (hence the benchmarks in the repo). It probably doesn't make much difference when parsing one large input, though, which is part of the problem too - depending on your use case, the optimization either helps a great deal or has no noticeable effect.

I'd prefer to make laziness opt-in if possible - or the reverse: make laziness the default and allow opting out of it. Given @sasa1977's example, it's clear the former is easier to accomplish, and it also has the added benefit of being backwards compatible. I'm certainly open to suggestions on it, though. Keep in mind that timex currently uses combine for date/time parsing, so whatever changes we make, I'd like to at least keep the small-parser/large-set-of-items use case in mind.

sasa1977 commented 8 years ago

I see two options at the moment:

  1. Add the lazy parser.
  2. Make two versions of each module. For example, we could have Combine.Base.Eager and Combine.Base.Lazy. A dev would then use one or the other to get macros with the desired properties - eager for performance, lazy for recursive grammars.

I guess option 1 is less work and more flexible, since devs can fine-tune by opting into laziness only where they really need it.

Somewhat surprisingly, in our project we still don't need to handle recursive grammars (though I'm sure it will happen eventually), so as far as we're concerned, this can wait. It's still worth thinking about possible approaches, since @obrok makes a good point that it's not obvious how to parse a recursive grammar at the moment.

bitwalker commented 8 years ago

Good points. I think option 1 at least gives us a path for tackling the problem in the near term, even if it's not ideal, and it gives us room to think about the best way to ultimately handle recursive grammars. If it were possible to make lazy-by-default perform on par with eager-by-default, that would be awesome, but sadly my experiments in that direction have always ended in failure.

ijcd commented 6 years ago

Happy to report that this seems to be working pretty well. It's a lot less typing if you just put the lazy wrapper at the top-most re-entrant rule:

  def erl_typeterm() do
    lazy(fn ->
      choice([
        annotated_variable(),
        module_ref(),
        list_type(),
        call(),
        symbol(),
        variable()
      ])
    end)
  end

...

  def tuple() do
    sequence([
      ignore(string("{")),
      sep_by(erl_typeterm(), sep_and_whitespace(",")),
      ignore(string("}"))
    ])
    |> map(fn [terms] -> {:tuple, terms} end)
    |> make_ast(:tuple)
  end

bitwalker commented 6 years ago

Nice! One thing I've been exploring recently is implementing an Earley parser (I'm building one for my little language, and found some interesting things that could potentially be used here). Namely, we could compile the fully built parser to a grammar, which the Earley parser could then recognize/parse. The primary benefit is that such a parser can handle any unambiguous LL(k)/LR(k) grammar in linear time, some ambiguous LR(k) grammars in linear time, and any context-free grammar in at worst cubic time. That would put it on par with things like yacc/bison. Hand-written recursive descent parsers will almost always be faster, and PEG/packrat parsers are competitive with those but have restrictions that make them unsuitable for some cases - and neither is within the scope of this project anyway.
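
As a rough sketch of what "compiling to a grammar" could mean for the parens example earlier in this thread (the representation below is illustrative, not settled design):

# One possible data form an Earley backend could consume: the combinator
# tree becomes plain CFG productions, with many/1 desugared into ordinary
# right-recursive rules.
grammar = %{
  parens:      [["(", :many_parens, ")"]],
  many_parens: [[],                        # zero occurrences
                [:parens, :many_parens]]   # one more, then the rest
}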

I'm waiting to see how the parser for my compiler works out in practice. If it seems like a good option, I should be able to port the code to Elixir quite easily, rewrite the current internals of Combine to compile grammars, and add a new pair of backends for lexing/parsing based on those grammars (one for binary data, one for textual data).