DanilaFe / pegasus

A parser generator for C and Crystal.
MIT License
62 stars 3 forks source link

Precedence, backtracking and so. #4

Open KaneRoot opened 5 years ago

KaneRoot commented 5 years ago

Hello! First, congratulations about your program, I use it right now and I like the design. I do hope the software will be maintained!

Second, I do have a problem. Here is a document I would like to parse:

name:     hello
version:  2.10
versions: 2.10, 2.9, 2.8, 2.7, 2.6

# Lists.
sources:
    - https://my.awesome-project.org/hello/hello-%{version}.tar.gz

packager: John Doe <john@doe.com>
url:      https://my.awesome-project.org/software/hello/

flavors: nls, minimal

# These are blocks of free text.
@configure
    cd hello-%{version}
    ./configure

@build
    cd hello-%{version}
    make

@install
    cd hello-%{version}
    make DESTDIR="$PKG" install

Just a bit of context: it is a recipe for packaging an application.

A minimalistic version of this could be:

# comment
assignment: glob
list:
    - listitem

@freetextblock
    some
    instructions %{variable}

For now, my lexer and grammar look like this:

token cr = /\n/;
token eq = /=/;
token tabs = /\t+/;
token hash = /#/;
token at = /[@]/;
token dash = /-/;
token specialcharacters = /[_*+%"'$`|]+/;
token space = / +/;
token point = /[.]/;
token numbers = /[0-9]+/;
token string = /[a-zA-Z]+/;
token colon = /[:]/;
token coma = /,/;
token open_squarebracket = /[[]/;
token close_squarebracket = /[]]/;
token open_curlybracket = /[{]/;
token close_curlybracket = /[}]/;
token open_anglebracket = /[<]/;
token close_anglebracket = /[>]/;
token slash = /\//;

rule S = lines;
rule lines = line cr lines | cr lines | line;
rule line = freetextblock | assignment | list | assignment comment | comment;
rule freetextblock = at multiplealphanum cr freetext;
rule freetext = tabs hashglob cr freetext?;
rule assignment = string colon glob;
rule list = string colon cr listitem;
rule listitem = tabs dash glob cr listitem? | tabs comment cr listitem;
rule multiplealphanum = alphanum multiplealphanum?;
rule comment = hash glob;

rule hashglob = hash hashglob | any hashglob | any | hash;
rule glob = any glob | any;
rule any = alphanum
    | space | point | coma | colon | eq | at | slash | dash
    | specialcharacters
    |  open_squarebracket  |  open_curlybracket  |  open_anglebracket
    | close_squarebracket  | close_curlybracket  | close_anglebracket;
rule alphanum = string | numbers;

This grammar works but I have to do a carriage return after a listitem or freetext. This carriage return is mandatory but if I want to remove it in the rule I will have shift-reduce problems (this is probably conflicting with lines). Same thing for the freetext rule.

I would like to replace them by something like:

rule freetext = tabs hashglob cr freetext | tabs hashglob;
rule listitem =
    tabs dash glob cr listitem
    | tabs dash glob
    | tabs comment cr listitem;

I think that it could be simpler if I could prioritize a branch instead of another. In this example, prioritize listitem instead of lines so my grammar tree would look like this:

S -> lines -> line -> list -> listitem -> listitem -> listitem
     lines -> line -> freetextblock -> freetext -> freetext
     lines -> line -> assignment -> ...

So. Is there a way not to require a carriage return at the end of a freetext and listitem, and still have a good grammar tree?

As I understand, grammar tools often add precedence, backtracking or other mechanisms to simplify the grammar. I think it could be useful in my case. Is this on the roadmap? :)

Thanks a lot for reading me up to this point, and thanks for the help.

DanilaFe commented 5 years ago

Hi!

Thanks a lot for using pegasus. I'm still trying to understand the specifics of your grammar, but from what I understand, you'd like to give precedence to certain rules in your grammar to prevent conflicts - specifically, in this case, listitem conflicts with lines. This appears to be possible at first glance. I've not run into issues like this, which is why I had not considered adding precedence.

Backtracking, on the other hand, is trickier. Pegasus uses the LALR(1) parsing algorithm, which does not backtrack. Certain tools (bison, for sure) use GLR rather than LALR(1), which may allow backtracking (or at least something resembling backtracking). I would rather not switch pegasus' parsing algorithm, as this is a lot of work.

Thanks again for your interest, and please let me know if I understood you correctly!

KaneRoot commented 5 years ago

First, thanks for your quick reply. Second, yes, you did understand.

There must be a way of doing what I want with LALR(1), but I'm no expert. I have to investigate further. This problem is not related to pegasus, much more to LALR(1) grammars in general. Precedence could help in this case, I think. I will follow your project and improve my grammar when (or if) this feature comes up.

Do you have an example of grammar you use for a real project, maybe a real-world language grammar?

(further details on my problem, feel free to skip)

I just want to get items of a block in the same branch of the tree. So a list as:

sources:
    - url1
    - url2
    - url3

Should be seen as a simple branch containing all elements in order.

Same thing for a block of free text such as:

@build
    cd %{build_dir}/
    make

These blocks are well-matched since the free text block must start with an @.

This currently works with my grammar, but I did included a mandatory carriage return for each item of these blocks (see listitem and freetext rules). I did so to avoid going back to the line rule instead of continuing respectively in the freetextblock and list branches. But this forces me to add a carriage return after each block.

List items and free text lines are no different grammars, so it will conflict if there are no part of a block anymore. (I can technically hack this, since free text means "shell commands" in my case, so it should not start with a dash, but this is not elegant at all).

DanilaFe commented 5 years ago

I'm sure there's a way to get it working with LALR, but whether it's convenient without modification to the algorithm, I'm not sure. If precedence is a simple enough modification to make (it appears as such at first glance), I'd be happy to implement it so that defining grammars is more convenient.

I'm not sure if this is "real-world" enough, but here is my grammar for a functional language compiler that I wrote for a class project. Check out the programs folder to see examples of what it parses.

Regarding your problem: you add the cr as a delimiter for listitem, because if it's not there, the parser is unsure of whether to continue parsing listitem or to reduce and go back to parsing line? Rule precedence would certainly help resolve the conflict if that's the issue, but I can see it potentially leading to unexpected results. Nonetheless, it's a good idea to experiment with precedence as a feature.