Closed: jkm closed this issue 12 years ago.
On Mon, Aug 20, 2012 at 12:46 PM, jkm notifications@github.com wrote:
I stumbled over strange behavior and I believe the cause is the built-in Identifier rule:
Identifier <~ Alpha Alphanum*
This rule ignores spacing. That's why "a something" is recognized as an Identifier, which is wrong. The rule should be
Identifier <- ~(Alpha Alphanum*)
I'm not completely confident that I'm right. I'm sorry for not writing a test case and verifying it further.
These two rules are identical; the same code is generated (or should be). I tested the first definition of Identifier on "a name" and it matches only the "a" part.
Spaces are never consumed automatically, except if you use the special 'space-arrow': < (less-than space). Maybe you used this arrow where <- would be more appropriate?
I see. Thanks. I will check my grammar. I'm using < heavily. Probably this is the problem.
I can confirm that Identifier was used in rules with space-arrows. Unfortunately, in these rules there are other parts where I want the spaces to be ignored. That's why I'm using "my" Identifier rule now.
On Tue, Aug 21, 2012 at 10:47 AM, jkm notifications@github.com wrote:
I can confirm that Identifier was used in rules with space-arrows. Unfortunately, in these rules there are other parts where I want the spaces to be ignored. That's why I'm using "my" Identifier rule now.
OK. The new engine I'm writing doesn't have this diffusion of space-munching. In any case, I realized recently the definition is a bit restrictive: I limit the letters to a-zA-Z, but D allows for more (you can have UTF-8 identifiers in your code with no problem). I'll leave my own definition for now, though.
Very true. You have to express universal alphas. BTW this also holds for C (at least since C99, probably even before that) and I believe even C++. Regarding parsing D: you also have to make sure that floating point literals are valid; at least http://dlang.org/lex.html mentions this. Currently I'm trying to encode left associativity (of operators) in my PEG. Because left recursion is not allowed in PEGs, this is troublesome. You mention this on some tutorial page. But I believe it is possible to express.
Is it possible to encode left associativity directly in a peg? Everything I could come up with or found on the internet indicates that there is no possibility to express left associativity. How should one tackle this?
On Wed, Aug 22, 2012 at 10:52 AM, jkm notifications@github.com wrote:
Very true. You have to express universal alphas. BTW this also holds for C (at least since C99, probably even before that) and I believe even C++.
I think I can use the xml code (file pegged.examples.xml2.d), which defines long series of UTF sequences. Those are universal alphas, I think.
Regarding parsing D: You also have to make sure that floating point literals are valid. At least http://dlang.org/lex.html mentions this.
I have a problem with float and integer literals with PEG. Right now, the D grammar gives precedence to integer, but the D grammar is made with the maximum munch principle, so 12.34 is tested as an integer, then as a float and the float is kept because it consumes 5 chars. So, I just changed my local copy of the D grammar: I put FloatLiteral before IntegerLiteral everywhere. The trouble then is that 12 is parsed as a valid FloatLiteral. So, I changed the float literal rule to make the decimal separator compulsory. So 12 is an integer, whereas 12. is a float.
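The ordered-choice fix described here can be sketched outside of Pegged. A hypothetical Python illustration (the rule names FloatLiteral/IntegerLiteral mirror the D grammar; the regexes are my assumption): the float alternative is tried first and requires the decimal separator, so "12" falls through to the integer rule while "12." matches as a float.

```python
import re

# Prioritized choice, PEG-style: FloatLiteral is tried before IntegerLiteral,
# and the float rule makes the decimal point compulsory.
FLOAT_RE = re.compile(r'\d+\.\d*')   # decimal separator required
INT_RE = re.compile(r'\d+')

def classify_literal(text):
    m = FLOAT_RE.match(text)
    if m:  # first alternative wins, as in a PEG ordered choice
        return ('FloatLiteral', m.group())
    m = INT_RE.match(text)
    if m:
        return ('IntegerLiteral', m.group())
    return None
```

With this ordering, "12.34" and "12." classify as floats and "12" stays an integer, matching the behavior described above.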
Btw, I found half a dozen mistakes (left recursion, mostly) in my take on the D grammar. No doubt there are many more. But at least now it parses
module mymod;
int i,j;
int foo(int i, int j) { return i+j;}
void main()
{}
3 hours ago, it choked on the function declaration.
I still have trouble understanding the way the grammar deals with
double[][] d;
The published grammar on the website does not make sense.
On Wed, Aug 22, 2012 at 4:49 PM, jkm notifications@github.com wrote:
Is it possible to encode left associativity directly in a peg? Everything I could come up with or found on the internet indicates that there is no possibility to express left associativity. How should one tackle this?
I don't know. Indeed, the arithmetic grammars I use tend to parse 1 - 2 - 3 - 4 as 1 - (2 - (3 - 4)).
One easy way would be to act on the parse tree. Use a tree transformation at the end, or maybe insert a semantic action?
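A minimal sketch of such a tree transformation (hypothetical Python, not Pegged's API): a right-recursive parse of 1 - 2 - 3 - 4 yields a right-leaning tree, and a rotation pass rebuilds the left-associative shape before evaluation.

```python
# A right-recursive descent parser naturally produces
#   ('-', 1, ('-', 2, ('-', 3, 4)))
# Re-associating rotates it into
#   ('-', ('-', ('-', 1, 2), 3), 4)
# which then evaluates left to right.

def reassoc_left(tree):
    """Rotate a right-leaning chain of one left-associative operator."""
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    # While the right child uses the same operator, rotate it leftwards.
    while isinstance(right, tuple) and right[0] == op:
        _, rl, rr = right
        left = (op, left, rl)
        right = rr
    return (op, reassoc_left(left), reassoc_left(right))

def evaluate(tree):
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    if op == '-':
        return evaluate(left) - evaluate(right)
    raise ValueError(op)
```

Evaluating the raw tree gives the right-associative -2; after the rotation it gives the expected left-associative -8.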
Damn, that's troubling and gives a bad impression with my arithmetic examples.
Yes. The specification of floating point literals is quite complicated. I'm not sure whether it is even considered correct, because when UFCS was implemented it was decided that 4.f is no longer a valid floating point literal. The same goes for 4._0 etc.
Regarding left associativity I'm unsure as well. It really seems to be a general problem for PEGs. I know I can always transform the tree. But I prefer to have everything encoded in the grammar. At least this would be my ideal solution. Trying semantic actions may work.
On Wed, Aug 22, 2012 at 5:41 PM, jkm notifications@github.com wrote:
Regarding left associativity I'm unsure as well. It really seems to be a general problem for PEGs. I know I can always transform the tree. But I prefer to have everything encoded in the grammar. At least this would be my ideal solution. Trying semantic actions may work.
I'll have to read the Dragon book again, because LL(1) grammars have the same problem. IIRC, post-treatment was a possible solution, apart from using a precedence parser.
Of course, Pegged could be transformed (extended) to generate a precedence parser, or LL(1) or LALR(1).
LL(1) grammars have this problem with right associativity then, don't they? But you can have both left-associative and right-associative operators. In D, ^^ is the only right-associative arithmetic operator. There is also *= etc., which are right-associative. There must be a way to transform left recursion into right recursion and vice versa. I just have to find it ...
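The transformation being looked for here is the textbook elimination of immediate left recursion: A -> A a | b becomes A -> b A' with A' -> a A' | eps. A hypothetical Python sketch, with rules given as tuples of symbols:

```python
# Eliminate immediate left recursion from one rule.
#   A -> A a | b   becomes   A -> b A'  and  A' -> a A' | eps
# The rewritten grammar accepts the same language, but builds a
# right-leaning tree -- hence the need for post-processing to recover
# left associativity.

def eliminate_left_recursion(name, productions):
    """productions: list of tuples of symbols; returns a dict of new rules."""
    recursive = [p[1:] for p in productions if p and p[0] == name]
    plain = [p for p in productions if not p or p[0] != name]
    if not recursive:
        return {name: productions}          # nothing to do
    tail = name + "'"
    return {
        name: [p + (tail,) for p in plain],
        tail: [p + (tail,) for p in recursive] + [()],  # () is epsilon
    }
```

For example, E -> E - T | T becomes E -> T E' and E' -> - T E' | eps.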
I believe I found the answer: it is always possible to express left recursion using right recursion and vice versa while keeping the languages equivalent. But it is not possible to have the same parse trees. That means either way you have to do some manual post-processing. Since most binary operators are left-associative, an LR grammar should be preferred from this point of view.
Given the PEG, you then want to generate an LR parser most of the time.
I wonder how ANTLR handles left associativity since it generates an LL(*) parser.
Note, that I'm no expert. This may be all wrong.
On Thu, Aug 23, 2012 at 12:45 PM, jkm notifications@github.com wrote:
I believe I found the answer: it is always possible to express left recursion using right recursion and vice versa while keeping the languages equivalent. But it is not possible to have the same parse trees. That means either way you have to do some manual post-processing. Since most binary operators are left-associative, an LR grammar should be preferred from this point of view.
That's what I had in mind also.
Given the PEG, you then want to generate an LR parser most of the time.
I plan to add LALR as another parsing engine as soon as I can. That might take a few months, though.
I wonder how ANTLR handles left associativity since it generates an LL(*) parser.
I also wondered. Note that other parser generators, like Bison / Yacc use LALR.
Maybe we should post our confusion on the D list.
What I didn't know is that from a PEG you can generate both LL and LR parsers. Since you cannot express left recursion in PEGs, how can you generate an LR parser, that is, how can you generate a parser that handles left recursion just fine? With CFGs you can express both left and right recursion.
On Thu, Aug 23, 2012 at 4:10 PM, jkm notifications@github.com wrote:
Maybe we should post our confusion on the D list.
You can also use Nick Sabalausky's Goldie, which is LALR.
What I didn't know is that from a PEG you can generate both LL and LR parsers. Since you cannot express left recursion in PEGs, how can you generate an LR parser, that is, how can you generate a parser that handles left recursion just fine? With CFGs you can express both left and right recursion.
Oh, I don't know how to do that with a PEG. But standard EBNF is OK to describe a grammar also, and generating an LALR(1) parser from it is a known process. If I ever do that, I guess Pegged will not be PEG-based. I should call it Paged instead :) The way a PEG grammar is read right now can easily be configured to read EBNF instead. Heck, the major difference is changing '/' to '|' to express non-prioritized choice.
My main problem is that I'd then lose what I found fun with PEGs: since lexical and syntactic analysis are coupled in one operation (no lexer), composing grammars is possible and easy. I'm working on improving that, btw. And composing grammars means doing with them what you can do with functions: decomposing your workload into small, easily maintained parts that can be plugged into one another depending on the current need.
Also, that means getting a lexer somewhere...
Btw, there is also an old (2007) project called aPaGeD, which is a parser generator in D1. I never tested it, though.
Does Goldie support parsing at compile time?
I can imagine the benefits of combining lexing and parsing in one stage. But having them separate is also nice. But I guess one can write a PEG that somehow separates lexing and parsing. I remember reading that having significant whitespace for indentation (as in Python) can be implemented in the lexer only. In that sense I find the conceptual separation useful.
On Thu, Aug 23, 2012 at 5:06 PM, jkm notifications@github.com wrote:
Does Goldie support parsing at compile time?
I don't think so. AFAIK, Pegged is the only one to do that.
I can imagine the benefits of combining lexing and parsing in one stage. But having them separate is also nice.
Yes, if only for the simplification in the grammar (no need to code literals) and the speed improvements it brings.
But I guess one can write a PEG that somehow separates lexing and parsing.
I guess so. In fact, if std.lexer is incorporated into Phobos, it will deliver a token range. The only adaptation I have to make then is to get Pegged to accept a range of tokens as valid input. The PEG grammar then only deals with tokens.
I remember reading that having significant whitespace for indentation (as in Python) can be implemented in the lexer only. In that sense I find the conceptual separation useful.
I don't know. I did it with a grammar (not Python, but Markdown, I think), I don't know how to do that at the lexical stage.
Philippe Sigaud wrote:
On Thu, Aug 23, 2012 at 5:06 PM, jkm notifications@github.com wrote:
Does Goldie support parsing at compile time?
I don't think so. AFAIK, Pegged is the only one to do that.
And this is so great.
I can imagine the benefits of combining lexing and parsing in one stage. But having them separate is also nice.
Yes, if only for the simplification in the grammar (no need to code literals) and the speed improvements it brings.
But I guess one can write a PEG that somehow separates lexing and parsing.
I guess so. In fact, if std.lexer is incorporated into Phobos, it will deliver a token range. The only adaptation I have to make then is to get Pegged to accept a range of tokens as valid input. The PEG grammar then only deals with tokens.
I hope P(a/eg)ged will make it at some point into Phobos. You are doing tremendous work.
I remember reading that having significant whitespace for indentation (as in Python) can be implemented in the lexer only. In that sense I find the conceptual separation useful.
I don't know. I did it with a grammar (not Python, but Markdown, I think), I don't know how to do that at the lexical stage.
The section "How does the compiler parse the indentation?" at http://www.secnetix.de/olli/Python/block_indentation.hawk explains this superficially. But I think this works.
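The trick that page describes can be sketched as follows (hypothetical Python; the token names INDENT, DEDENT, and LINE are illustrative): the lexer keeps a stack of indentation widths and emits synthetic tokens, so the grammar itself never has to mention whitespace.

```python
# Indentation-as-tokens: the lexer tracks a stack of indentation widths
# and emits INDENT when a line is deeper than the stack top, and one
# DEDENT per popped level when it is shallower.

def indent_tokens(lines):
    stack = [0]
    tokens = []
    for line in lines:
        stripped = line.lstrip(' ')
        if not stripped:            # blank lines carry no indentation info
            continue
        width = len(line) - len(stripped)
        if width > stack[-1]:
            stack.append(width)
            tokens.append('INDENT')
        while width < stack[-1]:
            stack.pop()
            tokens.append('DEDENT')
        tokens.append(('LINE', stripped))
    while stack[-1] > 0:            # close any blocks still open at EOF
        stack.pop()
        tokens.append('DEDENT')
    return tokens
```

A grammar consuming this token stream can then treat INDENT/DEDENT exactly like braces, which is why the whole mechanism can live in the lexer.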
I thought about it a bit more while driving :)
Using the Kleene star, it's easy to get correct left-association. See below, and print the associated parse trees if you wish. I added a little interpreter and some tests to verify. It seems to work OK :)
import pegged.grammar;
import std.conv : to; // for to!float in the interpreter below
mixin(grammar(`
Arithmetic:
Term < Factor (Add / Sub)*
Add < "+" Factor # Only there to facilitate the interpretation
Sub < "-" Factor # Ditto
Factor < Primary (Mul / Div)*
Mul < "*" Primary # Ditto
Div < "/" Primary # Ditto
Primary < Parens / Neg / Number
Parens < :"(" Term :")"
Neg < "-" Primary
Number < ~([0-9]+)
`));
float interpreter(string expr)
{
auto p = Arithmetic.parse(expr);
// writeln(p);
float value(ParseTree p)
{
switch (p.name)
{
case "Arithmetic.Term":
float v = 0.0;
foreach(child; p.children) v += value(child);
return v;
case "Arithmetic.Add":
return value(p.children[0]);
case "Arithmetic.Sub":
return -value(p.children[0]);
case "Arithmetic.Factor":
float v = 1.0;
foreach(child; p.children) v *= value(child);
return v;
case "Arithmetic.Mul":
return value(p.children[0]);
case "Arithmetic.Div":
return 1.0/value(p.children[0]);
case "Arithmetic.Primary":
return value(p.children[0]);
case "Arithmetic.Parens":
return value(p.children[0]);
case "Arithmetic.Neg":
return -value(p.children[0]);
case "Arithmetic.Number":
return to!float(p.capture[0]);
default:
return float.nan;
}
}
return value(p);
}
void main()
{
assert(interpreter("1") == 1.0);
assert(interpreter("-1") == -1.0);
assert(interpreter("1+1") == 2.0);
assert(interpreter("1-1") == 0.0);
assert(interpreter("1+1+1") == 3.0);
assert(interpreter("1-1-1") == -1.0);
assert(interpreter("1+1-1") == 1.0);
assert(interpreter("1-1+1") == 1.0);
assert(interpreter("-1+1+1") == 1.0);
assert(interpreter("(-1+1)+1") == 1.0);
assert(interpreter("-1+(1+1)") == 1.0);
assert(interpreter("(-1+1+1)") == 1.0);
assert(interpreter("1-(1-1)") == 1.0);
assert(interpreter("1*1") == 1.0);
assert(interpreter("1/1") == 1.0);
assert(interpreter("-1*1") == -1.0);
assert(interpreter("-1/1") == -1.0);
assert(interpreter("1+2*3") == 7.0);
assert(interpreter("1-2*3") == -5.0);
assert(interpreter("-1-2*-3") == 5.0);
assert(interpreter("-1+2*-3") == -7.0);
assert(interpreter("1/2/(1/2)") == 1.0);
assert(interpreter("1/2/1/2") == .25);
assert(interpreter("1-2*3-2*3") == -11.0);
assert(interpreter("2*3*3-3*3+3*4") == 21.0);
assert(interpreter("2*3*3-3*(3+3*4)") == -27.0);
}
The section "How does the compiler parse the indentation?" at http://www.secnetix.de/olli/Python/block_indentation.hawk explains this superficially. But I think this works.
Thanks for the link. (reads) OK, I see, nice idea!
Nice. That should work. I had something similar but I wanted to have a binary tree. I believe the above grammar will result in non-binary trees. But that should be actually fine. I will definitely try. I implemented Dijkstra's Shunting-yard algorithm since Wikipedia says that gcc implements expression parsing using operator precedence parsing (https://en.wikipedia.org/wiki/Operator-precedence_parser) for efficiency reasons. But I believe with your latest speed improvements I won't need that. Let's see.
On Mon, Aug 27, 2012 at 10:35 AM, jkm notifications@github.com wrote:
Nice. That should work. I had something similar but I wanted to have a binary tree. I believe the above grammar will result in non-binary trees. But that should be actually fine. I will definitely try.
Yes, my example grammar results in an n-ary tree. But binary is a bit restrictive: what about unary operations like +a, -a, or !a? What about ternary operations (? :) and even function calls?
x < sin(y) && predicate(x,y,z)
I implemented Dijkstra's Shunting-yard algorithm since Wikipedia says that gcc implements expression parsing using operator precedence parsing ( https://en.wikipedia.org/wiki/Operator-precedence_parser) for efficiency reasons. But I believe with your latest speed improvements I won't need that. Let's see.
I'm interested, since I've heard of this algorithm but never implemented it myself. There is a small but common class of grammars (operator grammars, a bit larger than the name implies) that can be addressed by operator precedence and result in linear-time parsing algorithms.
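For reference, the shunting-yard algorithm mentioned above can be sketched in a few lines (hypothetical Python; the precedence table and the right-associative ^ are illustrative assumptions): a precedence table drives one left-to-right pass producing postfix (RPN), which a small stack machine then evaluates. Left associativity falls out of when operators are popped.

```python
# (precedence, associativity) per operator; ^ is right-associative.
PREC = {'+': (1, 'L'), '-': (1, 'L'), '*': (2, 'L'), '/': (2, 'L'), '^': (3, 'R')}

def to_postfix(tokens):
    out, ops = [], []
    for tok in tokens:
        if tok in PREC:
            p, assoc = PREC[tok]
            # Pop operators that bind at least as tightly (only strictly
            # tighter for a right-associative operator like ^).
            while ops and (PREC[ops[-1]][0] > p or
                           (PREC[ops[-1]][0] == p and assoc == 'L')):
                out.append(ops.pop())
            ops.append(tok)
        else:
            out.append(float(tok))
    return out + ops[::-1]

def eval_postfix(rpn):
    st = []
    for tok in rpn:
        if tok in PREC:  # operators are strings, operands are floats
            b, a = st.pop(), st.pop()
            st.append({'+': a + b, '-': a - b, '*': a * b,
                       '/': a / b, '^': a ** b}[tok])
        else:
            st.append(tok)
    return st[0]
```

So "1 - 2 - 3 - 4" evaluates as ((1-2)-3)-4, while "2 ^ 3 ^ 2" groups to the right, with no grammar recursion involved at all.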
My dream would be to decompose a grammar into simpler parts and adapt the resulting parser. Most grammars can already be partially addressed by a simple regular grammar (i.e., regexes). Only the recursive parts need to be context-free. So I guess it's possible, particularly with an easy way for grammars to call one another.
Philippe Sigaud wrote:
On Mon, Aug 27, 2012 at 10:35 AM, jkm notifications@github.com wrote:
Nice. That should work. I had something similar but I wanted to have a binary tree. I believe the above grammar will result in non-binary trees. But that should be actually fine. I will definitely try.
Yes, my example grammar results in an n-ary tree. But binary is a bit restrictive: what about unary operations like +a, -a, or !a? What about ternary operations (? :) and even function calls?
x < sin(y) && predicate(x,y,z)
A binary tree has at most two children, so unary operators are fine. I'm not sure about function calls (they could probably be modeled using the , operator). n-ary operators (with n > 2) are not needed in my case. BTW I tried your grammar. Is it possible to have a tree with proper rule names, i.e. Term should either be Add or Sub? I see that you currently fix this in the provided value function.
I implemented Dijkstra's Shunting-yard algorithm since Wikipedia says that gcc implements expression parsing using operator precedence parsing ( https://en.wikipedia.org/wiki/Operator-precedence_parser) for efficiency reasons. But I believe with your latest speed improvements I won't need that. Let's see.
I'm interested, since I've heard of this algorithm but never implemented it myself. There is a small but common class of grammars (operator grammars, a bit larger than the name implies) that can be addressed by operator precedence and result in linear-time parsing algorithms.
My dream would be to decompose a grammar into simpler parts and adapt the resulting parser. Most grammars can already be partially addressed by a simple regular grammar (i.e., regexes). Only the recursive parts need to be context-free. So I guess it's possible, particularly with an easy way for grammars to call one another.
That is a nice idea. What I'm doing currently is to use Pegged to tokenize expressions (and let it handle all other parsing) and let the shunting yard algorithm do the expression parsing.