Whitespace woes. - Githubissues

chadjoan commented 12 years ago

enum string testGrammar = ` 
TestGrammar:

Root < 'a' '.'

Spacing <- blank*
`;

import pegged.grammar;
import std.stdio;

mixin(grammar(testGrammar));

void main()
{
    stdout.writefln("%s", TestGrammar.Root("a."));
}

I would expect the above grammar to recognize the example. Instead, it never halts.

On the other hand, this works:

enum string testGrammar = ` 
TestGrammar:

Root < 'a' '.'

Spacing <- blank+
`;

import pegged.grammar;
import std.stdio;

mixin(grammar(testGrammar));

void main()
{
    stdout.writefln("%s", TestGrammar.Root("a."));
}

/+
Prints:
TestGrammar.Root [0, 2]["a", "."]
 +-literal!("a") [0, 1]["a"]
 +-literal!(".") [1, 2]["."]
+/

This surprises me because the Spacing symbol explicit requests at least 1 blank, yet the text it recognizes has zero blanks.

There's another issue that might be the same thing: I am unable to place the right-hand-side of rules on a different line than the lhs like I used to:

enum string testGrammar = ` 
TestGrammar:

Root <
    'a'
    '.'

Spacing <- blank+
`;

import pegged.grammar;
import std.stdio;

mixin(grammar(testGrammar));

void main()
{
    stdout.writefln("%s", TestGrammar.Root("a."));
}

/+
During compilation:
test.d(14): Error: static assert  "Pegged (failure)
 +-Pegged.Grammar (failure)
    +-Pegged.GrammarName [2, 13]["TestGrammar"]
       +-Pegged.Identifier [2, 13]["TestGrammar"]
    +-oneOrMore!(Pegged.Definition) (failure)
       +-Pegged.Definition (failure)
       |  +-Pegged.LhsName [16, 20]["Root"]
       |     +-Pegged.Identifier [16, 20]["Root"]
       |  +-Pegged.Arrow (failure)
       |     +-literal!("< ") failure at line 3, col 5, after "ar:

Root" expected "< ", but got "<
        'a'
        '."
"
+/

This makes it difficult to align rules that are best represented vertically:

`
Branch < '^' ^identifier
    / '->' Node
    / '{' Node+ '}'
`

It might not be clear what's going on there, but there is no possible way for me to tab the alternations over to line up with the '^' terminal (nor would I want to: tabs are terrible/evil for alignment, but great for indentation). I'd rather just put the entire rhs into its own indentation level. That way my editor won't choke on spaces that don't match an indentation level. I'd like to be able to write it this way:

`
Branch < 
      '^' ^identifier
    / '->' Node
    / '{' Node+ '}'
`

Or, in explicit form:

`
Branch < 
{tab}{space}{space}'^' ^identifier
{tab}/ '->' Node
{tab}/ '{' Node+ '}'
`

The Pegged grammar looks like it can handle this, but it doesn't.

Ideally I'd even be able to do something like this:

`
Branch < ThisSymbolNeverMatches
    / '^' ^identifier
    / '->' Node
    / '{' Node+ '}'
`

which would entirely eliminate the desire to have tabs adjacent to spaces (tabs for indentation, spaces for alignment).

`
Branch < []==
    / '^' ^identifier
    / '->' Node
    / '{' Node+ '}'
`

It's the hammer operator, because you can't touch this ;)

This is all from commit 0406fd19e6c6261f2adab213d57e91d608fcf8f9:

commit 0406fd19e6c6261f2adab213d57e91d608fcf8f9
Author: Philippe Sigaud <philippe.sigaud@gmail.com>
Date:   Wed Oct 17 21:15:21 2012 +0200

    Testing callumenator out-of-memory error.

PhilippeSigaud commented 12 years ago

I would expect the above grammar to recognize the example. Instead, it never halts.

Yes, there is no simple way to deal with space-consuming rules and a spacing rule that can match 0 space. I'll have a look at this again.

There's another issue that might be the same thing: I am unable to place the right-hand-side of rules on a different line than the lhs like I used to:

That's a regression because I love to use layout for my rules. Damn, I did not put multi-line rules in my unit-tests, that's why I didn't see it. OK, this one is killjoy, that will be my goal for tonight.

chadjoan commented 12 years ago

Awesome, thanks!

PhilippeSigaud commented 12 years ago

OK, I pushed some commits that should make things better.

The pegged grammar now uses predefined rules for spaces and such.
tabs and end of lines seem to work. See my unit test at the end of this post.
as a long-awaited addition, I made e* and e+ stop parsing when e returns an epsilon (something that matches the empty string). That kills infinite loops: for example with ('a'?)* or ('a'*)*, combinations that would otherwise loop, since 'a'? or 'a'* can succeed while not advancing.
I added a predefined parser name tab, matching \t.
btw: just define a 'unit' space with your Spacing rule: Spacing <- ' ' / '\t'. the space-arrow does the 'zero-or-more' part for you. I updated the docs to make that clearer.

Here is my test for multiline rules:

unittest // Multilines rules
{
    mixin(grammar(`
Indentation:
Rule1 <
'a'
    'b'
'c'
    Rule2
<-
 'd'
Rule3
<
'e'
Rule4 <- 'f' Rule5   # Rule4 ends with 'f', then it's Rule5
<- 'g'

    'h'
`));

    assert(Indentation("abc").successful);
    assert(Indentation.Rule2("d").successful);
    assert(Indentation.Rule3("e").successful);
    assert(Indentation.Rule4("f").successful);
    assert(Indentation.Rule5("gh").successful);
}

chadjoan commented 12 years ago

Works for me! Great!

I love to hear that this gets rid of a source of infinite loops. Those are very confusing. They make debugging difficult because there is no output from the parser in those cases. Good riddance!

I reread the part on space arrow too, and that does read better.

I did notice a confusing part of the documentation when reading about space arrow this time around:

< does not apply the spacing inside e*or e+. This is to avoid nasty bugs for a user caught unaware by space-consuming.

What nasty bug? I also wonder if it's wrong, because I have a rule like Node+ where "Node < ^identifier Branch*" and it happily eats spaces between the Nodes. It's a place where I feel like an example would be very enlightening.

PhilippeSigaud commented 12 years ago

< does not apply the spacing inside e*or e+. This is to avoid nasty bugs for a user caught unaware by space-consuming.

What nasty bug? I also wonder if it's wrong, because I have a rule like Node+ where "Node < ^identifier Branch*" and it happily eats spaces between the Nodes. It's a place where I feel like an example would be very enlightening.

But in this case, Node has space-consuming parts. I meant that with rules like this one:

Rule < "abc" "def"+

Then no space will be matched between the "def"'s.

The question is: as a user, did you think using '< ' would allow putting spaces between the "def"'s? My POV is 'no, let the user do that himself', as you did for 'Node'.

I'll update the docs, though, as I understand how they can be misleading.

callumenator commented 12 years ago

Rule < "abc" "def"+

Then no space will be matched between the "def"'s.

My brain re-writes that rule as: Rule < "abc" "def" ("def" ("def" ("def" ......)?)?)? which would have made me think that spaces would be matched between the "defs", so I would have totally got that wrong :)

chadjoan commented 12 years ago

I was about to say that I didn't really care how it worked, and kinda still don't, because it's easy enough to put blank+ inside repetition expressions. But, if the reasoning is because of intuition, then what callumenator just wrote has some weight.

I was going to say something like "I haven't actually encountered this case" but then realized that I actually did, but saved myself from it accidentally. Indeed, my brain parsed that as callumenator's does: any match of a symbol within a space-arrow rule may have space around it, including repeated matches caused by repetition. Removing that behavior can be done by factoring stuff into a separate rule with a different arrow type.

I would have expected

Root < A B* !.
A <- 'a'
B <- 'b'

to match "ab b bb b"

shrug\ I can write it either way myself, now that I know about it.

I have no idea which intuition would be more common for newbies.

PhilippeSigaud commented 12 years ago

Hey, if both of you expects < to insert spaces even between repeated elements, then so be it. I can change it. My reasoning was:

it's easy to add a space-consuming rule if you want to
with the other solution, you have no direct way to disable space-matching.

Yes, no?

chadjoan commented 12 years ago

Well, Philippe I think you experienced intuition running one way, and callumenator and I experienced it the other way. Our sample set is super small, but this suggests to me that there is no right way.

I think good design practice would say this: do what requires the least special-casing.

I suspect that would be matching spaces between repetition (yes), but I haven't looked at the code responsible for this bit, so I could be wrong.

PhilippeSigaud commented 12 years ago

Well, Philippe I think you experienced intuition running one way, and callumenator and I experienced it the other way. Our sample set is super small, but this suggests to me that there is no right way.

Well, since my customers are asking for space-insertion everywhere, then so be it. In any case, I can create special-case secondary rules:

Rule1 <  A B
B <- C+   # No space in between the C's

Btw, going on with the same thematic: <^ and <% do not distribute ^ and % over all subrules: they wrap the entire rule inside a ^/% call. In a way, the special arrows act like the op=operators in C. So my question is, given:

Rule <% A B C+

That right now does:

Rule <- %(A B C+)

Would you like it to do:

Rule <- %A%B (%C)+

Or not?

PhilippeSigaud commented 12 years ago

Well, Philippe I think you experienced intuition running one way, and callumenator and I experienced it the other way. Our sample set is super small, but this suggests to me that there is no right way.

Well, since my customers are asking for space-insertion everywhere, then so be it.

OK, done. I also added unit tests and a more generic way to modify grammar trees. That way, if someone wants a similar change (having ^ or ; be distributed by the associated arrows), it'll be easier to do.

callumenator commented 12 years ago

I use space consuming rules heavily, so I'm biased. I haven't used the other operators you showed, so I can't comment on them, but I think having the space arrow consume spaces between repeated elements is at least a bit more intuitive.

PhilippeSigaud commented 12 years ago

Now, spaces are inserted everywhere:

Rule1 < A (B ~(~B ~B)+)? C

You'll have Spacing before and after A, all B's and C. I think I could simplify this a little bit. I fear all these inserted parsers will slow down parsing a bit too much. I'd advise testing your parsing speed from time to time with a few manually-inserted Spacings instead of always counting on the space-arrow automaton.

I couldn't find a way to insert less Spacing, as in general they are needed everywhere.

chadjoan commented 12 years ago

Interesting.

This is heading towards something I was eventually going to post on in a different topic: arrow composition and the economics of tokens and operators.

Space arrow is a bit of a tough case because I can't have an arrow that is both a space arrow AND a drop arrow. It'd be nice if there was a token for explicity stating spacy-ness or lack of it. It'd also be nice to stack multiple operators on the same arrow.

As for distribution: I think the distribution is more useful than op(...), simply because it looks very easy to just use op(...) instead of <op when the former is desired, but when distribution is desired, doing everything by hand can be laborious and look terrible afterwards.

Propagation is an interesting operator too: I found a need for it because I wanted a way to define rules that don't survive tree decimation. The propagate operator comes close to this by allowing me to just spam it everywhere the offending rule appears, but it feels like it's on the wrong side of the caller-callee interface. I have a hard time suggesting that <% be re-purposed to mean that the defined rule should never appear in parse trees yet leave the children in: it would be inconsistent with all of the other arrows.

The thought of adding a spacing operator would probably make you sweat. I'm looking at peggedgrammar.d and it seems like you have @ and $ left. I wonder if the token usage could be optimized. There's also a small combinatoric explosion of semantics regarding what gets kept/discarded:

Matches:

Keep this and children. (drop?, keep)
Keep this but not children. (fuse?)
Don't keep this. Keep children. (propagate?)
Don't keep this or children. (discard)

Nodes:

Keep this and children. (keep)
Keep this but not children. (fuse?)
Don't keep this. Keep children. (propagate operator)
Don't keep this or children. (drop, discard)

Some of these might be important but not representable in pegged currently; at least not without some clever combos.

The good side of that is that there is possibly an opportunity to optimize token usage through compositionality.

PhilippeSigaud commented 12 years ago

Well first, I read somewhere that DSL most of the time go through a phase of Cambrian explosion when they are proposed to people other than their original user :) So beware of encoding too much stuff at the syntax level: many things can be obtained by using semantic actions (dropping children, dropping matches, ...)

Anyway, my thought went in the same direction: I'm adding too much stuff at the name-level (I mean, having specially named rules like "discard", "drop"), whereas this is only metadata.

Maybe a possibility would be to add a metadata field to the nodes. This would serve to store information like "should be discarded", "should be kept", etc. Then operators like :, ; or ^ would simply be semantic actions that push this information in the metadata field.

Then and (the sequence engine) and decimateTree would simply look into the metadata to know what to do.

The possibilities:

keeping this entire node or not.
keeping the matches of this not or not.
keeping the children of this node or not.
resetting the .end field or not (&, !)
invert success or not? (not sure about this one)

chadjoan commented 12 years ago

I suspect that I had very quickly internalized these operators as being metadata from the moment I looked that them. So, if I understand you correctly, then that's probably a really good model.

I'm not sure I understand the bit about semantic actions. It might be the same as what I meant about "compositionality". I'm taking it as meaning this: Description by syntax: have two operators, ^ and ; for keep and drop. Description by semantic actions: have an operator ^ for keep. The @ operator accepts another operator as an argument and negates it. Thus @^ means drop.

Maybe another operator-of-operators could be added that determines how deep it is. Maybe that would be $ and we'd call it the "deep" operator: it would make its subject act on all child matches (or nodes?). Yep, determining nodes vs matches is another dimension to describe.

So far this is just for communication's sake. I still don't have any suspicions about what the ideal operator syntax/semantics would look like.

chadjoan commented 12 years ago

Ah heck, I couldn't help but think about it a bit and take a shot.

Perhaps there could be constructs for allowing the grammar to define what the various prefixes mean:

define(:) {drop all nodes matches}
define(%) {drop shallow nodes matches}
define(~) {drop deep matches} & {keep shallow matches} & {drop all nodes}

Here the all flag means "both shallow and deep". Making a definition like

define(%) {drop shallow matches} & { keep shallow matches}

would be an error due to the contradiction.

If I was more clever right now I'd figure out a way to make it impossible to write contradictions.

As an added bonus, it's easy to keep existing grammars working: just define default operators, or write the current meanings in define() notation and provide a chunk of boilerplate to place at the top of the existing grammars.

I find space sensitivity and negation to be analagous to the * and + operators: they affect parsing success. They are also not strictly necessary for being able to represent all PEGs, but they are important for the convenience they provide. I think these are fundamentally different from the other operators that determine the format of the parser's output. As such, it might be asking a bit much to expect them to conform to the same syntax/semantics as other operators.

Of course, we could try anyways:

define ($) {consume Spacing before} & {consume Spacing after}
define (@) {consume :Spacing before} & {consume :Spacing after} & {negate}

PhilippeSigaud commented 12 years ago

What I mean by semantic actions: most of this can be (and is!) I implemented as a standard ParseTree-to-ParseTree function. So it can be put in a grammar as an action:

Rule <- Sub1 { discardChildren } Sub2 { keep, fuseMatches }

At one time, I even imagined getting rid of most operators and using only semantic actions. This has a some advantages:

easier for a newcomer to see what a grammar does
easier composability: a list of action is a composition SubRule { a, b, c} returns c(b(a(SubRule(input))).
easier to extend: as with any semantic action, define your own in D
very powerful: the entire D language is accessible! (that means, even C or whatever).

The only drawback is readability, and that's why I kept operators.

You know, most DSL (and Pegged is a DSLà end up adding more and more stuff, piling things upon a formerly simple syntax, forgetting their initial goal. Heck, you could say Perl is a DSL gone bad. Regexen with all their extensions are slightly ridiculous to my eyes.

PhilippeSigaud commented 12 years ago

This is also why I stopped parameterized rules where they are. I could add tuple parameters (Rule(A, B...) <-) but that would mean adding indexing. I could add constraints (Rule(A,B) if (some Condition) <-) but that would mean adding hald of a D parser inside Pegged.

We have to remember Pegged is embedded in D. We have the entire power od D accessible through simple external code.

chadjoan commented 12 years ago

I see what you mean now.

Maybe it would work well to allow the grammar to define operators in terms of semantic actions?

One thing that's bugging me with this is that "dropping" as a semantic action probably involves cutting nodes from an already-allocated tree. I'm kind of hoping that eventually pegged will be able to prevent undesired nodes from ever existing in the first place, possibly to the point where an entire parse tree can be lazily evaluated. Taken to the extreme, the ideal is to allow something like a SAX (xml) parser to be written in Pegged and have no-memory-required parsing (maybe some memory for memoization, if used). Of course, I'm not sure if this is one of your goals, but it seems like something that would be important for faster parsing and a greater acceptance of Pegged in the D community as a blessed tool for doing code manipulation.

PhilippeSigaud commented 12 years ago

One thing that's bugging me with this is that "dropping" as a semantic action probably involves cutting nodes from an already-allocated tree. I'm kind of hoping that eventually pegged will be able to prevent undesired nodes from ever existing in the first place, possibly to the point where an entire parse tree can be lazily evaluated.

And how would you know if parsing is successful? How would you know what other (further down the input) nodes are without exploring them in the exact order the grammar define them? Laziness is possible, but not without accepting a parse attempt might suddenly fail (and thus, you should not be used by other functions).

Taken to the extreme, the ideal is to allow something like a SAX (xml) parser to be written in Pegged and have no-memory-required parsing (maybe some memory for memoization, if used). Of course, I'm not sure if this is one of your goals, but it seems like something that would be important for faster parsing and a greater acceptance of Pegged in the D community as a blessed tool for doing code manipulation.

Code manipulation is done at compile-time. I don't think we are at a point where compile-time action can be a no-memory action.

chadjoan commented 11 years ago

I just realized that I didn't follow this up ;)

I'm going to close this (I think whitespace has been beating fairly well at this point) and open a better-named issue.

dlang-community / Pegged

Whitespace woes. #68