ChrisHixon / chpeg

Parsing Expression Grammar (PEG) bytecode parser/compiler library
BSD 3-Clause "New" or "Revised" License
13 stars 1 forks source link

Undefined behavior under nodes where tree building inhibited. #28

Closed ChrisHixon closed 2 years ago

ChrisHixon commented 2 years ago

Adding {L} to a rule has the effect of inhibiting parse tree building under the node created for that rule. However, some things are going to have a problem with that, and probably fail in strange/undefined ways. One such case is nested back-references, which require reference scope nodes:

Grammar:

ROOT      {L} <- CONTENT !.
CONTENT       <- (ELEMENT / TEXT)*
ELEMENT   {R} <- STAG CONTENT ETAG     # reference scope node
STAG          <- '<'  < $tag<TAG_NAME> > '>'
ETAG          <- '</' < $tag > '>'
TAG_NAME      <- [a-z]+
TEXT      {L} <- (![<] .)+

Input:

<a><b>foo<c><a><b><c>foo<a><b><c><a><b>foo<c><a><b><c>foo
</c></b></a></c>foo</b></a></c></b></a></c></b></a></c></b></a>

Result:

input:2:3:      Error: Syntax error in ROOT <REF,0>

Without the {L} on ROOT, this parses ok.

This is likely not just a problem with {L} nodes, {I} also inhibits tree building.

What else might have issues related to this ... things that require or assume a node is present?

TRIM, for one. Trim operates on the top node on the stack, assuming that node is the 'current node' (its rule's node.)

ChrisHixon commented 2 years ago

@mingodad

I believe the issue you discovered here: https://github.com/ChrisHixon/chpeg/issues/22#issuecomment-1169821880 is a case of this issue (https://github.com/ChrisHixon/chpeg/issues/28)

TRIM now has the auto-leaf feature, so {L} is automatically added to any rule with TRIM.

You have one rule with TRIM, NAME, then another rule with TRIM getting called via SKIP <- COMMENT <- COMMENT_LONG <- STRING_LONG from SKIP at the end of NAME. This second TRIM is overriding because there can only be one TRIM on a node; the second TRIM isn't getting its own node to work on since tree building has been locked(disabled), so you're getting the contents of the STRING_LONG TRIM in NAME.

NAME          <- !KEYWORD <NAME_PREFIX NAME_SUFFIX?> SKIP
SKIP          <- (SPACE+ / COMMENT)*
COMMENT       <- '--' (COMMENT_LONG / COMMENT_SHRT)
COMMENT_LONG  <- STRING_LONG
STRING_LONG  <-
      '[' $lsep<'='*> '[' LINEBREAK? <( ! (']' $lsep ']') . )*> ']' $lsep ']'

Hopefully that made sense. I can't think of a workaround at the moment, other than removing one of the TRIMs.

So there really is a class of issues related to tree building being inhibited.

mingodad commented 2 years ago

That makes sense and it can happen on any non trivial grammar .

mingodad commented 2 years ago

But removing one TRIM is not an option for the whole grammar, in this case Lua grammar shown here https://github.com/yhirose/cpp-peglib/issues/229#issuecomment-1169847228.

mingodad commented 2 years ago

Meanwhile I think that chpeg should emit an error message in cases like this one, something like:

input:xx:yy: Nested trim found.
ChrisHixon commented 2 years ago

There would need to be a lot of warnings including the other cases I mention in this issue. I'd rather devote time to fixing the issue. This is an issue only with extensions that expect certain nodes to be present... TRIM and REFS (back-references). So tree 'locking', which made sense without these extensions, has to be revisited; probably a new approach is needed. Another workaround is not using TRIM or nested backreferences (global scope should be okay). It's possibe to capture what you want without TRIM but it will require changing rules, new rules, and probably more AST Options ({[SIL]}), etc.

mingodad commented 2 years ago

But even then I don't think that is nice to silently accept a grammar and output broken AST, if there is lot's of warnings/errors the user (we) should be informed.

mingodad commented 2 years ago

Also see comments here https://github.com/yhirose/cpp-peglib/issues/229#issuecomment-1170013965

mingodad commented 2 years ago

A bit related see also this issue https://github.com/yhirose/cpp-peglib/issues/230#issue-1289025524

ChrisHixon commented 2 years ago

I think this issue is fixed for TRIM with this commit: https://github.com/ChrisHixon/chpeg/commit/06d10be934783b8c4af2479ba484fdbaab52d7dc

I'm still testing this change. It could use more testing if you have a chance. The playground is updated with this fix: https://chrishixon.github.io/chpeg/playground/

mingodad commented 2 years ago

Some grammars that somehow was working before isn't now, like the nelua (https://github.com/ChrisHixon/chpeg/issues/20#issuecomment-1169645706) with this input https://github.com/edubart/nelua-lang/blob/master/lib/hashmap.nelua: Output:

Parse failed with result: 1
input:23:26: Expected: ")" in funcbody

Input line with error:

-- Ceil integer division.
local function ceilidiv(x: usize, y: usize): usize <inline> ////!!!!<<< here
  return (x + y - 1) // y
end
ChrisHixon commented 2 years ago

I did blow away the nested TRIM handling with this fix, for example:

Grammar:

A <- < 'foo' <'a'> / 'bar' <'b'>  >

Input:

barb

This currently captures b, but according to previous behavior the outer < ... > should take precedence and capture barb I think the previous behavior is more correct than just letting whatever runs last overwrite the trim values.

mingodad commented 2 years ago

What if when inside a capture then ignore any other new capture creation ?

mingodad commented 2 years ago

Probably my last suggestion isn't good because when you cache a rule you need to invalidate it just in case it has any nested TRIMs.

ChrisHixon commented 2 years ago

What if when inside a capture then ignore any other new capture creation ?

That's essentially the effect of what happened previously, though inner levels are not ignored, it pushed each level onto the stack. I'd better restore that behavior, I think there may be other issues otherwise.

Packrat should be OK, nodes cached will end up with the trimmed values. But this is another thing to test.

I haven't investigated your error above, let's see if maybe it'll be fixed after I restore the previous nested trim behavior.

ChrisHixon commented 2 years ago

Here's the nested TRIM fix: be032f2df8dbc8d33e6a179dbd96dfc65c415ebd I was unable to recreate the nelua problem, that parsed fine for me before and after this fix.

mingodad commented 2 years ago

Yes the nelua is working now for me too, but a Lua grammar (https://github.com/ChrisHixon/chpeg/issues/20#issuecomment-1171051281) that was working stopped now (I'm investigating right now). With this input https://github.com/edubart/lpegrex/blob/main/lpegrex.lua . Output:

Parse failed with result: 3
input:389:50: Expected: "\t" in NAME

Offending line:

    Suffix = l.Cf(l.V"Primary" *
            ( S * ( l.P"+" * l.Cc(1, lmt.__pow)
                  + l.P"*" * l.Cc(0, lmt.__pow)
                  + l.P"?" * l.Cc(-1, lmt.__pow)
                  + l.P"~?" * l.Cc(l.Cc(false), lmt.__add)
                  + "^" * expect( l.Cg(Num * l.Cc(mult))   ////!!!!!<<<<< Offending line
                                + l.Cg(l.C(l.S"+-" * l.R"09"^1) * l.Cc(lmt.__pow)
                                + Name * l.Cc"lab"
                                ),
                            "ExpNumName")
                  + "->" * expect(S * ( l.Cg((String + Num) * l.Cc(lmt.__div))
                                      + l.P"{}" * l.Cc(nil, l.Ct)
                                      + defwithfunc(lmt.__div)
                                      ),
                             "ExpCap")
                  + "=>" * expect(S * defwithfunc(l.Cmt),
                             "ExpName1")
                  + "~>" * S * defwithfunc(l.Cf)
                  ) --* S
            )^0, function(a,b,f) if f == "lab" then return a + l.T(b) end return f(a,b) end );
mingodad commented 2 years ago

And using only the the subset with error as input we get:

Suffix = l.Cf(l.V"Primary" *
            ( S * ( l.P"+" * l.Cc(1, lmt.__pow)
                  + l.P"*" * l.Cc(0, lmt.__pow)
                  + l.P"?" * l.Cc(-1, lmt.__pow)
                  + l.P"~?" * l.Cc(l.Cc(false), lmt.__add)
                  + "^" * expect( l.Cg(Num * l.Cc(mult))
                                + l.Cg(l.C(l.S"+-" * l.R"09"^1) * l.Cc(lmt.__pow)
                                + Name * l.Cc"lab"
                                ),
                            "ExpNumName")
                  + "->" * expect(S * ( l.Cg((String + Num) * l.Cc(lmt.__div))
                                      + l.P"{}" * l.Cc(nil, l.Ct)
                                      + defwithfunc(lmt.__div)
                                      ),
                             "ExpCap")
                  + "=>" * expect(S * defwithfunc(l.Cmt),
                             "ExpName1")
                  + "~>" * S * defwithfunc(l.Cf)
                  ) --* S
            )^0, function(a,b,f) if f == "lab" then return a + l.T(b) end return f(a,b) end );

Error:

Parse failed with result: 3
input:11:47: Expected: HEX_NUMBER in Number
mingodad commented 2 years ago

The same Lua grammar replacing {I} by ~ in front of the rule works fine on cpp-peglib.

ChrisHixon commented 2 years ago

I can verify that works on master but fails on these latest TRIM fixes. The nature of the errors may indicate something strange happening in the VM. I'll check it out it more detail.

ChrisHixon commented 2 years ago

Note the result code: Parse failed with result: 3 This is a stack overflow. Pushing 6 items per IDENT call (vs. 4 previously) blows the stack with this grammar/input. I raised the default stack size to 2048 and it parses fine. The chpeg-trace -t1 is identical with master, commit before TRIM changes, and after.

ChrisHixon commented 2 years ago

https://github.com/ChrisHixon/chpeg/commit/987432f5e42773fc1e2e5c231299e1c73c7ca260

mingodad commented 2 years ago

Yes it's working now ! Thank you so much ! Maybe it's a good idea to extend the error message with a description of the error, otherwise we end up in auto mode and ignore anything result: \d+ because it's not relevant most of the time. Something like:

Parse failed with result: 3 -> Stack overflow
ChrisHixon commented 2 years ago

Yes it's working now ! Thank you so much ! Maybe it's a good idea to extend the error message with a description of the error, otherwise we end up in auto mode and ignore anything result: \d+ because it's not relevant most of the time. Something like:

Parse failed with result: 3 -> Stack overflow

Yes that's a good idea. Also, printing the tracked parse errors makes no sense unless it's a normal parse failure ( CHPEG_ERR_PARSE_FAILED = 1)... they will be nonsensical like what we saw here.

mingodad commented 2 years ago

That's true and would call our attention when a significant error code/message appear instead of ignore by default.

ChrisHixon commented 2 years ago

So I think that takes care of part 1 of this issue. I'm thinking of doing a similar thing with REFS to remove the dependence on parse tree nodes. I think it'll actually be more like it was initially implemented but with the undo system that solves other issues. After that's done I think it would be possible to have an option to parse without building a tree.

mingodad commented 2 years ago

That's nice for syntax check only on complex grammars during the development/debug/tuning.

ChrisHixon commented 2 years ago

Have you used any PEG implementation that can do context sensitive things like back-references, or better yet, something like symbol tables, without creating a parse tree? Actually which ones have symbol tables period?

I'm running into snags with my parse-tree-free reference implementation attempt and I think I'll have to abandon it and find a different way to solve the topic issue. Issues are similar to those I've encountered previously, backtracking being the enemy. Using parse tree nodes with undo actions is easier, more elegant, packrat compatible, and I think it'll work well for future context sensitive things, treating the parse tree like a DOM of sorts. Every other approach I've attempted seems to lead to a big mess, is inefficient, or not packrat compatible, etc.

mingodad commented 2 years ago

The only one I've tried was nez but it doesn't work as advertised it's abandoned but it's ideas are interesting. The other one that I'm actively using is peg/leg from Piumarta in this fork https://github.com/mingodad/peg it's reasonable easy to work with it and performance is also good, but for grammar development cpp-peglib and chpeg are the kings right now.

ChrisHixon commented 2 years ago

I believe this commit fixes this issue: https://github.com/ChrisHixon/chpeg/commit/04d2721bb428ad0e4cb2b1d61dfa8037399b61bc

You can now place an {I} on the start rule to get no parse tree or {L} to get one node spanning the entire parse.

The previously 'locked' tree under IGNORE/LEAF is now 'restricted' to only allow nodes through that absolutely require a parse tree node for functionality (currently, only REFSCOPE). Child nodes under {I}/{L} are deleted after the fact.

mingodad commented 2 years ago

Just after reloading the playground and trying the grammar shown bellow (calc) with adding {i} to Expr seems to freeze the browser:

Expr      {I}      <- _ Sum EOF

Sum             <- Product (PLUS_MINUS Product)*
Product         <- Value (MUL_DIV Value)*
Value           <- NUMBER / ParenExpr
ParenExpr       <- LPAREN Sum RPAREN

PLUS_MINUS      <- < [+-] > _
MUL_DIV         <- < [*/] > _
NUMBER          <- < '-'? ("0" / [1-9][0-9]*) > _
LPAREN      {I} <- < '(' > _
RPAREN      {I} <- < ')' > _
_           {I} <- [ \t\r\n]*
EOF             <- !.

Input:

1 + 
    (2 * (3 + 44 / 5) - 6) 
    * 7 + 8 - 9
ChrisHixon commented 2 years ago

It seems to work fine for me in Chromium and Firefox using that grammar/input. Maybe it didn't update for you yet - try reloading page. If still fails, can you give more info, browser version, console errors, etc.

mingodad commented 2 years ago

Loading it locally after git pull I'm getting a frozen browser (chromium) on Linux64:


chpeg-playground.js:1 Uncaught RuntimeError: Aborted(Assertion failed: self->tree_root->num_children == 1, at: ./../include/chpeg_ext.h,3897,ChpegParser_parse). Build with -s ASSERTIONS=1 for more info.
    at abort (chpeg-playground.js:1:9187)
    at ___assert_fail (chpeg-playground.js:1:12748)
    at chpeg-playground.wasm:0x89d2
    at chpeg-playground.wasm:0xa257
    at Module._parse (chpeg-playground.js:1:14476)
    at ccall (chpeg-playground.js:1:4683)
    at chpeg-playground.js:1:5073
    at index.js:123:5
mingodad commented 2 years ago

But after rebuild the playground it works fine, so have you rebuilt the playground ?

ChrisHixon commented 2 years ago

I did and have been using it from https://chrishixon.github.io/chpeg/playground/ I just attempted again and nothing changed so there's nothing to commit/push. It must be GitHub's infrastructure/CDN is not updated for you.

mingodad commented 2 years ago

Yes, probably you are right because I did a refresh again and now it's working as you mention. So need to take this in account for fresh updates on github pages.

mingodad commented 2 years ago

But when I reloaded again the same console error came back, but with my fresh local build it works fine. So I'll try again latter to see if it succeed.

mingodad commented 2 years ago

It was my fault I just was testing on my old playground here https://meimporta.eu/chpeg/ because I have several tabs open and didn't noticed I'm on my outdated playground. Sorry again for the noise ! Thanks again for your great work and help !

ChrisHixon commented 2 years ago

No problem, thanks for your help testing, and finding/reporting bugs! Closing this issue.