dlang-community / Pegged

A Parsing Expression Grammar (PEG) module, using the D programming language.

dgrammar doesn't support newlines #112

Open timotheecour opened 11 years ago

timotheecour commented 11 years ago

When I try to parse D source code, the parser stops at the first newline character. Of course I can replace newlines with " ", but this hack doesn't work well with newlines inside strings (it changes their semantics), and it would just be nice to parse raw D files.

callumenator commented 11 years ago

I have a fix for this at: 30a55fa209a9dfb8c57473ee39559e91150ee317 if you want to try it out.

@PhilippeSigaud: the problem seems to be related to dmd replacing \r\n with a single \n in string literals that are mixed in as code, as they are in the keywords rule. I just special-cased this for now. The trie implementation is probably similarly affected.

timotheecour commented 11 years ago

Awesome. What's the git command to pull your fix? Thanks.

callumenator commented 11 years ago

To test in a new branch, from your local pegged dir:

git checkout -b newline
git pull http://github.com/callumenator/Pegged newline

Just FYI, I don't think the dgrammar has been thoroughly tested, so it might not parse everything.

PhilippeSigaud commented 11 years ago

Thanks for the hashtag. Indeed, the D grammar is not thoroughly tested. I began to test it 1-2 months ago and then was swept away by a storm of work.

(In reply to callumenator's comment of Mon, Apr 1, 2013, 9:36 AM.)

callumenator commented 11 years ago

I was playing with it about an hour ago, and found some problems with templates, static conditionals, and cat expressions, then got depressed by how big it is. It'd be cool to get it working, but some of those chains are so long it takes ages to debug them.

timotheecour commented 11 years ago

It's easier to debug by displaying only the consuming rules (those whose begin/end positions differ from their parent's); then the chains are much shorter.
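A minimal sketch of that filtering idea, written in Python for brevity. The `Node` type and the `consuming_only` helper are hypothetical stand-ins, not Pegged's actual parse-tree type or API; the only assumption is that each node carries a rule name, a begin/end position, and children. A node whose span equals its parent's is dropped and its children are hoisted into its place.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a parse-tree node (not Pegged's real type)."""
    name: str
    begin: int
    end: int
    children: List["Node"] = field(default_factory=list)


def consuming_only(node: Node, parent_span: Optional[Tuple[int, int]] = None) -> List[Node]:
    """Return the filtered subtree(s) for `node`.

    A node is kept only if its (begin, end) span differs from its parent's.
    A dropped node is replaced by its own (already filtered) children.
    The root is always kept because it has no parent span.
    """
    span = (node.begin, node.end)
    kept_children: List[Node] = []
    for child in node.children:
        kept_children.extend(consuming_only(child, span))
    if parent_span is not None and span == parent_span:
        return kept_children  # pass-through rule: hoist its children
    return [Node(node.name, node.begin, node.end, kept_children)]


def render(node: Node, depth: int = 0) -> List[str]:
    """Format a tree as indented lines, one rule per line."""
    lines = [f"{'  ' * depth}{node.name} [{node.begin}, {node.end}]"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# A deep chain of pass-through rules around three consuming leaves.
tree = Node("Expr", 0, 5, [
    Node("AddExpr", 0, 5, [
        Node("MulExpr", 0, 5, [
            Node("Primary", 0, 1),
            Node("Op", 2, 3),
            Node("Primary", 4, 5),
        ]),
    ]),
])

(filtered,) = consuming_only(tree)
print("\n".join(render(filtered)))
```

In this example the `AddExpr` and `MulExpr` levels disappear from the printout because they cover exactly the same input as `Expr`, which is the kind of chain shortening described above.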

unrelated: how could I have guessed the above mentioned git command (branch and url) from the commit id in your first answer (https://github.com/PhilippeSigaud/Pegged/commit/30a55fa209a9dfb8c57473ee39559e91150ee317) ?

callumenator commented 11 years ago

> it's easier to debug by displaying only the consuming rules

The trouble I have is figuring out how the grammar is failing when the rules themselves are so deep, like the expression grammar. Plus the dgrammar is slow to compile, debugging is awkward.

> how could I have guessed the above mentioned git command

Yeh my bad I didn't realize github showed a commit like that on top of Philippe's repo.

timotheecour commented 11 years ago

> Yeh my bad I didn't realize github showed a commit like that on top of Philippe's repo.

Ok, that was really confusing me...

> Plus the dgrammar is slow to compile, debugging is awkward.

If you fix the grammar (i.e. are not tweaking it) and only modify the inputs to it, you can always precompile both the D grammar and, say, a visiting function that prints out the AST, so that user code doesn't have to link the grammar parsing code in.

Question to @PhilippeSigaud and @callumenator: can we use std.lexer (see https://github.com/bhelyer/std.d.lexer/blob/master/std/d/lexer.d) to tokenize the input and hence speed up parsing? (Even with precompilation, I wish the speed were higher.) That's an important enough use case...

PhilippeSigaud commented 11 years ago

Well, PEGs (Parsing Expression Grammars) are made to be scannerless, but I guess they can be modified somewhat to accept a token range as input. I don't know if lexing is the main culprit for the parsing speed, though: building the parse tree also takes quite long. Also, the D grammar is not really LALR or LL(1), and as such I'm not sure a parser generator is suited to producing a fast parser.
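To illustrate what "accepting a token range as input" could look like, here is a small sketch in Python. None of this is Pegged code: the combinators (`lit`, `seq`, `star`) and the `tokenize` helper are hypothetical, and the sketch says nothing about actual speed. It only shows that the same PEG-style rule can run over a string of characters (scannerless) or over a list of pre-lexed tokens, as long as the matcher compares whole input elements.

```python
import re
from typing import Callable, List, Optional, Sequence

# A parser maps (input, position) to the new position, or None on failure.
Parser = Callable[[Sequence, int], Optional[int]]


def lit(item) -> Parser:
    """Match one input element: a character for strings, a token for lists."""
    def parse(inp: Sequence, pos: int) -> Optional[int]:
        if pos < len(inp) and inp[pos] == item:
            return pos + 1
        return None
    return parse


def seq(*parsers: Parser) -> Parser:
    """PEG sequence: every sub-parser must match, one after the other."""
    def parse(inp: Sequence, pos: int) -> Optional[int]:
        for p in parsers:
            nxt = p(inp, pos)
            if nxt is None:
                return None
            pos = nxt
        return pos
    return parse


def star(p: Parser) -> Parser:
    """PEG zero-or-more: greedy, never fails."""
    def parse(inp: Sequence, pos: int) -> Optional[int]:
        while True:
            nxt = p(inp, pos)
            if nxt is None or nxt == pos:
                return pos
            pos = nxt
    return parse


def tokenize(src: str) -> List[str]:
    """Hypothetical toy lexer: words and single punctuation marks, whitespace dropped."""
    return re.findall(r"\w+|[^\w\s]", src)


# Sum <- "a" ("+" "a")*
sum_rule = seq(lit("a"), star(seq(lit("+"), lit("a"))))

chars_end = sum_rule("a+a+a", 0)               # scannerless: input is characters
tokens_end = sum_rule(tokenize("a + a + a"), 0)  # token input: whitespace already removed
spaced_end = sum_rule("a + a", 0)              # scannerless rule without whitespace handling

print(chars_end, tokens_end, spaced_end)
```

The last call stops after the first `a`, because a scannerless grammar has to handle whitespace in its own rules, while the token version gets that for free from the lexer. Whether this would make a generated D parser noticeably faster is a separate question, since tree building remains.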

After all, the fast D parsers we have were all produced by hand, and not generated automatically.

Sorry for the lack of activity on Pegged for the past 2 months, but I'm really neck-deep in work right now.

(In reply to Timothee Cour's comment of Tue, Apr 2, 2013, 10:15 AM.)