erikrose / parsimonious

The fastest pure-Python PEG parser I can muster
MIT License

How do I capture identifier string sans reserved words #156

Open ravivedula opened 4 years ago

ravivedula commented 4 years ago

Hi,

I am trying to parse a language with a set of reserved words and identifiers, just like C or C++, and I am running into a strange issue. My expression rules are as follows:

    identifier = !(reserved) ~r"[a-zA-Z_][a-zA-Z_0-9\?]*"
    reserved = "if" / "while" / "case" ...
    while_expr = "while" "(" identifier ...

Now, when parsing the following text:

    while ( x = 1 ) whilevar = 10

the parser is supposed to recognise the string "whilevar" as an identifier. Instead, it expects a "(" after the "while" in "whilevar", based on the "while_expr" rule.

Am I defining the identifier expression rule incorrectly?

Or is there a way to specify precedence to complete the identifier expression rule before it attempts to complete other expression rules?

I have spent quite a bit of effort defining the whole grammar, and it works fine except for this. I am really stuck here: any identifier that starts with a reserved word as a prefix is not parsed correctly.

Kindly respond ASAP.

Thanks, Ravi

MrTomKimber commented 4 years ago

I've found myself cheating when faced with this problem - what I do is build a library of reserved keywords (if, while, case, etc) and define some unlikely-to-be-reproduced version of them (ΔIFΔ, ΔWHILEΔ, ΔCASEΔ ) and perform a pre-process global file-replace (ensuring whitespace or some kind of delimiter is in play). I tend to use unicode-Greek characters as a personal preference (and because you're less likely to see these out in the wild) but that choice is up to you.

This way, your reserved keywords are explicitly matched and easily identified without there being (so much) danger of an accidental clash. It's also helpful because I can keep a separate list of reserved keywords/functions outside of the grammar, and update that list more easily should it change.

In your example the text the parser would finally read would look like:

ΔWHILEΔ ( x = 1 ) whilevar = 10

Your grammar file for reserved words would be some regex that looks for your reserved pattern - like:

reserved = ~"\Δ[\w]+\Δ"

the end result being there'd be less chance of a clash.

This is probably less elegant a solution than what's possible, but it ended up saving me some time.

I'd like to hear/see any more parser/grammar-centric solutions as they'd feel more pure. But maybe a pre-processor is one way to approach this kind of issue?
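A minimal sketch of the pre-processing step described above, using only the standard library. The keyword list and the helper name `mark_keywords` are assumptions made for illustration; the Δ markers follow the convention from this comment. Word boundaries (`\b`) are what keep the "while" inside "whilevar" untouched.

```python
import re

# Hypothetical keyword list, kept outside the grammar as suggested above.
RESERVED = ("if", "while", "case")

# \b on both sides means only whole words are rewritten, so "whilevar" is safe.
_KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in RESERVED) + r")\b"
)


def mark_keywords(source: str) -> str:
    """Wrap each reserved word in Δ markers and upper-case it."""
    return _KEYWORD_RE.sub(lambda m: "Δ" + m.group(0).upper() + "Δ", source)


marked = mark_keywords("while ( x = 1 ) whilevar = 10")
print(marked)
```

A blind textual replace like this would also rewrite keywords inside string literals or comments, and `\b` treats "?" as a word break, so a language that allows "?" in identifiers would need a stricter boundary test.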

erikrose commented 4 years ago

I would take advantage of Parsimonious' infinite lookahead. This is one of the great advantages of PEGs. Further up in your grammar, where you describe the statements or expressions that while or whilevar can be part of, use alternation to try one, and then, if that fails, fall through to the other:

expression = while_expr / assignment

Yes, this will find the "while", expect a "(", but then not find it and backtrack, next trying assignment. That sort of strategy should get you what you want without any preprocessing.

ravivedula commented 4 years ago

Thanks Tom and Erik for the detailed replies. I wish I had received these last year. In the language I am trying to parse, the while keyword is always followed by "(", so changing the reserved word from "while" to "while(" worked:

    identifier = !non_idstr ~"[a-zA-Z][a-zA-Z0-9\?]*"
    non_idstr = "while(" / "if(" / "for(" / ...
    while = "while(" cond _ stmts ")"

erikrose commented 4 years ago

Glad you figured out a solution!

beepbopbeepboop commented 1 month ago

For Lua:

identifier = ~r"(?!\b(?:and|break|do|else|elseif|end|false|for|function|goto|if|in|local|nil|not|or|repeat|return|then|true|until|while)\b)[a-zA-Z][a-zA-Z0-9]*"

This makes it absolutely impossible for goto to ever be an identifier, I think.

beepbopbeepboop commented 1 month ago

Keywords are so pervasive in grammars that there should be a doc entry on how to handle them. It would also be nice to have a slightly prettier version of the above: if a new keyword is added, it would be easy to miss updating this rule.
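One way to make the rule harder to forget, sketched under assumptions: keep the keywords in a plain Python list and generate the rule text from it. The names `LUA_KEYWORDS`, `IDENTIFIER_PATTERN`, `IDENTIFIER_RULE` and `is_identifier` are made up for this example. The generated rule uses the raw form `~r"..."` so that `\b` reaches the regex engine as a word boundary rather than as a backspace escape. The pattern is checked here with the standard `re` module only.

```python
import re

# Single source of truth for the reserved words (same list as the rule above).
LUA_KEYWORDS = [
    "and", "break", "do", "else", "elseif", "end", "false", "for",
    "function", "goto", "if", "in", "local", "nil", "not", "or",
    "repeat", "return", "then", "true", "until", "while",
]

# Negative lookahead rejects a whole-word keyword; anything else that looks
# like an identifier is accepted.
IDENTIFIER_PATTERN = r"(?!\b(?:%s)\b)[a-zA-Z][a-zA-Z0-9]*" % "|".join(LUA_KEYWORDS)

# Rule text that could be spliced into a larger grammar string.
IDENTIFIER_RULE = 'identifier = ~r"%s"' % IDENTIFIER_PATTERN

_identifier_re = re.compile(IDENTIFIER_PATTERN)


def is_identifier(text: str) -> bool:
    """Return True if the whole text is a non-keyword identifier."""
    return _identifier_re.fullmatch(text) is not None


print(IDENTIFIER_RULE)
print(is_identifier("goto"), is_identifier("gotoLabel"))
```

Adding a keyword then means touching only the list, and the rule text follows automatically.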