Lexical state support - Githubissues

ccleve commented 3 years ago

JFlex has nice support for controlling lexical state. I assume that Flex does as well. In JFlex you call yybegin(int state) to start a new state, and then any rules that are wrapped by the state will get invoked:

%%

<MYSTATE> {
  {myrule} { only gets recognized if state==MYSTATE }
}

Go here and search for "lexical state": https://jflex.de/manual.html

This is really useful. Does PEG or PackCC have a similar concept?

arithy commented 3 years ago

"Lexical state" does not fit in PEG parsing due to the reasons below:

PEG parsers perform lexical analysis and syntactic analysis (in the narrow sense) simultaneously. The two phases are not distinguished.
They do not scan the text in order; they backtrack globally, out of order. So a "lexical state" cannot be clearly defined.

arithy commented 3 years ago

@ccleve, if you don't agree with the reasons above, can you share a more concrete example with me?

ccleve commented 3 years ago

Here's a contrived example. There's a query language for a full-text search engine. It can do queries like this:

category="Books" and text contains "harry and potter" and color="Black"

In this case, we want to recognize "and" as a keyword, except when it appears inside a quoted string. The normal way to handle this is to recognize the quoted string as a whole. But for a search engine, you actually have to tokenize what's in the quotes. So, outside the quotes, "and" should return an AND token, and inside the quotes it should return a WORD token.

In the past, when I used JFlex, I flipped into a QUOTED_STRING_STATE when I hit the first quote. Inside that state only WORD tokens are recognized. I flipped back to the default state on the second quote.
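For illustration, here is a minimal hand-written sketch of that state-switching idea in Python. It is not JFlex output; the state names, token kinds, and the `tokenize` helper are all assumptions made up for this example, and only "and" is treated as a keyword.

```python
import re

# Hypothetical lexical states, playing the role of JFlex states.
DEFAULT, QUOTED_STRING = "DEFAULT", "QUOTED_STRING"

# Assumed keyword table: only "and" is treated as a keyword here.
KEYWORDS = {"and": "AND"}

# Outside quotes a word stops at '='; inside quotes '=' is just text.
WORD_RE = {
    DEFAULT: re.compile(r'[^\s"=]+'),
    QUOTED_STRING: re.compile(r'[^\s"]+'),
}

def tokenize(text):
    """Return (kind, lexeme) pairs, flipping state on every double quote."""
    tokens, state, pos = [], DEFAULT, 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1
        elif ch == '"':
            # Same role as yybegin(...): enter or leave the quoted state.
            state = QUOTED_STRING if state == DEFAULT else DEFAULT
            tokens.append(("QUOTE", ch))
            pos += 1
        elif state == DEFAULT and ch == "=":
            tokens.append(("EQUALS", ch))
            pos += 1
        else:
            m = WORD_RE[state].match(text, pos)
            word = m.group()
            # Keywords are only recognized in the default state.
            kind = KEYWORDS.get(word, "WORD") if state == DEFAULT else "WORD"
            tokens.append((kind, word))
            pos = m.end()
    return tokens
```

Run on the query above, the "and" between the quotes comes back as a WORD, while the other two come back as AND.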

Another example: it's helpful to use the same lexer for both documents and queries. In document mode, we recognize words. In query mode, we recognize words, keywords, equals and parentheses. It's really helpful to be able to flip a switch and just not recognize some things.
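A rough sketch of that mode switch, again in Python and again hypothetical: the mode names, the rule tables, and the token kinds are assumptions for this example, not part of JFlex or PackCC. The caller picks the mode, and each mode simply has a shorter or longer rule list.

```python
import re

# Hypothetical rule tables: each mode lists the token kinds it recognizes,
# in priority order. "and" stands in for whatever keywords a real query
# language would define.
RULES = {
    "document": [("WORD", r"\w+")],
    "query": [
        ("KEYWORD", r"and\b"),
        ("EQUALS", r"="),
        ("LPAREN", r"\("),
        ("RPAREN", r"\)"),
        ("WORD", r"\w+"),
    ],
}

def tokenize(text, mode):
    """Tokenize text using only the rules enabled for the given mode."""
    rules = [(kind, re.compile(pattern)) for kind, pattern in RULES[mode]]
    tokens, pos = [], 0
    while pos < len(text):
        for kind, pattern in rules:
            m = pattern.match(text, pos)
            if m:
                tokens.append((kind, m.group()))
                pos = m.end()
                break
        else:
            # Skip whitespace and anything this mode does not recognize.
            pos += 1
    return tokens
```

With the same input, "document" mode yields only WORD tokens, and "query" mode additionally yields KEYWORD, EQUALS, LPAREN and RPAREN tokens.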

arithy commented 3 years ago

I understand what you want to do. As I said above, a PEG parser is not a lexer that scans text in order. To do this with PEG, which processes the input out of order, you can write the grammar as below:

...

string <- ["] word_list ["] / ["] ["]

word_list <- WORD space+ word_list / WORD

WORD <- ( '\\' ( space / . ) / !space !["] . )+

...

AND <- "and"

space <- blank / end_of_line
blank <- [ \t\v\f]
end_of_line <- '\r\n' / '\n' / '\r'
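To see why no lexical state is needed here, the string, word_list, WORD and space rules above can be transcribed by hand into small Python functions. This is only a sketch for illustration, not code generated by PackCC; the function names and the (position, words) return convention are assumptions of this example. Each function returns None on failure so the caller can fall back to the next alternative.

```python
def space(text, pos):
    # space <- blank / end_of_line
    if text.startswith("\r\n", pos):
        return pos + 2
    if pos < len(text) and text[pos] in " \t\v\f\n\r":
        return pos + 1
    return None

def word(text, pos):
    # WORD <- ( '\\' ( space / . ) / !space !["] . )+
    start = pos
    while pos < len(text):
        if text[pos] == "\\" and pos + 1 < len(text):
            # Escaped space or escaped arbitrary character.
            nxt = space(text, pos + 1)
            pos = nxt if nxt is not None else pos + 2
        elif space(text, pos) is None and text[pos] != '"':
            pos += 1
        else:
            break
    return pos if pos > start else None

def word_list(text, pos):
    # word_list <- WORD space+ word_list / WORD
    end = word(text, pos)
    if end is None:
        return None
    words = [text[pos:end]]
    after = space(text, end)
    if after is not None:
        nxt = space(text, after)
        while nxt is not None:  # consume the rest of space+
            after = nxt
            nxt = space(text, after)
        rest = word_list(text, after)
        if rest is not None:
            return rest[0], words + rest[1]
    # Backtrack to the second alternative: a single WORD.
    return end, words

def string(text, pos):
    # string <- ["] word_list ["] / ["] ["]
    if not text.startswith('"', pos):
        return None
    inner = word_list(text, pos + 1)
    if inner is not None and text.startswith('"', inner[0]):
        return inner[0] + 1, inner[1]
    if text.startswith('""', pos):
        return pos + 2, []
    return None
```

Because the string rule only ever tries WORD between the quotes, the AND rule is never consulted there, so "and" inside a quoted string is just another word.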

arithy commented 3 years ago

@ccleve, I'll close this issue since your request is technically incompatible with PEG parsers.

arithy / packcc

Lexical state support #49