alllex / parsus

Parser-combinators with Multiplatform Kotlin Coroutines
https://javadoc.io/doc/me.alllex.parsus/parsus-jvm/latest/index.html

Suggestion: Sort literal tokens by literal length #21

Closed: RodneyMKay closed this issue 12 months ago

RodneyMKay commented 1 year ago

With the current behavior, the following code throws an unexpected EOF error:

val testGrammar = object : Grammar<TokenMatch> {
    val single by literalToken("<")
    val double by literalToken("<<")

    override val root = single or double
}

testGrammar.parse("<<")

In this example, the second token is always unreachable. The user can easily work around this by swapping the definitions of single and double, like so, since the lexer currently goes through the literal tokens in the order they are defined:

val testGrammar = object : Grammar<TokenMatch> {
    val double by literalToken("<<")
    val single by literalToken("<")

    override val root = single or double
}

testGrammar.parse("<<")

My proposal would be to sort the literal tokens by their length before attempting to match them. If the lexer attempts to match longer tokens first, it will always produce the correct result, and the user can no longer define literal tokens that are unreachable during lexing.
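
As a rough sketch of that idea (standalone Kotlin, not Parsus internals; LiteralSpec is a made-up stand-in for however the lexer stores its literal tokens):

// Hypothetical sketch: order literal tokens so that longer literals are tried first.
data class LiteralSpec(val text: String)

fun prioritizeByLength(literals: List<LiteralSpec>): List<LiteralSpec> =
    // sortedByDescending is stable, so equally long literals keep their declaration order
    literals.sortedByDescending { it.text.length }

fun main() {
    val declared = listOf(LiteralSpec("<"), LiteralSpec("<<"))
    println(prioritizeByLength(declared).map { it.text }) // prints [<<, <]
}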

I'm not sure whether you intended for users to be able to control the parse order through the order of the definitions. One drawback of this suggestion is that it could introduce performance overhead, since users could no longer put more frequently occurring tokens at the top of their grammar to speed up matching for those tokens. A compromise could be a more elaborate algorithm that moves unreachable tokens in front of their reachable counterparts when the lexer is initialized. This would still let people control most of the parse order and would only eliminate the unreachable tokens.

The ultimate decision is not straightforward, but I personally think that the usability benefits for people new to the library outweigh the drawbacks for experienced users.

alllex commented 1 year ago

This is a valid problem. Though I think sorting only the literal tokens is a partial solution at best, if it is one at all.

If grammars consisted solely of literal tokens, this would probably work. However, as soon as regex-based tokens come into play, length loses its meaning as a priority measure. Furthermore, Parsus allows defining algorithmic tokens (hand-written token parsers) whose matching logic is arbitrary code.

In a more practical sense, accidentally defining a regex-based token regexToken("[a-z]+") before a literal token literalToken("oops") that the regex would also match leads to the same problem. A similar thing can happen with a pair of regex-based tokens.
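
For instance, following the shape of the grammar from the original report (a sketch to show the declaration-order issue only, not the exact API surface):

val shadowGrammar = object : Grammar<TokenMatch> {
    // Declared first, so the lexer tries it first, and it also matches "oops"
    val word by regexToken("[a-z]+")
    val oops by literalToken("oops")

    override val root = oops or word
}

// With declaration-order lexing, "oops" is always tokenized as a `word` token,
// so the `oops` branch of the root parser can never match.
shadowGrammar.parse("oops")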

Unreachable tokens are most likely a signal of a bug in the grammar definition. There is usually no need to declare redundant tokens. It would be more helpful to point the user to the unreachable tokens, if they can be detected.

I was thinking about this, and the current solution I have in mind is to do a best-effort validation of the grammar on initialization, failing if there are unreachable tokens. This validation would be enabled by default, with the ability to opt out, just in case. Additionally, the token-creating functions should take an extra parameter that lets users override a token's priority without forcing them to rearrange class members. Effectively, users would be able to define the sorting order themselves.
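
As a rough illustration of what the literal-token part of such a validation could look like (a standalone sketch assuming declaration-order lexing; regex-based and algorithmic tokens would need a different, necessarily incomplete, analysis):

// Best-effort check covering literal tokens only.
// With declaration-order lexing, a literal is unreachable if an earlier
// literal is a prefix of it: the earlier one always wins at the same position.
fun findUnreachableLiterals(literalsInDeclarationOrder: List<String>): List<String> =
    literalsInDeclarationOrder.filterIndexed { index, literal ->
        literalsInDeclarationOrder.take(index).any { earlier -> literal.startsWith(earlier) }
    }

fun main() {
    println(findUnreachableLiterals(listOf("<", "<<"))) // [<<]
    println(findUnreachableLiterals(listOf("<<", "<"))) // []
}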

This is very similar to what you describe as a more "elaborate algorithm". The algorithm itself would stay simple, so that it remains straightforward to users, but users would get more control.

alllex commented 1 year ago

Another approach would be to let the parser dictate the tokenization.

This means there would be no separate tokenization process, lazy or otherwise. Instead, we would just try to parse whatever token the parser expects next.

This way, users would be able to control priority at the use-site. In your example, that means putting the double before the single:

override val root = double or single
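
Applied to the grammar from the original report, the fix then lives entirely at the use-site (assuming the parser-driven tokenization described above, where declaration order no longer decides which token wins):

val testGrammar = object : Grammar<TokenMatch> {
    val single by literalToken("<")
    val double by literalToken("<<")

    // The parser asks for `double` first, so "<<" is matched as one token.
    override val root = double or single
}

testGrammar.parse("<<")

The declaration order of single and double stays as in the original report; only the use-site order changes.
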
alllex commented 1 year ago

Implementing the latter approach proved to be easier than expected, though I have not yet considered the performance implications.

Given the other benefits, I am seriously considering switching to it for the mainline.

https://github.com/alllex/parsus/pull/23

RodneyMKay commented 12 months ago

That's actually really cool! I thought about solving the issue this way as well, but figured that was way outside the scope of this library. Well, turns out I was wrong. Thank you very much for your work! ♥️

PS: I'ma close this issue as I'd consider it solved (: