drhagen / parsita

The easiest way to parse text in Python
https://parsita.drhagen.com/
MIT License
98 stars 5 forks source link

Predicate parser drops failures of inner parser when predicate is false #86

Closed drhagen closed 1 year ago

drhagen commented 1 year ago

In Parsita 1, the pred parser discards all errors from a successful run of the inner parser if the predicate turns out to be false. (The predicate failure is the only failure.) This is not how Parsita promises to work. It promises to always report the furthest error, which is usually what the user wants. This matters when one parser parses a prefix of another parser (basically all parsers of mathematical expressions). The short parser succeeds, the long parser fails. This is normally fine because, when the overall parser fails to consume the full input, the long parser's message is displayed because it made more progress before the ultimate failure.

Below is a parser (clearly a subset of a full expression parser) that demostrates this unexpected behavior.

from parsita import *

def is_static(expression):
    if isinstance(expression, str):
        return False
    if isinstance(expression, int):
        return True
    elif isinstance(expression, list):
        return all(is_static(e) for e in expression)

class PredParsers(ParserContext):
    name = reg(r"[a-zA-Z]+")
    integer = reg(r"[0-9]+") > int
    function = name >> "(" >> repsep(expression, ",") << ")"
    expression = function | name | integer
    static_expression = pred(expression, is_static, "static expression")

print(PredParsers.static_expression.parse("1"))
# Success(1)
print(PredParsers.static_expression.parse("sin(0)"))
# Success([0])
print(PredParsers.static_expression.parse("a"))
# Expected static expression but found 'a'
# Line 1, character 1
# a
# ^
print(PredParsers.static_expression.parse("f(a,1)"))
# Expected static expression but found 'f'
# Line 1, character 1
# f(a,1)
# ^
print(PredParsers.static_expression.parse("f(2,1)"))
# Success([2, 1])
print(PredParsers.static_expression.parse("f(2,1"))
# Expected static expression but found 'f'
# Line 1, character 1
# f(2,1
# ^

This last error message shows the problem. The farther error is clearly the issue, but the early success takes precedence. This is easy to fix in Parsita 1; it's just a missing .merge(status):

class PredicateParser(Generic[Input, Output], Parser[Input, Input]):
    def __init__(self, parser: Parser[Input, Output], predicate: Callable[[Output], bool], description: str):
        super().__init__()
        self.parser = parser
        self.predicate = predicate
        self.description = description

    def consume(self, reader: Reader[Input]):
        remainder = reader
        status = self.parser.consume(remainder)
        if isinstance(status, Continue):
            if self.predicate(status.value):
                return status
            else:
                return Backtrack(remainder, lambda: self.description)#.merge(status)
        else:
            return status

Sure enough, this makes pred report the farthest error.

print(PredParsers.static_expression.parse("f(2,1"))
# Expected ',' or ')' but found end of source

However, this now breaks the pred parser error reporting. Because the pred parser fails at the start of the input, the inner parser always makes more progress than pred so any way that the parser could be extended becomes the farther error. The pred parser error message is actually hard to trigger and, in fact, cannot be triggered by static_expressions in the example at all.

print(PredParsers.static_expression.parse("a"))
# Expected '(' but found end of source

This was discovered because Parsita 2 fixes the original bug (because merges no longer need to be manually applied) and I was relying on pred reporting an error at least sometimes. The expected behavior of pred could be this:

There is a question here on whether or not failures that tie the predicate parser should be superseded by the predicate parser. For example,

print(PredParsers.static_expression.parse("a"))
# Expected '(' or static expression but found end of source  # allowing ties
# Expected static expression but found end of source  # superseding ties

For this particular parser, superseding ties is clearly better, but it is difficult to convince myself that superseding ties is always the correct behavior for pred. This would be the only parser that supersedes other failures. While unusual, it kind of makes sense given that pred applies a post-processing step and therefore acts as a kind of checkpoint on the parser.

In the same vein, there is also a question of whether or not the predicate parser should clear any errors generated by the inner parser at the end of the input it consumed.

Ultimately, the pred parser may not be completely sane because it is not really a parser, but a post-processor of parsed input. I have only used the pred parser where I should probably be using a type checker on top of the AST, but that was too much work.