erikrose / parsimonious

The fastest pure-Python PEG parser I can muster
MIT License
1.79k stars 126 forks source link

Fix star token grammar #203

Closed lucaswiman closed 2 years ago

lucaswiman commented 2 years ago

@lonnen @erikrose Fixes #173. Fixes #171. Fixes #139. This PR fixes two bugs I've found with token grammars:

>>> 'aa'.startswith('a', 2)
False
>>> 'aa'[2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range

Changes

I updated the test of parse errors in token grammars so that it asserts the help message in the exception. I fixed the message in ParseError so it displays something sensible, which fixed the test.

I added a test of * and + for token grammars, which failed. I addressed a TODO, combining OneOrMore, ZeroOrMore and Optional into a single base class, also adding the ability to add a maximum number of matches. I then fixed the bug in the implementation, preventing the parser from running off then end of the tokens list.

Notes

  1. I would be interested in adding the max-matches feature to the underlying rules syntax if you are amenable. This would be similar to regex's a{3,5} syntax (matches between 3 and 5 "a" characters). It's a small amount of code, and would be useful in some cases where fields have a maximum length. In any case, IMO it should at least be available as a user extension. I'd be happy to do that in this PR or in another PR.
  2. I think the bugginess of * and +, which are essential to any nontrivial grammar, is suggestive that token grammars have never actually been used by anyone for any practical purpose. I think this may be because they're very limited, since any matching of string literals needs to be done in the lexer. This seems to be common (e.g. Lark also does this), presumably a layover from the lex/yacc days of parsing. However, for most data formats other than programming languages, that's a questionable decision. In my attempts to use token grammars for parsing a document format called x12 (that also motivated this issue), I've found the inability to match string literals in the grammar extremely limiting. That format has configurable delimiters, which seems to require a lexer, but otherwise the format is something that an "ordinary" parsimonious grammar would be great for. I'll make a separate PR with a proposal for making TokenGrammar more useful in this regard.
  3. Inheritance can have negative performance impacts, since resolving a method needs to traverse the MRO. I think defining OneOrMore, ZeroOrMore and Optional using Optional = partial(Repeated, min=0, max=1) would be more performant at parse time than the current subclassing approach. However, that would prevent users from subclassing them, which might be considered a breaking change. Using partial feels simpler to me, but I'm happy to do whatever is most expedient for getting these fixes released.
lucaswiman commented 2 years ago

This PR has been updated to include @righthandabacus's changes in #148 and my less thorough refactoring has been removed. I also fixed a bug from that PR and added some extra tests / documentation. I'll leave this for a day and merge if there are no requests for changes.