Closed lucaswiman closed 2 years ago
This PR has been updated to include @righthandabacus's changes in #148 and my less thorough refactoring has been removed. I also fixed a bug from that PR and added some extra tests / documentation. I'll leave this for a day and merge if there are no requests for changes.
@lonnen @erikrose Fixes #173. Fixes #171. Fixes #139.

This PR fixes two bugs I've found with token grammars:

- `*` and `+` do not work at all, due to a bug where they run off the end of the list of tokens, leading to an `IndexError`. This works for strings because `.startswith` returns false for an index beyond the size of the string.

Changes

- I updated the test of parse errors in token grammars so that it asserts the help message in the exception. I fixed the message in `ParseError` so it displays something sensible, which fixed the test.
- I added a test of `*` and `+` for token grammars, which failed. I addressed a TODO, combining `OneOrMore`, `ZeroOrMore` and `Optional` into a single base class, also adding the ability to specify a maximum number of matches. I then fixed the bug in the implementation, preventing the parser from running off the end of the tokens list.

Notes

- `a{3,5}` syntax (matches between 3 and 5 "a" characters). It's a small amount of code, and would be useful in some cases where fields have a maximum length. In any case, IMO it should at least be available as a user extension. I'd be happy to do that in this PR or in another PR.
- That `*` and `+`, which are essential to any nontrivial grammar, were broken suggests that token grammars have never actually been used by anyone for any practical purpose. I think this may be because they're very limited, since any matching of string literals needs to be done in the lexer. This seems to be common (e.g. Lark also does this), presumably a holdover from the lex/yacc days of parsing. However, for most data formats other than programming languages, that's a questionable decision. In my attempts to use token grammars for parsing a document format called x12 (that also motivated this issue), I've found the inability to match string literals in the grammar extremely limiting. That format has configurable delimiters, which seems to require a lexer, but otherwise the format is something that an "ordinary" parsimonious grammar would be great for. I'll make a separate PR with a proposal for making `TokenGrammar` more useful in this regard.
- Defining `OneOrMore`, `ZeroOrMore` and `Optional` using `Optional = partial(Repeated, min=0, max=1)` would be more performant at parse time than the current subclassing approach. However, that would prevent users from subclassing them, which might be considered a breaking change. Using `partial` feels simpler to me, but I'm happy to do whatever is most expedient for getting these fixes released.
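
To make the `partial` idea and the bounds fix concrete, here is a minimal standalone sketch. It is not the code in this PR and does not use parsimonious's real classes: `Literal`, the `match(tokens, pos)` signature, and this toy `Repeated` are all hypothetical stand-ins, chosen only to show the shape of the approach.

```python
from functools import partial

# Why the string path never hit the bug: str.startswith tolerates a start
# offset past the end of the string, while list indexing raises IndexError.
assert "aaa".startswith("a", 10) is False
try:
    ["a", "a", "a"][10]
except IndexError:
    pass  # token lists need an explicit bounds check instead


class Literal:
    """Hypothetical single-token matcher (not parsimonious's API)."""

    def __init__(self, value):
        self.value = value

    def match(self, tokens, pos):
        # Check the bound first so we never index past the end of the list.
        if pos < len(tokens) and tokens[pos] == self.value:
            return pos + 1
        return None


class Repeated:
    """Hypothetical quantifier: match `member` between `min` and `max` times."""

    def __init__(self, member, min=0, max=float("inf")):
        self.member = member
        self.min = min
        self.max = max

    def match(self, tokens, pos):
        count = 0
        while count < self.max:
            new_pos = self.member.match(tokens, pos)
            if new_pos is None:
                break
            pos, count = new_pos, count + 1
        # Succeed only if the minimum number of repetitions was reached.
        return pos if count >= self.min else None


# The three quantifiers become thin aliases over one implementation.
ZeroOrMore = partial(Repeated, min=0)
OneOrMore = partial(Repeated, min=1)
Optional = partial(Repeated, min=0, max=1)
```

With this shape, a bounded repetition such as `Repeated(Literal("a"), min=3, max=5)` needs no extra class. The cost is the one noted above: the aliases are `functools.partial` objects rather than classes, so they cannot be subclassed or used with `isinstance` the way the current classes can.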