eof does not handle whitespace

drhagen / parsita

The easiest way to parse text in Python

MIT License

Given this parser:

from parsita import TextParsers, lit, eof

class EofParser(TextParsers):
    a = lit('a') << eof

This succeeds as expected:

>>> EofParser.a.parse(' a')
Success('a')

However, this fails, and it probably should not:

>>> EofParser.a.parse('a ').or_die()
ParseError: Expected end of source but found ' '
Line 1, character 2

a
 ^

It fails because eof merely looks to see if the parser is at the end of the source. An extra space before the end causes a failure. Other parsers in the TextParsers context chew up any leading whitespace before trying to match, which eof does not do. I see three solutions:

1. Make eof context sensitive. This would probably mean turning eof into a function eof(), because some code would have to run at definition time to grab the context. Then eof would know what whitespace is and chew it up before testing for the end of the input.
2. Make TextParsers chew whitespace from both ends. I have not done this because I suspected it would cause performance problems by adding an extra regex match to every step. But the extra match would almost always stop on the first character, so I should get some actual performance numbers. This would also make combining parsers with different whitespace constraints (like the JSON parser) smoother, because it would eliminate the need to chew the whitespace manually.
3. Swap which side whitespace gets chewed from. I think only eof is affected by this whitespace issue, so ensuring that whitespace is always consumed from the end rather than the beginning should fix it without introducing new problems. Of course, the problem would reemerge if a "start of input" parser were added.
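
The failure, and the first option, can be reproduced with a toy model that uses only the standard library. This is a sketch under stated assumptions, not Parsita's implementation: WHITESPACE, lit, eof, eof_chewing, and sequence below are hypothetical stand-ins that take a source string and a position and return the new position, or None on failure.

```python
import re

# Hypothetical stand-in for the whitespace pattern of a TextParsers context.
WHITESPACE = re.compile(r"\s*")


def lit(text):
    """Build a toy literal parser that chews leading whitespace first."""
    def parse(source, position):
        position = WHITESPACE.match(source, position).end()
        if source.startswith(text, position):
            return position + len(text)
        return None  # no match
    return parse


def eof(source, position):
    """Toy eof as described in the issue: it only checks the position."""
    return position if position == len(source) else None


def eof_chewing(source, position):
    """Toy eof for the first option: chew whitespace, then check for the end."""
    position = WHITESPACE.match(source, position).end()
    return position if position == len(source) else None


def sequence(first, second):
    """Run two toy parsers one after the other, starting at position 0."""
    def parse(source, position=0):
        position = first(source, position)
        return None if position is None else second(source, position)
    return parse


strict = sequence(lit("a"), eof)
lenient = sequence(lit("a"), eof_chewing)

print(strict(" a"))   # 2: leading whitespace is chewed by lit
print(strict("a "))   # None: eof sees the trailing space
print(lenient("a "))  # 2: eof_chewing skips the trailing space
```

In this model the whitespace pattern is a module-level constant. In the real library it belongs to the enclosing context, which is why the first option needs eof to capture that context at definition time.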

# Chew leading whitespace only: 2.16 s for 1000 zips
# Chew trailing whitespace only: 2.03 s for 1000 zips
# Chew both whitespace sides: 2.12 s for 1000 zips

# Chew leading whitespace only: 21.2 s for 10000 zips
# Chew trailing whitespace only: 19.9 s for 10000 zips
# Chew both whitespace sides: 21.0 s for 10000 zips

# Chew leading whitespace only: 4.95 s for 100 world_banks
# Chew trailing whitespace only: 4.70 s for 100 world_banks
# Chew both whitespace sides: 4.87 s for 100 world_banks
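
The benchmark that produced these numbers is not shown here. For a rough feel of what the extra whitespace match costs, something like the following could be timed with the standard library. It is only a sketch: tokens_leading, tokens_both, and the sample input are made up for illustration, and the toy tokenizer is far simpler than a real parser, so it will not reproduce the figures above.

```python
import re
import timeit

WHITESPACE = re.compile(r"\s*")
WORD = re.compile(r"\w+")


def tokens_leading(source):
    """Chew whitespace before each token only."""
    position, found = 0, []
    while True:
        position = WHITESPACE.match(source, position).end()
        match = WORD.match(source, position)
        if match is None:
            return found
        found.append(match.group())
        position = match.end()


def tokens_both(source):
    """Chew whitespace before and after each token."""
    position, found = 0, []
    while True:
        position = WHITESPACE.match(source, position).end()
        match = WORD.match(source, position)
        if match is None:
            return found
        found.append(match.group())
        # The extra match; the next leading chew then stops immediately.
        position = WHITESPACE.match(source, match.end()).end()


# Made-up input, not the zips or world_banks data from the issue.
sample = "alpha beta  gamma\n" * 1000

for function in (tokens_leading, tokens_both):
    seconds = timeit.timeit(lambda: function(sample), number=20)
    print(f"{function.__name__}: {seconds:.3f} s")
```

Both functions return the same tokens. Only the number of whitespace matches per token differs, which is the cost the second option would add.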

eof does not handle whitespace #8