drhagen / parsita

The easiest way to parse text in Python
https://parsita.drhagen.com/
MIT License

eof does not handle whitespace #8

Closed drhagen closed 6 years ago

drhagen commented 6 years ago

Given this parser:

from parsita import TextParsers, lit, eof

class EofParser(TextParsers):
    a = lit('a') << eof

This succeeds as expected:

>>> EofParser.a.parse(' a')
Success('a')

However, this should probably not fail:

>>> EofParser.a.parse('a ').or_die()
ParseError: Expected end of source but found ' '
Line 1, character 2

a
 ^

It fails because eof merely looks to see if the parser is at the end of the source. An extra space before the end causes a failure. Other parsers in the TextParsers context chew up any leading whitespace before trying to match, which eof does not do. I see three solutions:

drhagen commented 6 years ago

I grabbed some large JSON-encoded samples and ran some benchmarks on them using examples.json.JsonParsers.value.parse.

# Chew leading whitespace only: 2.16 s for 1000 zips
# Chew trailing whitespace only: 2.03 s for 1000 zips
# Chew both whitespace sides: 2.12 s for 1000 zips

# Chew leading whitespace only: 21.2 s for 10000 zips
# Chew trailing whitespace only: 19.9 s for 10000 zips
# Chew both whitespace sides: 21.0 s for 10000 zips

# Chew leading whitespace only: 4.95 s for 100 world_banks
# Chew trailing whitespace only: 4.70 s for 100 world_banks
# Chew both whitespace sides: 4.87 s for 100 world_banks
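
The relative differences between these timings can be worked out with a few lines of Python. This is only arithmetic on the numbers listed above. The dictionary layout and the `percent_faster` helper are made up for illustration, and the second group is assumed to be 10000 zips throughout.

```python
# Timings in seconds, copied from the benchmark comments above.
timings = {
    "1000 zips": {"leading": 2.16, "trailing": 2.03, "both": 2.12},
    "10000 zips": {"leading": 21.2, "trailing": 19.9, "both": 21.0},
    "100 world_banks": {"leading": 4.95, "trailing": 4.70, "both": 4.87},
}


def percent_faster(slow, fast):
    """Percent reduction in run time when going from `slow` to `fast`."""
    return 100.0 * (slow - fast) / slow


for name, t in timings.items():
    both_vs_leading = percent_faster(t["leading"], t["both"])
    trailing_vs_both = percent_faster(t["both"], t["trailing"])
    print(
        f"{name}: both sides is {both_vs_leading:.1f}% faster than leading only; "
        f"trailing only is {trailing_vs_both:.1f}% faster than both sides"
    )
```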

It looks like chewing whitespace from both sides will be about 1 to 2% faster than the current method of only chewing from the front. However, switching to chewing trailing whitespace only would be roughly another 4 to 5% faster than that. I had suspected that chewing trailing whitespace might be faster because the whitespace is then chewed only once when something like a | b | c is used, rather than once at the start of each alternative. Chewing both sides is slightly faster than the current state because, even though it still chews from the front, there is never any whitespace left there to actually chew.
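
To make the a | b | c argument concrete, here is a toy model that counts how often whitespace skipping is attempted under each strategy. It is not Parsita's implementation: `lit_leading`, `lit_trailing`, `first_of`, and `count_chews` are hypothetical helpers written only for this sketch.

```python
import re

WHITESPACE = re.compile(r"\s*")
stats = {"chews": 0}  # number of whitespace-skipping attempts


def chew(text, pos):
    """Skip whitespace starting at pos and record that an attempt was made."""
    stats["chews"] += 1
    return WHITESPACE.match(text, pos).end()


def lit_leading(word):
    """Toy literal that chews whitespace before trying to match."""
    def parse(text, pos):
        pos = chew(text, pos)
        return pos + len(word) if text.startswith(word, pos) else None
    return parse


def lit_trailing(word):
    """Toy literal that chews whitespace only after a successful match."""
    def parse(text, pos):
        if text.startswith(word, pos):
            return chew(text, pos + len(word))
        return None
    return parse


def first_of(*parsers):
    """Toy alternation: return the end position of the first parser that matches."""
    def parse(text, pos):
        for parser in parsers:
            end = parser(text, pos)
            if end is not None:
                return end
        return None
    return parse


def count_chews(make_lit, text, pos):
    """Run a | b | c at pos and report (end position, number of chews)."""
    stats["chews"] = 0
    end = first_of(make_lit("a"), make_lit("b"), make_lit("c"))(text, pos)
    return end, stats["chews"]


# Leading style: the previous token "x" left the whitespace unconsumed (pos 1),
# so each of the three alternatives chews it again.
print(count_chews(lit_leading, "x   c", 1))
# Trailing style: the previous token already chewed the whitespace (pos 4),
# so only the alternative that matches chews, and it does so once.
print(count_chews(lit_trailing, "x   c", 4))
```

In this model the leading style makes three chew attempts and the trailing style makes one, which matches the intuition behind the benchmark difference, though the real parsers do more work than this.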

I am going to go with chewing whitespace from both sides because it (a) cleanly solves the eof problem and any future bof problem, (b) is faster than the current design, (c) is not greatly slower than the fastest design, (d) simplifies parsers that combine sections with and without whitespace, and (e) simplifies the internals related to options.parse_method quite a bit, which I did not mention before.
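
As a rough illustration of point (a), the sketch below models a literal followed by an end-of-source check, once with leading-only chewing and once with chewing on both sides. It is a self-contained toy, not Parsita code: `toy_lit`, `toy_eof`, and `then` are hypothetical stand-ins for `lit`, `eof`, and sequencing.

```python
import re

WHITESPACE = re.compile(r"\s*")


def skip_ws(text, pos):
    """Return the position of the first non-whitespace character at or after pos."""
    return WHITESPACE.match(text, pos).end()


def toy_lit(word, chew_leading, chew_trailing):
    """Toy literal parser with configurable whitespace chewing."""
    def parse(text, pos):
        if chew_leading:
            pos = skip_ws(text, pos)
        if not text.startswith(word, pos):
            return None
        pos += len(word)
        if chew_trailing:
            pos = skip_ws(text, pos)
        return pos
    return parse


def toy_eof(text, pos):
    """Succeed only if pos is exactly at the end of the source; chews nothing."""
    return pos if pos == len(text) else None


def then(first, second):
    """Toy sequencing: run first, then second from where first stopped."""
    def parse(text, pos):
        mid = first(text, pos)
        return None if mid is None else second(text, mid)
    return parse


leading_only = then(toy_lit("a", True, False), toy_eof)
both_sides = then(toy_lit("a", True, True), toy_eof)

print(leading_only(" a", 0))  # leading space is chewed, so this matches
print(leading_only("a ", 0))  # trailing space is left for toy_eof, which fails
print(both_sides("a ", 0))    # trailing space is chewed by the literal itself
```

With both-sides chewing, the end-of-source check never has to know about whitespace, because the preceding parser has already consumed it.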