Closed: drhagen closed this 6 years ago
I grabbed some large JSON-encoded samples and ran some benchmarks on them using `examples.json.JsonParsers.value.parse`.
# Chew leading whitespace only: 2.16 s for 1000 zips
# Chew trailing whitespace only: 2.03 s for 1000 zips
# Chew both whitespace sides: 2.12 s for 1000 zips
# Chew leading whitespace only: 21.2 s for 10000 zips
# Chew trailing whitespace only: 19.9 s for 10000 zips
# Chew both whitespace sides: 21.0 s for 10000 zips
# Chew leading whitespace only: 4.95 s for 100 world_banks
# Chew trailing whitespace only: 4.70 s for 100 world_banks
# Chew both whitespace sides: 4.87 s for 100 world_banks
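As a quick check on the relative differences, the timings listed above can be compared directly. This is only arithmetic on the numbers in this comment. The dictionary layout and the helper name `percent_faster` are made up for illustration, and it assumes the 21.0 s figure belongs to the 10000-zips run.

```python
# Timings in seconds, copied from the benchmark list above.
timings = {
    "1000 zips": {"leading": 2.16, "trailing": 2.03, "both": 2.12},
    "10000 zips": {"leading": 21.2, "trailing": 19.9, "both": 21.0},
    "100 world_banks": {"leading": 4.95, "trailing": 4.70, "both": 4.87},
}


def percent_faster(slow, fast):
    """Return how much less time `fast` takes, as a percentage of `slow`."""
    return 100.0 * (slow - fast) / slow


for name, t in timings.items():
    both_vs_leading = percent_faster(t["leading"], t["both"])
    trailing_vs_both = percent_faster(t["both"], t["trailing"])
    print(
        f"{name}: both is {both_vs_leading:.1f}% faster than leading, "
        f"trailing is {trailing_vs_both:.1f}% faster than both"
    )
```

The differences come out between roughly 1% and 2% for both-sides versus leading-only, and between roughly 3.5% and 5% for trailing-only versus both-sides.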
It looks like chewing whitespace from both sides will be about 1-2% faster than the current method of only chewing from the front. However, switching to chewing trailing whitespace only would be roughly 3-5% faster than that. I have suspected that chewing trailing whitespace may be faster because then the whitespace is only chewed once when something like `a | b | c` is used, rather than chewed at the start of each alternative. Chewing both sides is slightly faster than the current state because, even though it chews from the front, there is never any whitespace to actually chew.
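The point about alternatives can be illustrated with a toy model that counts how often the whitespace regex runs. This is a minimal sketch, not Parsita's implementation: `chew`, `literal_leading`, `literal_trailing`, `alternatives`, and `count_chews` are hypothetical names, and the parsers are plain functions that return the new position or `None`.

```python
import re

WHITESPACE = re.compile(r"\s*")


class Counter:
    """Counts how many times the whitespace regex is applied."""

    def __init__(self):
        self.chews = 0


def chew(source, pos, counter):
    """Skip whitespace starting at pos and record that the regex ran."""
    counter.chews += 1
    return WHITESPACE.match(source, pos).end()


def literal_leading(text):
    """Literal parser that chews whitespace before trying to match."""
    def parse(source, pos, counter):
        pos = chew(source, pos, counter)
        if source.startswith(text, pos):
            return pos + len(text)
        return None
    return parse


def literal_trailing(text):
    """Literal parser that chews whitespace only after a successful match."""
    def parse(source, pos, counter):
        if source.startswith(text, pos):
            return chew(source, pos + len(text), counter)
        return None
    return parse


def alternatives(*parsers):
    """Try each parser in order at the same position (like a | b | c)."""
    def parse(source, pos, counter):
        for parser in parsers:
            end = parser(source, pos, counter)
            if end is not None:
                return end
        return None
    return parse


def count_chews(make_literal, source):
    """Parse source with a | b | c and return (end position, regex runs)."""
    counter = Counter()
    parser = alternatives(make_literal("a"), make_literal("b"), make_literal("c"))
    return parser(source, 0, counter), counter.chews


print(count_chews(literal_leading, "c "))   # (1, 3): one chew per alternative
print(count_chews(literal_trailing, "c "))  # (2, 1): one chew after the match
```

In this model the leading strategy runs the whitespace regex once for every alternative that is tried, while the trailing strategy runs it once, after the alternative that matches. A trailing-only design would still need to chew once at the very start of the input, which this sketch leaves out.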
I am going to go with chewing whitespace from both sides because it (a) cleanly solves the `eof` problem and any future `bof` problem, (b) is faster than the current design, (c) is not greatly slower than the fastest design, (d) simplifies parsers that combine sections with and without whitespace, and (e) simplifies the internals related to `options.parse_method` quite a bit, which I did not mention before.
Given this parser:
This succeeds as expected:
However, this should probably not fail:
It fails because `eof` merely looks to see if the parser is at the end of the source. An extra space before the end causes a failure. Other parsers in the `TextParsers` context chew up any leading whitespace before trying to match, which `eof` does not do. I see three solutions:

1. Make `eof` context sensitive. This would probably mean turning `eof` into a function `eof()` because some code would have to run at definition time in order to grab the context. Then `eof` would know what whitespace was and chew it up before testing for the end of the input.
2. Make `TextParsers` chew whitespace from both ends. I have not done this because I have suspected that this would cause performance problems by adding an extra regex comparison to every step. But it would almost always stop on the first character, so I should get some actual performance numbers. This would also make combining parsers with different whitespace constraints (like the JSON parser) smoother because it would eliminate the need for manually chewing the whitespace.
3. Only `eof` is affected by this whitespace issue, so ensuring that whitespace is always consumed from the end rather than the beginning should fix it without introducing new problems. Of course, the problem would reemerge if a "start of input" parser was added.
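The failure mode, and the effect of chewing whitespace before testing for the end of the input, can be sketched with a toy model. This is not Parsita's code: `literal`, `eof_strict`, and `eof_chewing` are hypothetical stand-ins written against plain strings and positions, assuming whitespace is described by a regex.

```python
import re

WHITESPACE = re.compile(r"\s*")


def literal(text, source, pos, whitespace=WHITESPACE):
    """Chew leading whitespace, then match text; return new pos or None."""
    pos = whitespace.match(source, pos).end()
    return pos + len(text) if source.startswith(text, pos) else None


def eof_strict(source, pos):
    """Succeed only if pos is exactly at the end of the source."""
    return pos == len(source)


def eof_chewing(source, pos, whitespace=WHITESPACE):
    """Chew whitespace first, then test for the end of the source."""
    pos = whitespace.match(source, pos).end()
    return pos == len(source)


source = "Hello "                    # note the trailing space
pos = literal("Hello", source, 0)    # position just after "Hello"
print(eof_strict(source, pos))       # False: the space is still unread
print(eof_chewing(source, pos))      # True: the space is chewed first
```

If every parser chewed trailing whitespace as well, `pos` would already be at the end of the source when the end-of-input check runs, so even the strict check would succeed without knowing anything about whitespace.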