cjheath / treetop

A Ruby-based parsing DSL based on parsing expression grammars.
https://cjheath.github.io/treetop
MIT License

Documentation should explain when the treetop parser is quadratic #31

Closed: cabo closed this issue 5 years ago

cabo commented 6 years ago

When parsing a UTF-8 string, treetop's run time is quadratic with respect to the input size. When parsing BINARY ("ASCII-8BIT"), the run time is mostly linear. That behavior doesn't come up in small test cases and may surprise users.

(The reason is that treetop heavily uses indexing into a String instance, and that is generally linear in UTF-8 and constant-time in BINARY. Maybe there is a way to get the index caching in the Ruby String implementation to work for treetop, but it doesn't work for me.)
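
A minimal sketch that makes the effect visible (string contents and sizes are arbitrary; timings will vary):

require 'benchmark'

utf8 = "é" * 500_000                                 # two bytes per character
bin  = utf8.dup.force_encoding(Encoding::BINARY)

Benchmark.bm(7) do |x|
  # character indexing near the end of a UTF-8 string rescans from the front
  x.report('UTF-8')  { 2_000.times { utf8[499_990, 5] } }
  # in BINARY, the same index is a constant-time byte offset
  x.report('BINARY') { 2_000.times { bin[499_990, 5] } }
end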

cjheath commented 6 years ago

Thank you for letting me know about the problem. I wasn't aware that Ruby strings had index caching. Can you please explore this possible solution further?

I find this frankly rather hilarious, given this discussion I had with Matz in 2002, over 15 years ago now! https://groups.google.com/d/embed/msg/comp.lang.ruby/XcKQSG0MfRk/-WdVw3ZMmLMJ

longhotsummer commented 5 years ago

I'm interested in looking into this, too. Profiling with ruby-prof shows that 50% of parsing time is spent indexing into strings, mostly to compare the substring to a regexp.

$ ruby --version
ruby 2.6.2p47 (2019-03-13 revision 67232) [x86_64-darwin17]
$ bundle list treetop
/Users/greg/.rvm/gems/ruby-2.6.2/gems/treetop-1.6.10
(ruby-prof HTML profile screenshot attached)
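
For reference, a profile like the one above can be captured with something along these lines (assuming the @parser and input text from this discussion):

require 'ruby-prof'

result = RubyProf.profile do
  @parser.parse(text)
end

File.open('profile.html', 'w') do |f|
  RubyProf::GraphHtmlPrinter.new(result).print(f)
end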

I hadn't heard about the quadratic issue with UTF-8 strings, which would definitely help to explain this. I had also been considering some other approaches to help with it, which I describe below.

cjheath commented 5 years ago

I repeat, this is a Ruby problem, not a Treetop one. I told Matz how to fix it SEVENTEEN YEARS AGO, and he still hasn't bothered, so I'm certainly not going to.

longhotsummer commented 5 years ago

@cjheath I'm interested in working together to explore a way to work around this, if possible. I don't know enough about Ruby's strings to know how to tackle the underlying issue. I'm willing to put some effort into this.

While I understand the root cause is Ruby, it does have a significant impact on Treetop users. Would it be possible to keep this issue open to discuss ideas and approaches? I would appreciate your insight into the underlying issue.

cjheath commented 5 years ago

Ok, happy to re-open it, though I'm uncertain how we can fix it.

The issue is that when processing an encoding made of variable-sized units (UTF-8 characters), there is often a need to count bytes. The "remembered point" technique that I used, and asked Matz to implement, remembers the total string length (after the first time it has been counted!) and also remembers the byte offset and character number of the last-accessed position. Because you can scan forwards or backwards in UTF-8 strings, that gives the runtime three possible starting places (start, last-used point, end), and from those it can decide which is the cheapest place to index from. Very often that will be a small distance from the last-used point, e.g. when you process the string linearly in either direction. It's not rocket science, but it makes a big difference in total time spent searching.
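
A pure-Ruby sketch of that idea, for illustration only (the class name is hypothetical, and the backward case is simplified to recount from the start instead of choosing the nearest of the three points):

class RememberedPoint
  def initialize(str)
    @str  = str
    @char = 0   # character index of the last-accessed position
    @byte = 0   # byte offset of the last-accessed position
  end

  # Byte offset of character index +target+, scanning forward from the
  # remembered point where possible instead of from the start.
  def byte_offset(target)
    if target >= @char
      @byte += @str.byteslice(@byte..-1)[0, target - @char].bytesize
    else
      @byte = @str[0, target].bytesize   # simplified: recount from the start
    end
    @char = target
    @byte
  end

  def char_at(target)
    @str.byteslice(byte_offset(target)..-1)[0]
  end
end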

Bear in mind that the input is not necessarily a simple String. Because of a design error made very early on, some users (like my activefacts-cql parser) have to use a Delegate or a duck-type for the String. Relying on it being a real String might be a breaking change for some users. We can't, for example, break the string into N fragments (each fragment being a copy, not a sub-string) - and we shouldn't have to do that anyhow.

Any other suggestions?

longhotsummer commented 5 years ago

Thanks for the details, very interesting.

I think documenting this for users is crucial. I've been frustrated by this performance problem for a few months now, and understanding the issue let me solve it for my use case in a few hours. It has reduced parse times for some documents from 24 seconds to 3 seconds. I have another document that took an hour to parse which I have yet to test.

My use case is parsing plain text UTF-8 documents into XML. In this case, my workaround is to escape non-ASCII chars using URL-style %-encoding, force the string down to ASCII-8bit, parse it, and then un-escape the resulting XML back to UTF-8.

I can do this because my parser only uses ASCII literals and I trust the output that I get back from it. Other use cases may be different, but something similar may be possible.

Here's my code:

require 'uri'

# Build a regexp matching everything outside the US-ASCII range,
# plus '%' itself (so literal percent signs get escaped too).
safe = (0..126).to_a - ['%'.ord]
safe = safe.map { |i| '\u%04x' % i }
unsafe = Regexp.new('[^' + safe.join + ']')

# escape unsafe chars; this returns a US-ASCII encoded string
text = URI::DEFAULT_PARSER.escape(text, unsafe)

xml = xml_from_tree(@parser.parse(text))

# unescape, returns a UTF-8 encoded string
URI::DEFAULT_PARSER.unescape(xml)

longhotsummer commented 5 years ago

Another option is to memoize the substring lookups. On an example parse, of 417,875 substring lookups, 196,296 (47%) were unique; the remainder occurred more than once and would benefit from memoization. This could be useful if a grammar backtracks heavily.

A third option is to document that users can provide their own implementation of has_terminal? by including a module into their grammar.

module MyModule
  def has_terminal?(terminal, mode, index)
    # custom implementation
  end
end

grammar Foo
  include MyModule

  rule start
    ...
  end
end
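
As one hedged example of what such a module might do - assuming the compiled parser's has_terminal? remains reachable via super and the input doesn't change during the parse - a memoizing override could look like:

module MyModule
  def has_terminal?(terminal, mode, index)
    # cache results for repeated lookups at the same position,
    # which pays off under heavy backtracking
    @terminal_memo ||= {}
    key = [terminal, mode, index]
    @terminal_memo.fetch(key) { @terminal_memo[key] = super }
  end
end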

Finally, we could document that users can override their input string's [] method to provide their own substring implementation if necessary.

cjheath commented 5 years ago

You might be able to make it work. But remember, a memo is a String, not the Delegate or whatever is ducking for the input, so care would be needed. Also, has_terminal? takes index, which it uses to... index the input. Whatever else it does, it has to look at the text.

cjheath commented 5 years ago

BTW, the "design error" I referred to was in passing around "input" instead of "parser" as a parameter down the descent. input is available from parser, but not vice versa. To add a fourth parameter would have slowed things down (and still been a breaking change), so we left it as-is and I added a "parser" method to my input Delegate.
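
For the record, that wrapper can be as simple as something like this (ParserInput is a hypothetical name):

require 'delegate'

# A String stand-in that also knows its parser, so code handed only
# the input can still reach the parser.
class ParserInput < SimpleDelegator
  attr_reader :parser

  def initialize(string, parser)
    super(string)
    @parser = parser
  end
end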

longhotsummer commented 5 years ago

Using this workaround, a document that previously took an hour to parse now takes 1 minute 55 seconds.

cjheath commented 5 years ago

Yes, the square law is like that. How big is the document? We should document that Ruby has an issue indexing large UTF-8 strings, but doing much more than that is pointless. I'll happily accept a documentation PR.

longhotsummer commented 5 years ago

The document is 1.2MB. I'll submit a PR.

longhotsummer commented 5 years ago

I did some testing with memoizing the substring lookup and it seems it's not really worth it. For a 100KB file it reduced parse times for my grammar from 16 seconds to 14 seconds. For the same input forced to ASCII encoding, it increased parse times from 3 seconds to 5 seconds.

# Memoize substring lookups on this particular input string.
def text.[](*args)
  @_memo ||= {}
  x = @_memo[args]
  x = @_memo[args] = super if x.nil?
  x
end

cjheath commented 5 years ago

Memoizing might not help, but the same technique could be used to record caller locations and frequency to see if there are any redundant calls being made. Also, we could evaluate whether taking a substring returns a reference to the original string data or a copy, because if it's a reference, we could use that to reduce the length of each index search for the byte offset of a given character. I don't expect to make these experiments anytime soon, but you might get inspired.

longhotsummer commented 5 years ago

#38 updates the documentation to talk about this problem. Since this issue deals specifically with documentation, it can probably be closed.

cjheath commented 5 years ago

I wish I felt motivated enough to make a PR for Ruby's core String support...

kadarmarton commented 4 years ago

I am totally new to treetop and grammars, and all I understand here is that there is some problem with variable-length string encoding. So have you thought of making the problematic operation work on UTF-16 strings internally? Maybe I am totally missing the point, though; this is just free association based on text fragments I caught in your discussion.

cjheath commented 4 years ago

@kadarmarton No, that wouldn't help. UTF-16 is still a variable-length encoding, so you can't just index to get characters by position; you have to count. What would help is if Matz had taken the advice I gave him in 2002 about how to implement UTF-8 strings with a "remembered point", which avoids most cases of having to count from the start of the string every time you index. Sigh. Plus ça change, plus c'est la même chose... (the more things change, the more they stay the same).
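
To illustrate: characters outside the Basic Multilingual Plane take two UTF-16 code units (a surrogate pair), so character positions still can't be computed as fixed offsets:

"A".encode(Encoding::UTF_16BE).bytesize    # => 2 (one code unit)
"😀".encode(Encoding::UTF_16BE).bytesize   # => 4 (surrogate pair)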

cabo commented 8 months ago

So what is missing in Ruby that treetop would need to be efficient?

This doesn't sound like much to add to Ruby.

cjheath commented 8 months ago

@cabo If you think so, I suggest you do it.

Personally I think there's quite a lot more to it, and convincing the community that it's in everyone's interests is even harder.