Thank you for letting me know about the problem. I wasn't aware that Ruby strings had index caching. Can you please explore this possible solution further?
I find this frankly rather hilarious, given this discussion I had with Matz in 2002, over 15 years ago now! https://groups.google.com/d/embed/msg/comp.lang.ruby/XcKQSG0MfRk/-WdVw3ZMmLMJ
I'm interested in looking into this, too. Profiling with ruby-prof shows that 50% of parsing time is spent indexing into strings, mostly to compare the substring to a regexp.
$ ruby --version
ruby 2.6.2p47 (2019-03-13 revision 67232) [x86_64-darwin17]
$ bundle list treetop
/Users/greg/.rvm/gems/ruby-2.6.2/gems/treetop-1.6.10
I hadn't heard about the quadratic issue with UTF-8 strings, which would definitely help to explain this. Other approaches I was considering to help with this:
regexp.match?(s, offset, length)
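(For reference, what Ruby currently ships is the two-argument form Regexp#match?(str, pos), with no length parameter; and pos is a character offset, so for a UTF-8 string it still pays the cost of counting characters to find the byte offset. A small illustration:)

rx = /foo/
s  = 'é' * 500_000 + 'foo'
rx.match?(s, 500_000)   # pos is a character index; for UTF-8 input Ruby still
                        # has to scan to find the corresponding byte offset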
I repeat, this is a Ruby problem, not a Treetop one. I told Matz how to fix it SEVENTEEN YEARS AGO, and he still hasn't bothered, so I'm certainly not going to.
@cjheath I'm interested in working together to explore a way to work around this, if possible. I don't know enough about Ruby's strings to know how to tackle the underlying issue. I'm willing to put some effort into this.
While I understand the root cause is Ruby, it does have a significant impact on Treetop users. Would it be possible to keep this issue open to discuss ideas and approaches? I would appreciate your insight into the underlying issue.
Ok, happy to re-open it, though I'm uncertain how we can fix it.
The issue is that when processing an encoding by counting variable-sized units (UTF-8 chars) there is often a need to count bytes. The "remembered point" technique that I used and requested that Matz implement remembers the total string length (after the first time it has been counted!) but also remembers the offset and character number of the last-accessed position. Because you can scan forwards or backwards in UTF-8 strings, that gives the runtime three different possible starting places (start, last-used, end) and based on those counts, it can decide which is the best place to index from. Very often that will be a small distance from the last-used point, e.g. when you have linear processing in either forward or backward direction. It's not rocket science, but it makes a big difference in total time spent searching.
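To make the idea concrete, here is a rough, purely illustrative pure-Ruby sketch of the remembered-point bookkeeping (the real fix would have to live inside Ruby's C-level String code; the class and method names here are invented):

# Illustrative sketch only: remember the byte offset of the last character
# position accessed, and convert the next character index to a byte offset
# by scanning from whichever known point (start, remembered point, or end)
# is closest.
class RememberedPoint
  def initialize(string)
    @string    = string
    @bytes     = string.bytesize
    @chars     = string.length   # counted once, then remembered
    @last_char = 0               # character index of the remembered point
    @last_byte = 0               # byte offset of the remembered point
  end

  # Byte offset of character index char_index (0 <= char_index <= length).
  def byte_offset(char_index)
    from_start = char_index
    from_last  = (char_index - @last_char).abs
    from_end   = @chars - char_index

    byte =
      if from_last <= from_start && from_last <= from_end
        scan(@last_byte, @last_char, char_index)
      elsif from_start <= from_end
        scan(0, 0, char_index)
      else
        scan(@bytes, @chars, char_index)
      end
    @last_char, @last_byte = char_index, byte
    byte
  end

  private

  # Walk one character at a time from a known (byte, char) pair to target_char,
  # forwards or backwards, skipping UTF-8 continuation bytes (0b10xxxxxx).
  def scan(byte, char, target_char)
    step = target_char >= char ? 1 : -1
    until char == target_char
      byte += step
      byte += step while (b = @string.getbyte(byte)) && (b & 0xC0) == 0x80
      char += step
    end
    byte
  end
end

In the worst case a scan from the nearest known point is no better than what Ruby does today, but for the mostly-sequential access pattern of a parser the scans stay short.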
Bear in mind that the input is not necessarily always a simple String. Because of a design error made very early on, some users (like my activefacts-cql parser) have to use a Delegate or duck-type for the String. Relying on it being a real String could be a breaking change for some users. We can't, for example, break the string into N fragments (each fragment being a copy, not a sub-string), and we shouldn't have to do that anyhow.
Any other suggestions?
Thanks for the details, very interesting.
I think documenting this for users is crucial. I've been frustrated by this performance problem for a few months now, and understanding the issue let me solve it for my use case in a few hours. It has reduced parse times for some documents from 24 seconds to 3 seconds. I have another document that took an hour to parse which I have yet to test.
My use case is parsing plain text UTF-8 documents into XML. In this case, my workaround is to escape non-ASCII chars using URL-style %-encoding, force the string down to ASCII-8bit, parse it, and then un-escape the resulting XML back to UTF-8.
I can do this because my parser only uses ASCII literals and I trust the output that I get back from it. Other use cases may be different, but something similar may be possible.
Here's my code:
require 'uri'

# Use %-encoding to escape everything outside the US-ASCII range we keep,
# including '%' itself. The list below is the *safe* set; the negated
# character class then matches everything else. The '\u%04x' strings become
# \uXXXX escapes that Regexp.new interprets.
safe   = (0..126).to_a - ['%'.ord]
safe   = safe.map { |i| '\u%04x' % i }
unsafe = Regexp.new('[^' + safe.join + ']')

# Escape unsafe chars; this returns a US-ASCII encoded string
text = URI::DEFAULT_PARSER.escape(text, unsafe)
xml = xml_from_tree(@parser.parse(text))

# Unescape, returning a UTF-8 encoded string
URI::DEFAULT_PARSER.unescape(xml)
Another option is to memoize the substring lookups. On an example parse I've done, of 417875 substring lookups, 196296 (47%) are unique and the remainder occur more than once and would benefit from memoizing. This could be useful if a grammar has significant backtracking.
A third option is to document that users can provide their own implementation of has_terminal?
by including a module into their grammar.
module MyModule
  def has_terminal?(terminal, mode, index)
    # custom implementation
  end
end

grammar Foo
  include MyModule

  rule start
    ...
  end
end
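As a purely illustrative sketch of what such a module might do (assuming the included module ends up ahead of Treetop::Runtime::CompiledParser in the generated parser's ancestors, so that super reaches the stock implementation), it could, for example, memoize terminal checks:

module MyModule
  # Hypothetical example: memoize terminal checks, which might pay off in a
  # grammar that backtracks over the same positions repeatedly. The cache
  # would need clearing between parses if the parser object is reused.
  def has_terminal?(terminal, mode, index)
    @_terminal_memo ||= {}
    key = [terminal, mode, index]
    if @_terminal_memo.key?(key)
      @_terminal_memo[key]
    else
      @_terminal_memo[key] = super
    end
  end
end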
Finally, documenting that users can override their input string's []
method to provide their own substring implementation if necessary.
You might be able to make it work. But remember, a memo is a String, not the Delegate or whatever is ducking for the input. So care would be needed. Also, has_terminal? takes index, which it uses to... index the input. Whatever else it does, it has to look at the text.
BTW, the "design error" I referred to was in passing around "input" instead of "parser" as a parameter down the descent. input is available from parser, but not vice versa. To add a fourth parameter would have slowed things down (and still been a breaking change), so we left it as-is and I added a "parser" method to my input Delegate.
Using this workaround, a document that previously took an hour to parse, now takes 1 minute 55 seconds.
Yes, the square law is like that. How big is the document? We should document that Ruby has an issue indexing large UTF-8 strings, but doing much more than that is pointless. I'll happily accept a documentation PR.
The document is 1.2MB. I'll submit a PR.
I did some testing with memoizing the substring lookup and it seems it's not really worth it. For a 100KB file it reduces parse times for my grammar from 16 seconds to 14 seconds. For the same input forced to ascii-encoding, it increased parse times from 3 seconds to 5 seconds.
def text.[](*args)
  # Cache each substring lookup, keyed by the arguments passed to String#[]
  @_memo ||= {}
  x = @_memo[args]
  x = @_memo[args] = super if x.nil?
  x
end
Memoizing might not help, but the same technique could be used to record caller locations and frequency to see if there are any redundant calls being made. Also, we could evaluate whether taking a substring returns a reference to the original string data or a copy, because if it's a reference, we could use that to reduce the length of each index search for the byte offset of a given character. I don't expect to make these experiments anytime soon, but you might get inspired.
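As a purely illustrative sketch of that instrumentation idea (reusing the singleton-method trick above; text and @parser are the same hypothetical objects as in the earlier snippet):

# Count how often each call site asks for each substring, then report the
# most frequent (call site, arguments) pairs after a parse.
def text.[](*args)
  @_call_stats ||= Hash.new(0)
  site = caller_locations(1, 1).first
  @_call_stats[["#{site.path}:#{site.lineno}", args]] += 1
  super
end

@parser.parse(text)   # as in the earlier workaround snippet

stats = text.instance_variable_get(:@_call_stats) || {}
stats.sort_by { |_, count| -count }.first(10).each do |(site, args), count|
  puts "#{count}  #{site}  #{args.inspect}"
end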
I wish I felt motivated enough to make a PR for Ruby's core String support...
I am totally new to treetop and grammars, and all I understand here is some problem with variable length string encoding. So have you thought of making the problematic operation work on UTF-16 strings internally? Maybe I am totally missing the point though. Just free association based on text fragments I caught in your discussion.
@kadarmarton No that wouldn't help. UTF-16 is still a variable-length encoding, so you can't just index to get characters by position, you need to count. What would help is if Matz had taken the advice I gave him in 2002 about how to implement UTF-8 strings with a "remembered point", which avoids most cases of having to count from the start of the string every time you index. Sigh. Plus ça change, plus c'est la même chose...
So what is missing in Ruby that treetop would need to be efficient?
This doesn't sound like much to add to Ruby.
@cabo If you think so, I suggest you do it.
Personally I think there's quite a lot more to it, and convincing the community that it's in everyone's interests is even harder.
When parsing a UTF-8 string, treetop's run time is quadratic with respect to the input size. When parsing BINARY ("ASCII-8BIT"), the run time is mostly linear. That behavior doesn't come up in small test cases and may surprise users.
(The reason is that treetop heavily uses indexing into a String instance, and that is generally linear in UTF-8 and constant-time in BINARY. Maybe there is a way to get the index caching in the Ruby String implementation to work for treetop, but it doesn't work for me.)
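A minimal, illustrative way to observe the per-index cost in isolation (absolute timings will vary by Ruby version and machine; the point is how the UTF-8 times grow with n while the BINARY times stay flat):

require 'benchmark'

# Index repeatedly near the end of the input, roughly what a parser does late
# in a document, and compare UTF-8 against BINARY.
[10_000, 100_000, 1_000_000].each do |n|
  utf8   = 'é' * n                                   # multibyte UTF-8 data
  binary = utf8.dup.force_encoding(Encoding::BINARY)
  [utf8, binary].each do |s|
    i = s.length - 2   # character index near the end of the string
    t = Benchmark.realtime { 1_000.times { s[i, 2] } }
    printf("%-10s n=%9d  %8.2f ms\n", s.encoding, n, t * 1000)
  end
end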