andrewcooke / ParserCombinator.jl

A parser combinator library for Julia
Other
107 stars 20 forks source link

Trie-based matcher for `Alt(Equals.(a_long_list)...)` #20

Open oxinabox opened 7 years ago

oxinabox commented 7 years ago

Hi, In the grammar I am writing, there are a few places where I need to match against one of a large selection of constants. Which I will call a_long_list, it might have 20 elements, it might have 200. In theory, I'm sure there are use cases for matching against one of thousands, or tens of thousands.

Alt(Equals.(a_long_list)...) works. But I understand that it will be O(n*m) for n the length of the list, and m the maximum length of any element of that list Which honestly isn't too bad, I think.

But I figure it can be done better. If a Trie is used, I think this becomes just O(m). I might be screwing up my math here, but I think that in the process of finding the longest match, one automatically finds all the shorter matches, which can be saved for if backoff is required. So there is no need to re-step through the source, if one fails.

I've started working on this, I have the Trie stuff working to return what strings match, I just need to like it into a matcher, with the trampoline stuffs, Success/Fail/Execute.

What I currently propose is a matcher:

EqualsOneOf(values; greedy::Bool=true)

Where values are the values that it could be equal to.
If greedy is false then this matches shortest-first. And if it has to back-off, then gives the second shortest string in values that matches, etc. This is the natural order for a Trie.

If greedy is true, then it matches longest-first (the longest string that is in values first), and then if that needs to be backedoff from, match's the second longest, and so forth. The is accomplished by collecting, all the values that match as Trie keys, and then reverseing the order. (Which i guess does make this O(n+m))

This is a bit less expressive than Alt(Equals.(a_long_list)...) since that lets you choose a priority for the matchers, not just longestfirst or shortestfirst

What do you think? When I have it working, should I make a PR?

One issue is that right now, Tries only support strings. https://github.com/JuliaLang/DataStructures.jl/issues/220

andrewcooke commented 7 years ago

hi. i think that makes sense. if tries can only support strings then the matcher would need to have a more restricted type, but i don't see why it wouldn't still work for strings (but honestly it's been ages since i last used this library, or even julia). if you submit a pr i will certainly try to include it (but can't promise a fast response).

oxinabox commented 7 years ago

Finally came back to this yesterday, and finished my prototype. It took me a long time to get my head around your code. (Things you call iter I would call iter_state)

Prototype is at https://github.com/oxinabox/ColoringNames.jl/blob/master/proto/TrieMatchers.ipynb

I want to make it lazy in the case of shortest first matching. But that means not using Task based generator expressions.

I probably won't poke this code again for another month or so.

Right now it is very tied to strings. And even onces the requirement on Tries is relaxed, it is going to take some work to make if functional for things that are not index/view-able.