frankier closed this pull request 5 years ago.
(Removed some trace output and an unused index -> token map/list.)
I don't think the confusion network search is up to scratch with the rest of this patch (it's not entirely clear how best to do it). I'm removing it from the PR and leaving my current version in this comment as a motivating example of treating the iter object as a pointer. (I'd also be interested to hear if anyone knows a better way of doing this.)
```python
from copy import copy


def conf_net_search(auto, conf_net, elem_id_fn=lambda x: x):
    """
    Searches a confusion network (encoded as an iterable of iterables) with
    an Aho-Corasick Automaton (ACA). It does this by keeping several
    pointers into the ACA. Pointer uniqueness is maintained.

    Theoretically, we can remove dominated pointers, which are redundant.
    Given some pointer which has a route r_1 from the start node, it is
    dominated by a pointer with route r_2 from the start node if r_1 is a
    suffix of r_2 and r_2 is longer than r_1. So if we have pointers and
    routes like so:

        start->a->b->c->pointer 1
        start->b->c->pointer 2
        start->c->pointer 3

    then pointers 2 and 3 are dominated by pointer 1, and pointer 3 is
    dominated by pointer 2. This means that all pointers apart from
    pointer 1 are redundant.

    Currently this isn't fully utilised. Instead, the root is removed if
    there are any other pointers, which is the trivial instance of this
    case.
    """
    root = auto.iter(())
    root_id = root.pos_id()
    auto_its = [root]
    for opts in conf_net:
        # Don't add the root pointer to begin with
        seen_auto_its = {root_id}
        next_auto_its = []
        # We can get duplicates with the current scheme, so filter
        elem_ids = set()
        elems = []
        # Save the current root to ensure the right character index
        cur_root = None
        for auto_it in auto_its:
            for opt in opts:
                new_auto_it = copy(auto_it)
                new_auto_it.set((opt,))
                for elem in new_auto_it:
                    # Another pointer already reached this state, so its
                    # matches have already been collected
                    if new_auto_it.pos_id() in seen_auto_its:
                        break
                    elem_id = elem_id_fn(elem)
                    if elem_id not in elem_ids:
                        elem_ids.add(elem_id)
                        elems.append(elem)
                if new_auto_it.pos_id() not in seen_auto_its:
                    seen_auto_its.add(new_auto_it.pos_id())
                    next_auto_its.append(new_auto_it)
                elif new_auto_it.pos_id() == root_id:
                    cur_root = new_auto_it
        yield from elems
        # If we end up with nothing, add back the root
        if not next_auto_its:
            next_auto_its.append(cur_root)
        auto_its = next_auto_its
```
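For reference, here is roughly how I'd expect it to be called. This is only a sketch: it assumes the iterator API from this patch (copyable search iterators with `set()` and `pos_id()`, accepting token tuples), which isn't in mainline pyahocorasick, and the automaton contents are made up for illustration.

```python
import ahocorasick

auto = ahocorasick.Automaton()
for word in ("he", "she", "her", "hers"):
    auto.add_word(word, word)
auto.make_automaton()

# A confusion network: each position offers alternative characters, so the
# network below encodes 16 possible strings ("sher", "shes", "hers", ...).
conf_net = [("s", "h"), ("h", "e"), ("e", "r"), ("r", "s")]

# Yields each match reachable along some path through the network; here
# "he", "she", "her" and "hers" are all reachable.
for match in conf_net_search(auto, conf_net):
    print(match)
```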
@frankier Thanks a lot for these two PRs!
IMO it would be better to make two separate PRs and merge them one after another. For the first PR (#88) I'd love to see some unit tests. :) Once you split it, I bet we can move forward quickly with this PR.
I'm also thinking about #90: does it interfere with these two PRs in any way?
Hi. This PR doesn't conflict with #88 or #90, and I think #88 already has unit tests? I'll close this for now and resubmit cleaned-up patches (with docs and tests) in stages later.
So this is kind of a bad PR because it's two in one, but I developed them at the same time and there is a small amount of interdependence. If you want one thing but not the other, I can create a new PR. (I can also add proper unit tests and docs at the same time, after discussion.)
The TokenAutomaton is essentially the idea mentioned in https://github.com/WojciechMula/pyahocorasick/issues/27. It's mostly just a convenience wrapper, hence it is written in Python. I don't think it should be too slow after building, since it really just adds an extra dict lookup per token.
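In case it helps the discussion, here is the rough shape of the token-wrapper idea as a from-scratch sketch against the released pyahocorasick API. It is not the code in this PR, and all the names are illustrative:

```python
import ahocorasick


class TokenAutomaton:
    """Wraps a character Automaton so it can match sequences of tokens."""

    def __init__(self):
        self._auto = ahocorasick.Automaton()
        self._token_to_char = {}

    def _encode(self, tokens):
        # The one extra dict lookup per token: each unseen token is
        # assigned a fresh private-use code point. (Fine for a sketch;
        # a real version would need more than 6400 distinct tokens.)
        return "".join(
            self._token_to_char.setdefault(
                tok, chr(0xE000 + len(self._token_to_char))
            )
            for tok in tokens
        )

    def add_word(self, tokens, value):
        self._auto.add_word(self._encode(tokens), value)

    def make_automaton(self):
        self._auto.make_automaton()

    def iter(self, haystack_tokens):
        # Tokens first seen here get fresh code points, so they simply
        # never match. End indices refer to token positions.
        return self._auto.iter(self._encode(haystack_tokens))


ta = TokenAutomaton()
ta.add_word(("new", "york"), "NYC")
ta.make_automaton()
print(list(ta.iter(("i", "love", "new", "york"))))  # [(3, 'NYC')]
```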
The second bit is kind of two parts:

- conf_net_search

See the code and docstrings within for more info.
Here is the code I've been using for testing: