WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License

Add TokenAutomaton and extend iterator with methods to make it usable as a pointer like abstraction from Python #89

Closed: frankier closed this PR 5 years ago

frankier commented 5 years ago

So this is kind of a bad PR because it's two in one, but I developed them at the same time and there is a small amount of interdependence. If you want one thing but not the other, I can create a new PR. (I can also add proper unit tests/docs at the same time, after discussion.)

The TokenAutomaton is essentially the idea mentioned in https://github.com/WojciechMula/pyahocorasick/issues/27 . It's mostly just a convenience wrapper, hence it is written in Python. I don't think it should be too slow after building since it really just adds an extra dict lookup per token.
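Roughly, the wrapper just assigns each token a small integer ID and delegates to a sequence-keyed automaton. Here is a minimal sketch of that idea (this is not the code in this PR; the class name and attributes are made up for illustration, and only the existing ahocorasick API is assumed):

import ahocorasick


class TokenIndexSketch:
    """Illustrative sketch only: map hashable tokens to integer IDs and
    delegate to a sequence-keyed ahocorasick.Automaton."""

    def __init__(self):
        self.index = {}  # token -> int
        self.automaton = ahocorasick.Automaton(
            ahocorasick.STORE_ANY, ahocorasick.KEY_SEQUENCE)

    def add_word(self, tokens, value):
        # Assign a fresh ID to each previously unseen token
        ids = tuple(self.index.setdefault(tok, len(self.index)) for tok in tokens)
        self.automaton.add_word(ids, value)

    def make_automaton(self):
        self.automaton.make_automaton()

    def iter(self, tokens):
        # One dict lookup per haystack token; unseen tokens get an ID that
        # never occurs in any added word, so they can never match
        missing = len(self.index)
        ids = tuple(self.index.get(tok, missing) for tok in tokens)
        return self.automaton.iter(ids)

The indexing cost is paid while building; after make_automaton, a search is one dict lookup per token plus the usual automaton transitions, which is where the "extra dict lookup per token" above comes from.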

The second bit is kind of two parts:

  1. I want to be able to search through a haystack where each position may be one of several alternative tokens. This structure is called a confusion network. See: http://www.statmt.org/moses/img/mesh.png . I am dealing with the unweighted case. This is implemented in conf_net_search.
  2. To do this I had to enrich AutomatonSearchIter with some extra methods so that it's more directly usable as a pointer. These could be generally useful for other things (most likely combining with different types of automaton).

See the code and docstrings within for more info.
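For a concrete picture of what "usable like a pointer" means here, this is roughly the pattern that conf_net_search (further down in this thread) relies on. It only runs against this branch: copy() and pos_id() are added by this PR, while set() on the search iterator is already part of the upstream API for feeding input in chunks:

import ahocorasick
from copy import copy

a = ahocorasick.Automaton(ahocorasick.STORE_ANY, ahocorasick.KEY_SEQUENCE)
a.add_word((43, 89), (43, 89))
a.add_word((89, 64), (89, 64))
a.make_automaton()

root = a.iter(())    # a "pointer" parked at the root, with an empty haystack
ptr = copy(root)     # fork the pointer; root stays where it is
ptr.set((43,))       # advance the fork by a single token
print(list(ptr))     # drain any matches ending at this token (none yet)
ptr.set((89,))       # advance again; the key (43, 89) now ends here
print(list(ptr))
print(ptr.pos_id() != root.pos_id())  # the fork is at a different trie node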

Here is the code I've been using for testing:

import ahocorasick
from pyahocorasick import conf_net_search, TokenAutomaton
from copy import copy

# Sequence-keyed automaton over integer tokens
a = ahocorasick.Automaton(ahocorasick.STORE_ANY, ahocorasick.KEY_SEQUENCE)
a.add_word((43, 89), (43, 89))
a.add_word((43, 89, 64), (43, 89, 64))
a.add_word((89, 64), (89, 64))
a.add_word((89, 100), (89, 100))
a.make_automaton()

# Plain Aho-Corasick search over a flat token sequence
for x in a.iter((80, 80, 43, 89, 90, 89, 64, 100, 43, 89, 100)):
    print(x)

# Confusion network: each position is a tuple of alternative tokens
confs = ((80, 43), (80, 89), (64, 43), (89,), (90, 43), (89,), (64, 100), (100,), (43,), (89,), (100, 64))

print("confs")
for x in conf_net_search(a, confs):
    print(x)

# TokenAutomaton: the same kind of search over arbitrary hashable tokens
a2 = TokenAutomaton()
a2.add_word("big pig fig".split(), "bpf")
a2.add_word("pig fig chig".split(), "pfc")
a2.add_word("big pig".split(), "bp")
a2.add_word("big fig".split(), "bf")
a2.make_automaton()

print(a2.index.index)
print(a2.index.tokens)

for x in a2.iter("big pig chig pig fig chig big pig".split()):
    print(x)

print("confs2")
confs2 = (("big", "pig"), ("fig",), ("chig", "flig", "drig", "pig"), ("fig",), ("chig", "wig"))
for x in conf_net_search(a2, confs2):
    print(x)
frankier commented 5 years ago

(Removed some trace and unused index -> token map/list)

frankier commented 5 years ago

I don't think the confusion network search is up to the standard of the rest of this patch (it's not entirely clear how best to do it). I'm removing it from the PR and leaving my current version in this comment as a motivating example of treating the iter object as a pointer. (I'm also interested if anyone happens to know a better way of doing this.)

from copy import copy

def conf_net_search(auto, conf_net, elem_id_fn=lambda x: x):
    """
    Searches a confusion network (encoded as an iterable of iterables) with an
    Aho-Corasick Automaton (ACA). It does this by keeping several pointers into
    the ACA. Pointer uniqueness is maintained.

    Theoretically, we can remove dominated pointers, which are redundant.
    Given some pointer which has a route r_1 from the start node, it is
    dominated by a pointer with route r_2 from the start node if r_1 is a
    suffix of r_2 and r_2 is longer than r_1. So if we have pointers and
    routes like so:

    start->a->b->c->pointer 1
    start->b->c->pointer 2
    start->c->pointer 3

    Then pointers 2 and 3 are dominated by pointer 1 and pointer 3 is dominated
    by pointer 2. This means that all pointers apart from pointer 1 are
    redundant.

    Currently, this isn't fully utilised. Instead, the root is removed if there
    are any other pointers, which is the trivial example of this case.
    """
    root = auto.iter(())
    root_id = root.pos_id()
    auto_its = [root]

    for opts in conf_net:
        # Don't add the root pointer to begin with
        seen_auto_its = {root_id}
        next_auto_its = []
        # We can get duplicates with the current scheme, so filter
        elem_ids = set()
        elems = []
        # Save the current root to ensure the right character index
        cur_root = None
        for auto_it in auto_its:
            for opt in opts:
                new_auto_it = copy(auto_it)
                new_auto_it.set((opt,))
                for elem in new_auto_it:
                    if new_auto_it.pos_id() in next_auto_its:
                        break
                    elem_id = elem_id_fn(elem)
                    if elem_id not in elem_ids:
                        elem_ids.add(elem_id)
                        elems.append(elem)
                if new_auto_it.pos_id() not in seen_auto_its:
                    seen_auto_its.add(new_auto_it.pos_id())
                    next_auto_its.append(new_auto_it)
                elif new_auto_it.pos_id() == root_id:
                    cur_root = new_auto_it
        for elem in elems:
            yield elem
        # If we end up with nothing, add back the root
        if len(next_auto_its) == 0:
            next_auto_its.append(cur_root)
        auto_its = next_auto_its
WojciechMula commented 5 years ago

@frankier Thanks a lot for these two PRs!

IMO it would be better to make two separate PRs and merge them one after another. For the first PR (#88) I'd love to see some unit tests. :) Once you split it, I bet we can move forward quickly with this PR.

I'm also thinking about #90; does it interfere somehow with these two PRs?

frankier commented 5 years ago

Hi. This PR doesn't conflict with #88 or #90, and I think #88 already has unit tests? I'll close this for now and resubmit cleaned-up patches (with docs and tests) in stages later.