WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
927 stars 122 forks source link

Is there a way to somehow group characters into one node? #106

Closed kootenpv closed 2 years ago

kootenpv commented 5 years ago

I've been thinking about this for the case in which I want to find word endings, or if I want to enable a very simple numerical pattern.

I thought to group 0-9 as a single character. I mean, I could sacrifice a small amount of memory but having a simple numerical pattern if we were to say:

replace 0-9 in both needle and haystack with 0, so to use add_word like:

a = pyahocorasick.Automaton()
a.add_word("num_postfix_0")      # to match a number with no additional cost
a.add_word("num_postfix_00")     # to match numbers with length 2
...
a.add_word("num_postfix_00000")  # to match numbers with length 5

For numerical patterns up to length 10 this would then only cost me 10x memory (which is tbh very acceptable compared to 10**10 if I wouldn't group them). Basically looking for a way to only require 1 node in the trie rather than 10 to represent 0-9.

However, I obviously cannot do this replacement in a fast way in CPython land.

This would enable some nice things, and also help me with the word boundaries if I were able to "group characters".

Can you imagine a way to "group a bunch of characters" in the trie dict? I thought maybe an additional layer of a dict that would check group membership... but I don't know.

In case this illustration helps:

group_dict = {"0": "0", "1": "0", "2": "0"} # all numbers become "0"
for x in 'abc12':
    x = group_dict.get(x, x)  # fallback if not part of a group
    # usual trie matching

EDIT: I guess we could replace all numbers with 0 to at least not have to give up *.

WojciechMula commented 5 years ago

Hi, you can use list of integers instead of strings. You just need to transliterate (transcode) your strings in a way that numbers represent either a concrete character or class. Documentation about this feature is almost non-existing... sorry for that, it will be improved.

Here is a sample code illustrating the idea:

import ahocorasick

DIGIT = 1000
PUNCT = 2000

def transliterate(s):
    r = []
    for c in s:
        if c in '0123456789':
            r.append(DIGIT)
        elif c in ',.!?':
            r.append(PUNCT)
        else:
            # assume ord(c) != DIGIT and PUNCT
            r.append(ord(c))

    return tuple(r)

A = ahocorasick.Automaton(ahocorasick.STORE_ANY, ahocorasick.KEY_SEQUENCE)

A.add_word(transliterate("cat5"), True);
A.add_word(transliterate("dog!"), True);
A.make_automaton()

s = "user cat7 told user dog, that cat3 is a dog."

for item in A.iter(transliterate(s)):
    index, _ = item
    print(s)
    print(' ' * index + '^')

And output:

user cat7 told user dog, that cat3 is a dog.
        ^
user cat7 told user dog, that cat3 is a dog.
                       ^
user cat7 told user dog, that cat3 is a dog.
                                 ^
user cat7 told user dog, that cat3 is a dog.
                                           ^

Is this what you meant?

kootenpv commented 5 years ago

@WojciechMula Thanks a lot for the example! It's indeed a complete example of a solution.

But now the thing is of course that transliterating using CPython code like that will bomb the performance... so I was wondering if there is a way we can use the C speed for that :D?

WojciechMula commented 5 years ago

Then your transliteration code should be written in C :) And I'm afraid it cannot be handled inside module, as everybody would like to have different schemas of transliterations.

Have you tried pyp (https://pypy.org/) or cython (https://cython.org/)? I have no experience with neither of them, but have heard that they can boost scripts significantly.

pombredanne commented 2 years ago

@kootenpv I am closing this now as I think this was answered more than nicely. Feel free to reopen if you still have an issue. Thank you!