WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
929 stars 122 forks

Return found key in Automaton.iter() #82

Closed Pehat closed 1 year ago

Pehat commented 6 years ago

Hello! Thank you for the great library. I need to find multiple keys, stored in the automaton, in an input string. The number of keys is big and they overlap significantly, so the automaton greatly reduces memory consumption. However, I need to get all (position, key) pairs from the string. The documentation suggests storing keys in the associated value, but this approach nullifies all the benefits of storing the keys in the automaton. Is there any way to get the found item's key from Automaton().iter()?

WojciechMula commented 6 years ago

The Aho-Corasick automaton is a simple DFA, thus while visiting nodes we would need to save somewhere the whole path which led to the current, accepting state. At first glance it seems pretty complicated, but I haven't had time to check the idea in detail. Maybe you, or somebody else, can point to a paper which describes such an application?

frankier commented 5 years ago

If you use STORE_LENGTH you can combine the end position and the length to slice the matched portion out of the haystack.

EDIT: Although a convenience method which does this might be nice.

Pehat commented 5 years ago

> The Aho-Corasick automaton is a simple DFA, thus while visiting nodes we would need to save somewhere the whole path which led to the current, accepting state. At first glance it seems pretty complicated, but I haven't had time to check the idea in detail. Maybe you, or somebody else, can point to a paper which describes such an application?

I have briefly read the Wikipedia article on Aho-Corasick. If I understand it right, your DFA looks something like this: [image: Aho-Corasick DFA]

This looks like a special case of a DFA: every node has exactly one parent. Therefore, if the DFA is in an accepting state, we can trace the whole path back up to the root and assemble the key (in reversed order) by simply following parent links. Please tell me if I am making a mistake here.

frankier commented 5 years ago

Hi Pehat,

Why can't you trace back in the string being searched, like I suggested?

Pehat commented 5 years ago

> Hi Pehat,
>
> Why can't you trace back in the string being searched, like I suggested?

Hi frankier,

I need to store associated Python objects in the trie, so I can't use this option.

frankier commented 5 years ago

How about storing a tuple (length, current_value) as the value in the trie? It should use less space than storing the key string too. Does that solve your space problem?

WojciechMula commented 5 years ago

@Pehat I'm not sure a simple backtrace would do the job, because an accepting node might be reached via different paths. This is why I suggested that the matched string must be built during the DFA walk.

@frankier's solution seems OK.

WojciechMula commented 5 years ago

I was thinking about this and to my knowledge it's simply not possible without extra memory overhead.

When you are at a given node and going down the trie, you're simply appending letters to the word. But if you need to go back via a fail link, you must remove k trailing letters from the already collected string. The value k is unknown; it would have to be saved alongside each fail link during automaton construction.

Another option (IMO easier and more reliable) would be to allow traversing from any node back to the root node. But this requires storing a pointer from each node to its parent.
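The parent-pointer idea can be sketched in plain Python. This is an illustrative toy, not pyahocorasick's internal C representation; the Node class and reconstruct_key function are made-up names:

```python
class Node:
    """Toy trie node that remembers its parent, so the key can be
    rebuilt by walking back up to the root."""

    def __init__(self, char="", parent=None):
        self.char = char          # edge label leading into this node
        self.parent = parent      # link back toward the root
        self.children = {}

    def child(self, char):
        # get-or-create a child, wiring up the parent pointer
        if char not in self.children:
            self.children[char] = Node(char, self)
        return self.children[char]

def reconstruct_key(node):
    # Walk parent links to the root; characters come out reversed.
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return "".join(reversed(chars))

root = Node()
node = root
for ch in "hers":
    node = node.child(ch)

print(reconstruct_key(node))  # hers
```

The cost this illustrates is exactly the one mentioned above: one extra pointer per node, for a structure that is already big.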

WojciechMula commented 5 years ago

So, it seems to be doable at the C level. But I don't want to increase memory consumption, as the structure is already big; I'd rather shrink nodes.

An option would be to enable this feature with a pre-processor definition, but then it must somehow be resolved at pip install time (I don't know how; any ideas?)

kootenpv commented 5 years ago

I'm not sure whether it is possible to use less memory in C, but I could also imagine that in CPython we could do:

a.add_word("some_word", ("some_word", "some_normalized_value"))

Maybe the lib could take care of doing this implicitly, returning (end_index, key, value) behind the scenes, or even (start, end, key, value).

Actually, I find myself adding the length using:

word = "word"
automaton.add_word(word, (len(word), "value"))

So that at match time, when I get (end_index, (length, value)), I can do:

end_index - length + 1

to get the starting position and extract the match.

All the postprocessing I do makes using this library 2x slower for me (even with my CPython code as optimized as I can make it), trading speed for usability.

I wish I knew how to write C so I could help :(

pombredanne commented 1 year ago

I think @frankier's approach is the simplest. I use it too in https://github.com/nexB/scancode-toolkit/blob/0aa964e00d9655f27049e958813d1789edbcf13d/src/licensedcode/match_aho.py#L64

@Pehat you wrote:

> The documentation suggests storing keys in the associated value, but this approach nullifies all the benefits of storing the keys in the automaton.

then you wrote:

> I need to store associated Python objects in the trie, so I can't use this option.

These two statements may be conflicting? IMHO, since you are already storing a lot, just store the extra indexes too and reconstruct the original strings that way. Use a tuple, or a list of tuples, or whatever fits. Just store your objects together with the pointers you need to get back the keys.

I am closing this now, but please reopen if you feel strongly about it!