WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License

Feature request: do not allow certain characters as boundaries #95

Closed · kootenpv closed this issue 2 years ago

kootenpv commented 5 years ago

I'd like it to pretty much mimic what you would get from regex with \b.

Is there a way we can easily achieve this currently?

Would it be possible to specify certain "disallowed" or "required" boundary characters?

WojciechMula commented 5 years ago

There's no such option. You cannot specify which characters are disallowed around a matched substring.

A workaround might be to add all combinations of word + valid_suffix and valid_prefix + word to the automaton.

But I guess it would be easier to post-process the results.
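For reference, a minimal sketch of such post-processing (not a pyahocorasick feature; the helper name is mine, and the isalnum() test only approximates regex \b since it ignores the underscore). It assumes each word is stored as its own value so the match length is known:

import ahocorasick

def iter_whole_words(automaton, haystack):
    # Yield (start, end, word) only when the match is bounded by non-word characters.
    def is_boundary(i):
        # Positions outside the string, or non-alphanumeric characters, count as boundaries.
        return i < 0 or i >= len(haystack) or not haystack[i].isalnum()

    for end, word in automaton.iter(haystack):
        start = end - len(word) + 1
        if is_boundary(start - 1) and is_boundary(end + 1):
            yield start, end, word

A = ahocorasick.Automaton()
for w in ("app", "apple"):
    A.add_word(w, w)          # store the word itself as the value
A.make_automaton()

print(list(iter_whole_words(A, "The apple is red")))
# [(4, 8, 'apple')] -- the embedded "app" match is filtered out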

kootenpv commented 5 years ago

I'm doing the post-processing now, but I still get many hits, and I suspect that offloading this to C would make it much, much faster.

I also already tried adding all the combinations, but with every possible prefix/suffix pair I ended up with a multiplication factor of 35*35 = 1225. That doesn't scale memory-wise for the automaton (and it even took a couple of seconds just to build it).

I was using this set:

>>> import string
>>> PUNC = set(string.punctuation + "\n" + "\t" + " ")
>>> PUNC
{'\t', '\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'}
>>> for p1 in PUNC:
...     for p2 in PUNC:
...         for word in words:
...             automaton.add_word(p1 + word + p2)

Note that the memory offset gives me about a 30% speed boost, but I feel this could easily be gained just from yielding only valid matches from C.
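One way to avoid the 35*35 blow-up might be this sketch (my own assumption, not a recommendation from the author): add each word once with a single space sentinel on both sides, and normalise the haystack so every boundary character becomes a space. TO_SPACE and the word list are illustrative:

import string
import ahocorasick

PUNC = set(string.punctuation + "\n\t ")
TO_SPACE = str.maketrans({c: " " for c in PUNC})   # map every boundary char to a space

A = ahocorasick.Automaton()
for word in ("app", "apple"):
    A.add_word(" " + word + " ", word)             # one padded entry per word
A.make_automaton()

# Pad the haystack too, so words at the very start/end still match.
haystack = " " + "The apple, is red".translate(TO_SPACE) + " "
print([value for _, value in A.iter(haystack)])
# ['apple']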

austrevicius commented 4 years ago

Hello, I also have a question. All the search examples I can find use single words. If I have a string with multiple words, for example "Green Street Cafe", is it possible to search for that whole string rather than the individual words it contains? I did pad the keyword tuple with spaces, but it still returns some unwanted results, like the word "for". Thank you!

WojciechMula commented 4 years ago

@austrevicius I'm not sure I understand your question correctly. An automaton searches only for words it knows, so the word "for" must have been added earlier. Likewise, if you add the sentence "Green Street Cafe" to the automaton, that whole sentence will be matched.

It seems the docs are a bit misleading: we use the term "word", while it should be "substring". The automaton doesn't know anything about character classification (i.e. whitespace, letters, etc.), so any string is a valid "word".

bch80 commented 4 years ago

I'd second @kootenpv. For my current application, I'd like to avoid finding substrings: in "The apple is red", app should not be found, but apple should be.

Maybe this would be possible with a simple parameter to the iter method?

anakin87 commented 4 years ago

I think that post-processing the results can sometimes produce errors.

For example, if the index contains the words Giorgia and di Giorgi and the string is La posizione di Giorgia è critica, iter_long matches di Giorgi, which must be excluded since it's not a whole word; Giorgia is never found.

What do you think about this? Is there a workaround for this issue?
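A small reproduction of this case as I understand it (sketch only; the output comments are what I would expect from the documented behaviour of iter/iter_long):

import ahocorasick

A = ahocorasick.Automaton()
for key in ("Giorgia", "di Giorgi"):
    A.add_word(key, key)
A.make_automaton()

text = "La posizione di Giorgia è critica"

# iter_long keeps the leftmost longest match and skips anything overlapping it,
# so "di Giorgi" wins and "Giorgia" is never reported:
print([key for _, key in A.iter_long(text)])   # ['di Giorgi']

# plain iter() reports every match, so a word-boundary filter applied
# afterwards (as sketched earlier in this thread) can still recover "Giorgia":
print([key for _, key in A.iter(text)])        # ['di Giorgi', 'Giorgia']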

Dobatymo commented 4 years ago

I am not sure Aho–Corasick is the right approach for this type of problem. Usually you would go with the common full-text search approach of a tokenizer plus an inverted index, imo.
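For illustration only, a rough sketch of that tokenizer idea (plain Python rather than pyahocorasick; the regex token pattern and the token-window size are my assumptions):

import re

keys = {"apple", "green street cafe"}
max_tokens = max(len(k.split()) for k in keys)

def find_whole_words(text):
    # Tokenise on word characters, then test every short token window against the keys.
    tokens = re.findall(r"\w+", text.lower())
    for i in range(len(tokens)):
        for n in range(1, max_tokens + 1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in keys:
                yield candidate

print(list(find_whole_words("The apple is red near the Green Street Cafe.")))
# ['apple', 'green street cafe']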

pombredanne commented 2 years ago

For the current application, I'd like to avoid finding substrings. "The apple is red" app should not be found. apple should be found.

Maybe this would be possible with a simple parameter in the iter method?

This has been implemented with the iter_long method ... See https://pyahocorasick.readthedocs.io/en/latest/#iter-long-string-start-end
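For readers landing here, a short usage sketch of iter_long based on the linked docs (the word list is illustrative):

import ahocorasick

A = ahocorasick.Automaton()
for w in ("app", "apple"):
    A.add_word(w, w)
A.make_automaton()

text = "The apple is red"
print([w for _, w in A.iter(text)])       # ['app', 'apple']
print([w for _, w in A.iter_long(text)])  # ['apple']

As far as I can tell, it suppresses shorter matches nested inside longer ones rather than checking the characters around a match, which is the point raised in the next comment.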

Closing now! Thank you all :heart:

kootenpv commented 2 years ago

While iter_long is a nice addition (thank you!), I don't see how it helps with ensuring word boundaries.