WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
927 stars 122 forks source link

Library usage questions... #104

Open tflorac opened 5 years ago

tflorac commented 5 years ago

Hi,

I'm just starting to discover and use your library to extract terms defined in a thesaurus from an input text, and "highlight" them in a HTML output. It works quite well and quickly, anyway I still have a few issues:

Many thanks for any advise!

Best regards, Thierry

WojciechMula commented 5 years ago

Hi Thierry, glad to hear you found pyahocorasick module useful.

There's no way to limit number of hits, you have to do this in your python code. But it seems to be an interesting feature. Would you open a feature request for this?

Automaton will always return all occurrences of strings from the set, even when they overlap. I think the approach from #21 is what you need, but it's not ready. Seems I'll need to back to this, you're another person who requests this.

tflorac commented 5 years ago

Thanks for your reply! I will open another ticket for a feature request!! And if you need help for testing, just ask! ;)

WojciechMula commented 5 years ago

@tflorac thank you, I always need help, as my day usually has 24h only :)

tflorac commented 5 years ago

I'll complete my initial question with another one... I found a solution to only get the first found occurrence of a string, and I'm now looking for a solution to handle my second problem: how to handle use case where a term overlaps another one! It seems that when overlap occurs at the end of the longest string (for example, with the terms "Python" and "Advanced Python"), the longest one (here "Advanced Python") appears first, but with the same position as the second one ; so I think that I can easilly handle this case by remembering the last position at which I already found a term. Can you confirm this point? The problem persists when two terms overlap at the start ("Python" and "Python Book" for example); is there any rule in this case which defines the order in which found terms will be returned?

WojciechMula commented 5 years ago

@tflorac I'm afraid there is no such rule. The automaton work this way that if it matches the last character of some word, it also can figure out other words with the same suffix. But there's no notion of "the first character of match".

BTW, what should be highlighted if you have set ["Advanced Python", "Python Book"] and the string is "Advanced Python Book"? There is partial overlap.

tflorac commented 5 years ago

@WojciechMula: well... so at first I'm going to follow a simple process where if two terms (or more) match at the same position, the first one will win and will be selected!

I don't have any rule actually to answer to your question in the use case you give. I can't define any priority between the three terms, except by giving a "weight" to each term (statically or, for example, based on their respective length)... :-/

kootenpv commented 5 years ago

I also came here to request exactly this... I think the functionality "keep the longest match" will be very well received :)

@WojciechMula Oh and thanks for being so quick on responding to people's requests, I'm having a great time. I'm building a library on top of this: I pretty much finished it but only now realized the overlap "issue" :(

In my case I want to add 2 strings and count on the fact that only the longest one will be kept.

kootenpv commented 5 years ago

I think this works as python logic, but it would be awesome if we could get the speed from c as I think this will be really slow:

from ahocorasick import Automaton

a = Automaton()

a.add_word("no", "no")
a.add_word("no thanks", "no thanks")
a.add_word("thanks", "thanks")

a.make_automaton()

ranges = [(x[0] - len(x[1]) + 1, x[0] + 1, x[1]) for x in a.iter("no thanks")]

# [(0, 2, 'no'), (0, 9, 'no thanks'), (3, 9, 'thanks')]

def remove_overlap(ranges):
    result = []
    current_stop = -1 
    for start, stop, value in ranges:
        if start > current_stop:
            # this segment starts after the last segment stops
            # just add a new segment
            result.append((stop - start, value))
            current_stop = stop
        elif stop - start > result[-1][0]:
            # segments overlap, replace
            result[-1] = (stop - start, value)
            # current_start already guaranteed to be lower
            current_stop = max(current_stop, stop)

    return [x[1] for x in result]

remove_overlap(ranges)
['no thanks']

In my case it seems that incorporating this code yields a ~15% slowdown.

WojciechMula commented 5 years ago

@kootenpv Thank you for this example! I think this request depends on #82.