WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License

How to solve the problem of overlapping matches? #169

Closed: zwd2080 closed this issue 1 year ago

zwd2080 commented 2 years ago

The output contains overlapping matches between different phrases. How can I solve this overlapping problem? Is there any advice?

The example below shows the overlapping results, which I want to avoid:

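import re
import ahocorasick

def preprocess(text):
    # Lowercase, turn every non-letter into '_', and pad with '_' so keys match whole words only.
    return '_{}_'.format(re.sub('[^a-z]', '_', text.lower()))

index = ahocorasick.Automaton()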

for city in ['Petersburg Town', 'Saint Petersburg', 'Saint Petersburg Town']:
    # Key: normalized name; value: original city name.
    index.add_word(preprocess(city), city)
index.make_automaton()

def find_cities(text, searcher):
    result = dict()
    for end_index, city_name in searcher.iter(preprocess(text)):
        # end_index points at the key's trailing '_' in the preprocessed text,
        # which is shifted right by one because of the leading '_'.
        end = end_index - 1
        start = end - len(city_name)
        result[(start, end)] = city_name
    return result

print(find_cities( 'BEIJING and Saint Petersburg Town', index))
# Output: {(12, 28): 'Saint Petersburg', (12, 33): 'Saint Petersburg Town', (18, 33): 'Petersburg Town'}

abcdenis commented 1 year ago

Aho-Corasick simply finds all matching substrings. You, as the developer, have to decide what to do in this situation: keep the first match, keep the last match, raise an error such as "no standalone substrings found", and so on.
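
One possible way to apply such a policy is to post-filter the dictionary returned by find_cities() above. The sketch below assumes a "longest span wins" rule; the helper name keep_longest_non_overlapping and that rule are illustrative choices, not part of pyahocorasick, and other policies (first match, last match) would need a different sort key.

```python
def keep_longest_non_overlapping(matches):
    """Greedy filter: prefer longer spans, drop spans overlapping a kept one.

    `matches` maps (start, end) -> value with `end` exclusive, in the same
    shape as the dict built by find_cities(). Hypothetical helper, shown
    only to illustrate one overlap policy.
    """
    kept = {}
    # Longest spans first; ties are broken by the leftmost start.
    ordered = sorted(
        matches.items(),
        key=lambda item: (-(item[0][1] - item[0][0]), item[0][0]),
    )
    for (start, end), value in ordered:
        # Half-open spans are disjoint when one ends before the other starts.
        if all(end <= s or start >= e for s, e in kept):
            kept[(start, end)] = value
    # Return the surviving spans ordered by position in the text.
    return dict(sorted(kept.items()))


matches = {
    (12, 28): 'Saint Petersburg',
    (12, 33): 'Saint Petersburg Town',
    (18, 33): 'Petersburg Town',
}
print(keep_longest_non_overlapping(matches))
# {(12, 33): 'Saint Petersburg Town'}
```

The filter compares every candidate against all kept spans, which is fine for a handful of matches; for large result sets an interval-based approach would likely scale better.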

pombredanne commented 1 year ago

@zwd2080 This works using iter_long, which keeps only the longest match:

>>> def find_cities_long(text, searcher):
...     result = dict()
...     for end_index, city_name in searcher.iter_long(preprocess(text)):
...         end = end_index - 1
...         start = end - len(city_name)
...         occurrence_text = text[start:end]
...         result[(start, end)] = city_name
...     return result
... 
>>> print(find_cities( 'BEIJING and Saint Petersburg Town', index))
{(12, 28): 'Saint Petersburg', (12, 33): 'Saint Petersburg Town', (18, 33): 'Petersburg Town'}
>>> print(find_cities_long( 'BEIJING and Saint Petersburg Town', index))
{(12, 33): 'Saint Petersburg Town'}