zwd2080 closed this issue 1 year ago
Aho-Corasick just finds all the substrings. You (as the developer) have to decide what to do in this situation: choose the first match, choose the last one, raise an error such as "no standalone substrings found", etc.
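One possible policy, as a rough sketch rather than anything built into the library: post-process the matches and keep only the longest spans that do not overlap. The helper name `drop_overlaps` is hypothetical, and it assumes the matches are already collected in the `{(start, end): name}` form used by `find_cities` in this thread, with `end` exclusive. It is a greedy heuristic (longest first, leftmost on ties), so it is not guaranteed to keep the largest possible number of matches.

```python
def drop_overlaps(matches):
    """Keep a non-overlapping subset of matches, preferring longer spans.

    `matches` maps (start, end) spans (end exclusive) to the matched name.
    Hypothetical helper for illustration; not part of any library.
    """
    # Longest spans first; ties go to the leftmost span.
    ordered = sorted(matches, key=lambda span: (span[0] - span[1], span[0]))
    kept = {}
    for start, end in ordered:
        # Two half-open spans are disjoint when one ends before the other starts.
        if all(end <= s or start >= e for s, e in kept):
            kept[(start, end)] = matches[(start, end)]
    # Return the surviving spans in left-to-right order.
    return dict(sorted(kept.items()))


matches = {
    (12, 28): 'Saint Petersburg',
    (12, 33): 'Saint Petersburg Town',
    (18, 33): 'Petersburg Town',
}
print(drop_overlaps(matches))
# {(12, 33): 'Saint Petersburg Town'}
```

Swapping the sort key changes the policy: sorting by `span[0]` alone, for example, would prefer the first match instead of the longest.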
@zwd2080 This works using iter_long:
>>> def find_cities_long(text, searcher):
...     result = dict()
...     for end_index, city_name in searcher.iter_long(preprocess(text)):
...         end = end_index - 1
...         start = end - len(city_name)
...         occurrence_text = text[start:end]
...         result[(start, end)] = city_name
...     return result
...
>>> print(find_cities('BEIJING and Saint Petersburg Town', index))
{(12, 28): 'Saint Petersburg', (12, 33): 'Saint Petersburg Town', (18, 33): 'Petersburg Town'}
>>> print(find_cities_long('BEIJING and Saint Petersburg Town', index))
{(12, 33): 'Saint Petersburg Town'}
The output contains overlapping matches between different phrases. How can I solve this overlap problem? Is there any advice?
As shown in the example below, I want the output to be: