Word segmentation on strings without spaces

WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm

BSD 3-Clause "New" or "Revised" License

948 stars 125 forks source link

Word segmentation on strings without spaces #44

Closed ChristosChristofidis closed 6 years ago

ChristosChristofidis commented 8 years ago

Could aho corasick (this module in particular) be used for such task ? I am asking as I have not previously used said algorithm and I didn't find an example showcasing this use.

For Example:

"hellotherehowareyou" --> "hello there how are you" Now with dynamic programing you could do this in O(N^2) . . . can we do better ? aho corasick seems like the way to go?

WojciechMula commented 8 years ago

The Aho-Corasick algorithm allows you to find all occurrences of words from a list. So I think it may help in your problem.

However, I'm pretty sure it won't solve problem of ambiguous strings. What about: "thereare..." -> "there are..."/"the rear e..."?

You will get positions of all words from your list, but later you need to select one of many valid options. Although locating words is done in linear time, I guess the whole algorithm will have worse complexity.

ChristosChristofidis commented 8 years ago

Thank you for your answer.

That kind of ambiguity will not be a major problem in my case I think . Can it also match substrings instead of just prefixes? Say I search for 'hi' in a list of strings. Can I get back the word "this" or "matching"? Right now If I search 'hi' I will get something like "hi","hill" ,"hip","hixxx" ect

Is something like that possible?

WojciechMula commented 8 years ago

Automaton searches for all words you add to it, in order to recognize "this" and "matching" you must add these words. Here is a sample program:

import ahocorasick

# build automaton
A = ahocorasick.Automaton()
for index, word in enumerate('hi this matching ching'.split()):
    A.add_word(word, (index, word))

A.make_automaton()

# search words
for item in A.iter('this matching'):
    print(item)

It prints:

(2, (0, 'hi'))
(3, (1, 'this'))
(10, (0, 'hi'))
(12, (2, 'matching'))
(12, (3, 'ching'))

As you see all words were found, even overlapped.

matanox commented 6 years ago

On the same aspect, any built-in option to avoid matching search words appearing as sub-words?

WojciechMula commented 6 years ago

@matanster There's no such option. AFAIK automaton is not able to distinguish a proper word from its sub-words.