WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
948 stars 125 forks source link

ahocorasick search(Automaton.iter() function) - ignore white space option from input string #84

Closed gladtosee closed 5 years ago

gladtosee commented 6 years ago

input_string = '_sh e'

A = ahocorasick.Automaton();
words = "he her hers she".split()
for word in words:
    A.add_word(word, word)
A.make_automaton()
input_string = '_sh e'
list(A.iter(input_string)) #will be return an empty list
#list(A.iter(input_string, ignore_white_space=True)) #i want this result -> [(4, 'she'), (4, 'he')], but not working
WojciechMula commented 6 years ago

@gladtosee Sorry for delay and thank you very much for the MR. Let me review it. @pombredanne would you take a look? Any comments? Generally I like the idea, however I'm curious if plugging a custom predicate method would be useful.

pombredanne commented 6 years ago

@gladtosee thanks for this contribution! can you elaborate a bit your use case? Also since you want to ignore spaces what about using this instead directly?

>>> import ahocorasick
>>> A = ahocorasick.Automaton();
>>> words = "he her hers she".split()
>>> for word in words:
...     A.add_word(word, word)
... 
True
True
True
True
>>> A.make_automaton()
>>> input_string = '_sh e'
>>> list(A.iter(''.join(input_string.split())))
[(3, 'she'), (3, 'he')]
gladtosee commented 6 years ago

@pombredanne hi? I want know original matching string. In this example match strings are

's     h      e' and 'h      e'
>>> import ahocorasick
>>> A = ahocorasick.Automaton();
>>> words = "he her hers she".split()
>>> for word in words:
...     A.add_word(word, word)
... 
True
True
True
True
>>> A.make_automaton()
>>> input_string = '_ s     h      e'
>>> list(A.iter(''.join(input_string.split())))
[(3, 'she'), (3, 'he')]
>>> list(A.iter(input_string, ignore_white_space=True))
[(15, 'she'), (15, 'he')]
>>> def find_not_space_start(s, end_index, length):
>>>     i = end_index
>>>     not_space_count = 0
>>>     while i >= 0 and not_space_count < length:
>>>         if not s[i].isspace():
>>>             not_space_count += 1
>>>             if not_space_count == length:
>>>                 return i
>>>         i -= 1
>>> 
>>>     raise ValueError('not found exact string!!')
>>> for end_index, key in A.iter(input_string, ignore_white_space=True):
...     origin_length = len(key)
...     start_index = find_not_space_start(input_string, end_index, origin_length)
...     raw_find_str = input_string[start_index:end_index + 1]
...     print('[%s]' % raw_find_str)    
[s     h      e]
[h      e]
gladtosee commented 6 years ago

@WojciechMula @pombredanne Please review this commit...

WojciechMula commented 6 years ago

@gladtosee Sorry for late answer, I wish I had enough time and energy to handle everything I want. :) I have two generic remarks:

  1. This solution uses isspace from stdlib. I understand it's your use case, but I'd love to have possibility to plug a generic predicate, preferable a custom python procedure. Of course at this point your solution is OK, I mean it opens new possibilities, that I really like.
  2. I'm a bit concerned about possible performance issues. Could you please run benchmark we have in the repository. Please run it on the same machine, on code with and without your changes.
gladtosee commented 6 years ago

@WojciechMula

This is the only part of the performance involved. https://github.com/WojciechMula/pyahocorasick/pull/85/commits/20affb532a901cef1ce8373f7ba0a023e31142f3#diff-0e4469221b508367ef77172960d76afdR268

    while(iswspace(iter->input.word[iter->index]) && (iter->index < iter->end)) {
        iter->index += 1;
    }

So there seems to be little impact on performance.

my mac environment

MacBook Pro (15-inch, 2017)
2.9 GHz Intel Core i7
16GB 2133 MHz LPDDR3
Radeon Pro 560 4096 MB
Intel HD Graphics 630 1536 MB

before ignore_white_space option

Generating data (1000000 words)         : benchmark3.py:24: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
  self.start = clock()
benchmark3.py:27: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
  self.stop = clock()
33.384 s
Add words                               : 2.384 s
Building automaton                      : 20.110 s
Look up                                 : 3.352 s
Search                                  : 1.081 s

after ignore_white_space option

Generating data (1000000 words)         : benchmark3.py:24: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
  self.start = clock()
benchmark3.py:27: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
  self.stop = clock()
32.609 s
Add words                               : 2.335 s
Building automaton                      : 20.222 s
Look up                                 : 3.277 s
Search                                  : 1.089 s
WojciechMula commented 6 years ago

@gladtosee Thank you very much for performance tests, it's clear that impact it negligible.

I'm for your change. @pombredanne, do you have any comments or concerns?

pombredanne commented 6 years ago

@WojciechMula I tested it and may not have a use for the feature, but this is LGTM, clean and tested. So +1 from me.

WojciechMula commented 6 years ago

@pombredanne Great, thank you. I'll merge and release new version tomorrow.