Closed gladtosee closed 5 years ago
@gladtosee Sorry for delay and thank you very much for the MR. Let me review it. @pombredanne would you take a look? Any comments? Generally I like the idea, however I'm curious if plugging a custom predicate method would be useful.
@gladtosee thanks for this contribution! can you elaborate a bit your use case? Also since you want to ignore spaces what about using this instead directly?
>>> import ahocorasick
>>> A = ahocorasick.Automaton();
>>> words = "he her hers she".split()
>>> for word in words:
... A.add_word(word, word)
...
True
True
True
True
>>> A.make_automaton()
>>> input_string = '_sh e'
>>> list(A.iter(''.join(input_string.split())))
[(3, 'she'), (3, 'he')]
@pombredanne hi? I want know original matching string. In this example match strings are
's h e' and 'h e'
>>> import ahocorasick
>>> A = ahocorasick.Automaton();
>>> words = "he her hers she".split()
>>> for word in words:
... A.add_word(word, word)
...
True
True
True
True
>>> A.make_automaton()
>>> input_string = '_ s h e'
>>> list(A.iter(''.join(input_string.split())))
[(3, 'she'), (3, 'he')]
>>> list(A.iter(input_string, ignore_white_space=True))
[(15, 'she'), (15, 'he')]
>>> def find_not_space_start(s, end_index, length):
>>> i = end_index
>>> not_space_count = 0
>>> while i >= 0 and not_space_count < length:
>>> if not s[i].isspace():
>>> not_space_count += 1
>>> if not_space_count == length:
>>> return i
>>> i -= 1
>>>
>>> raise ValueError('not found exact string!!')
>>> for end_index, key in A.iter(input_string, ignore_white_space=True):
... origin_length = len(key)
... start_index = find_not_space_start(input_string, end_index, origin_length)
... raw_find_str = input_string[start_index:end_index + 1]
... print('[%s]' % raw_find_str)
[s h e]
[h e]
@WojciechMula @pombredanne Please review this commit...
@gladtosee Sorry for late answer, I wish I had enough time and energy to handle everything I want. :) I have two generic remarks:
@WojciechMula
This is the only part of the performance involved. https://github.com/WojciechMula/pyahocorasick/pull/85/commits/20affb532a901cef1ce8373f7ba0a023e31142f3#diff-0e4469221b508367ef77172960d76afdR268
while(iswspace(iter->input.word[iter->index]) && (iter->index < iter->end)) {
iter->index += 1;
}
So there seems to be little impact on performance.
my mac environment
MacBook Pro (15-inch, 2017)
2.9 GHz Intel Core i7
16GB 2133 MHz LPDDR3
Radeon Pro 560 4096 MB
Intel HD Graphics 630 1536 MB
before ignore_white_space option
Generating data (1000000 words) : benchmark3.py:24: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
self.start = clock()
benchmark3.py:27: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
self.stop = clock()
33.384 s
Add words : 2.384 s
Building automaton : 20.110 s
Look up : 3.352 s
Search : 1.081 s
after ignore_white_space option
Generating data (1000000 words) : benchmark3.py:24: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
self.start = clock()
benchmark3.py:27: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
self.stop = clock()
32.609 s
Add words : 2.335 s
Building automaton : 20.222 s
Look up : 3.277 s
Search : 1.089 s
@gladtosee Thank you very much for performance tests, it's clear that impact it negligible.
I'm for your change. @pombredanne, do you have any comments or concerns?
@WojciechMula I tested it and may not have a use for the feature, but this is LGTM, clean and tested. So +1 from me.
@pombredanne Great, thank you. I'll merge and release new version tomorrow.
input_string = '_sh e'