The following changes solved the issue for me
diff --git a/RAKE/RAKE.py b/RAKE/RAKE.py
index a147f04..e8263ae 100644
--- a/RAKE/RAKE.py
+++ b/RAKE/RAKE.py
@@ -64,7 +64,7 @@ def separate_words(text):
@param text The text that must be split in to words.
@param min_word_return_size The minimum no of characters a word must have to be included.
"""
- splitter = re.compile('\W+')
+ splitter = re.compile('(?u)\W+')
words = []
for single_word in splitter.split(text):
current_word = single_word.strip().lower()
@@ -89,7 +89,7 @@ def build_stop_word_regex(stop_word_list):
for word in stop_word_list:
word_regex = r'\b' + word + r'(?![\w-])'
stop_word_regex_list.append(word_regex)
- return re.compile('|'.join(stop_word_regex_list), re.IGNORECASE)
+ return re.compile('(?u)'+'|'.join(stop_word_regex_list), re.IGNORECASE)
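For reference, a minimal sketch of why the (?u) flag matters for the splitter. This is written for Python 3, where re.ASCII emulates the ASCII-only matching that Python 2 applies by default; the sample text is my own illustration, not from the library:

import re

text = u'und da\xdfselbe nochmal'  # contains the German sharp s (ss ligature)

# Emulates the unpatched pattern: without the unicode flag, \W treats the
# sharp s as a non-word character, so the word is split in the middle.
ascii_splitter = re.compile(r'\W+', re.ASCII)

# The patched pattern: with (?u), the sharp s counts as a word character.
unicode_splitter = re.compile(r'(?u)\W+')

print(ascii_splitter.split(text))    # ['und', 'da', 'selbe', 'nochmal']
print(unicode_splitter.split(text))  # ['und', 'daßselbe', 'nochmal']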
Just to humor me before I start looking into this, could you pip uninstall and pip install it again and confirm the issue persists? A week or so ago we pushed a fix that should've addressed that exact issue, and we've historically had problems with pip doing proper upgrades.
Thanks for the quick reply!
I uninstalled/reinstalled. Problem persists.
I noticed the recent fix, but it only addressed the sentence_delimiters regex. My change fixes the word splitter and the stop_word_regex, which, as far as I can tell, have the same problem with unicode.
Awesome, you're right. Can you test your fix in python 3.x and create a PR I can look at?
The following small sample program demonstrates the problem:
import RAKE
Rake = RAKE.Rake(['da'])
print(Rake.run(u'und da\xdfselbe nochmal'))
as it returns [(u'\xdfselbe nochmal', 4.0), (u'und', 1.0)]
Tested with Python 2.7.6.
The issue seems to be that the regexes for word splitting and stop word removal are not unicode-aware.
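To make the stop-word side concrete, here is a rough sketch of how that produces the output above. Again Python 3, with re.ASCII standing in for Python 2's default; the '|' substitution only mimics how RAKE marks phrase boundaries at stop words, and the stop word 'da' matches the sample program:

import re

text = u'und da\xdfselbe nochmal'

# Unpatched behaviour: in ASCII mode the sharp s is not a word character,
# so the lookahead (?![\w-]) succeeds and the 'da' inside 'daßselbe' is
# wrongly matched as the stop word 'da'.
stop_ascii = re.compile(r'\bda(?![\w-])', re.IGNORECASE | re.ASCII)
print(stop_ascii.sub('|', text))    # und |ßselbe nochmal  -> candidate 'ßselbe nochmal'

# Patched behaviour: with (?u), the sharp s is a word character, the
# lookahead fails, and 'daßselbe' stays intact.
stop_unicode = re.compile(r'(?u)\bda(?![\w-])', re.IGNORECASE)
print(stop_unicode.sub('|', text))  # und daßselbe nochmal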