fabianvf / python-rake

MIT License
130 stars 35 forks source link

Unexpected results for German text with umlauts #33

Closed jbernau closed 6 years ago

jbernau commented 6 years ago

The following small sample program demonstrates the problem:

import RAKE

Rake = RAKE.Rake(['da'])
print(Rake.run(u'und da\xdfselbe nochmal'))

It returns [(u'\xdfselbe nochmal', 4.0), (u'und', 1.0)], i.e. the stop word 'da' is incorrectly stripped from inside 'da\xdfselbe'.

Tested with Python 2.7.6.

The issue seems to be that the regexes for word splitting and stop word removal are not Unicode-aware.
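To make the diagnosis concrete, here is a small standalone sketch (pure `re`, no RAKE involved) of the behaviour described above. On Python 3, str patterns like `\W+` are already Unicode-aware; compiling the same pattern in ASCII mode reproduces the Python 2 behaviour that splits 'daßselbe' at the 'ß':

```python
import re

text = 'und daßselbe nochmal'

# Unicode-aware splitting (Python 3 default, or '(?u)' on Python 2):
# 'ß' counts as a word character, so 'daßselbe' stays in one piece.
print(re.split(r'\W+', text))
# ['und', 'daßselbe', 'nochmal']

# ASCII-only splitting reproduces the reported bug: 'ß' is treated as a
# non-word character and the word is broken apart.
print(re.split(r'\W+', text, flags=re.ASCII))
# ['und', 'da', 'selbe', 'nochmal']
```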

jbernau commented 6 years ago

The following changes solved the issue for me

diff --git a/RAKE/RAKE.py b/RAKE/RAKE.py
index a147f04..e8263ae 100644
--- a/RAKE/RAKE.py
+++ b/RAKE/RAKE.py
@@ -64,7 +64,7 @@ def separate_words(text):
     @param text The text that must be split in to words.
     @param min_word_return_size The minimum no of characters a word must have to be included.
     """
-    splitter = re.compile('\W+')
+    splitter = re.compile('(?u)\W+')
     words = []
     for single_word in splitter.split(text):
         current_word = single_word.strip().lower()
@@ -89,7 +89,7 @@ def build_stop_word_regex(stop_word_list):
     for word in stop_word_list:
         word_regex = r'\b' + word + r'(?![\w-])'
         stop_word_regex_list.append(word_regex)
-    return re.compile('|'.join(stop_word_regex_list), re.IGNORECASE)
+    return re.compile('(?u)'+'|'.join(stop_word_regex_list), re.IGNORECASE)
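A quick way to check the patched stop-word regex in isolation is a minimal re-implementation mirroring the diff above (a sketch, not the library's actual code):

```python
import re

def build_stop_word_regex(stop_word_list):
    # Mirrors the patched function: each stop word becomes a
    # word-boundary-anchored alternative, compiled with '(?u)'.
    stop_word_regex_list = [r'\b' + word + r'(?![\w-])' for word in stop_word_list]
    return re.compile('(?u)' + '|'.join(stop_word_regex_list), re.IGNORECASE)

pattern = build_stop_word_regex(['da'])

# 'da' matches as a standalone word, but no longer inside 'daßselbe':
# with '(?u)', 'ß' counts as \w, so the lookahead '(?![\w-])' rejects it.
print(pattern.sub('|', 'und da daßselbe'))
# 'und | daßselbe'
```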
jkterry1 commented 6 years ago

Just to humor me before I start looking into this, could you pip uninstall and pip install it again and confirm the issue persists? A week or so ago we pushed a fix that should've addressed that exact issue, and we've historically had problems with pip doing proper upgrades.


jbernau commented 6 years ago

Thanks for the quick reply!

I uninstalled/reinstalled. Problem persists.

I noticed the recent fix, but it addressed only the sentence_delimiters regex. My change fixes the word splitter and the stop word regex, which have the same Unicode problem.

jkterry1 commented 6 years ago

Awesome, you're right. Can you test your fix in python 3.x and create a PR I can look at?
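For what it's worth, the patch should be safe on Python 3 as well: str patterns are Unicode-aware there by default, so the added `(?u)` is a no-op (a quick sketch, not the project's actual test suite):

```python
import re

# On Python 3, str patterns default to Unicode matching, so the '(?u)'
# flag added by the patch changes nothing there; it only affects Python 2.
plain = re.compile(r'\W+')
patched = re.compile(r'(?u)\W+')

sample = 'und daßselbe nochmal'
assert plain.split(sample) == patched.split(sample) == ['und', 'daßselbe', 'nochmal']
print('ok: (?u) is a no-op on Python 3')
```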