aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com

Issues with morphemes #111

Open arashsa opened 7 years ago

arashsa commented 7 years ago

I'm running polyglot on Mac OS X with Python 3.6.0.

When I try running this code:

from polyglot.text import Word
words = ["preprocessing", "processor", "invaluable", "thankful", "crossed"]
for w in words:
  w = Word(w, language="en")
  print("{:<20}{}".format(w, w.morphemes))

I get this Traceback:

---------------------------------------------------------------------------
UnicodeError                              Traceback (most recent call last)
<ipython-input-73-560dd062bf4f> in <module>()
      3 for w in words:
      4   w = Word(w, language="en")
----> 5   print("{:<20}{}".format(w, w.morphemes))

~/.pyenv/versions/polyglot/lib/python3.6/site-packages/polyglot/decorators.py in __get__(self, obj, cls)
     18     if obj is None:
     19         return self
---> 20     value = obj.__dict__[self.func.__name__] = self.func(obj)
     21     return value
     22 

~/.pyenv/versions/polyglot/lib/python3.6/site-packages/polyglot/text.py in morphemes(self)
    284   @cached_property
    285   def morphemes(self):
--> 286     words, score = self.morpheme_analyzer.viterbi_segment(self.string)
    287     return WordList(words, parent=self, language=self.language)
    288 

~/.pyenv/versions/polyglot/lib/python3.6/site-packages/polyglot/decorators.py in __get__(self, obj, cls)
     18     if obj is None:
     19         return self
---> 20     value = obj.__dict__[self.func.__name__] = self.func(obj)
     21     return value
     22 

~/.pyenv/versions/polyglot/lib/python3.6/site-packages/polyglot/text.py in morpheme_analyzer(self)
    280   @cached_property
    281   def morpheme_analyzer(self):
--> 282     return load_morfessor_model(lang=self.language)
    283 
    284   @cached_property

~/.pyenv/versions/polyglot/lib/python3.6/site-packages/polyglot/decorators.py in memoizer(*args, **kwargs)
     28     key = tuple(list(args) + sorted(kwargs.items()))
     29     if key not in cache:
---> 30       cache[key] = obj(*args, **kwargs)
     31     return cache[key]
     32   return memoizer

~/.pyenv/versions/polyglot/lib/python3.6/site-packages/polyglot/load.py in load_morfessor_model(lang, version)
    140   tmp_file_.close()
    141   io = morfessor.MorfessorIO()
--> 142   model = io.read_any_model(tmp_file_.name)
    143   os.remove(tmp_file_.name)
    144   return model

~/.pyenv/versions/polyglot/lib/python3.6/site-packages/morfessor/io.py in read_any_model(self, file_name)
    201         from morfessor import BaselineModel
    202         model = BaselineModel()
--> 203         model.load_segmentations(self.read_segmentation_file(file_name))
    204         _logger.info("%s was read as a segmentation" % file_name)
    205         return model

~/.pyenv/versions/polyglot/lib/python3.6/site-packages/morfessor/baseline.py in load_segmentations(self, segmentations)
    485 
    486         """
--> 487         for count, segmentation in segmentations:
    488             comp = "".join(segmentation)
    489             self._add_compound(comp, count)

~/.pyenv/versions/polyglot/lib/python3.6/site-packages/morfessor/io.py in read_segmentation_file(self, file_name, has_counts, **kwargs)
     51         """
     52         _logger.info("Reading segmentations from '%s'..." % file_name)
---> 53         for line in self._read_text_file(file_name):
     54             if has_counts:
     55                 count, compound = line.split(' ', 1)

~/.pyenv/versions/polyglot/lib/python3.6/site-packages/morfessor/io.py in _read_text_file(self, file_name)
    238         if encoding is None:
    239             if file_name != '-':
--> 240                 encoding = self._find_encoding(file_name)
    241 
    242         if file_name == '-':

~/.pyenv/versions/polyglot/lib/python3.6/site-packages/morfessor/io.py in _find_encoding(self, *files)
    318                 return encoding
    319 
--> 320         raise UnicodeError("Can not determine encoding of input files")

UnicodeError: Can not determine encoding of input files

YovaKem commented 7 years ago

I found this suggestion in https://github.com/aboSamoor/polyglot/issues/66: "So it seems that pip3 install morfessor won't pick the latest version. pip3 install 'morfessor>=2.0.2a1' (as per polyglot(1) warning) solved the issue."

That didn't work for me, but using pip instead of pip3 did: pip install 'morfessor>=2.0.2a1'
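
One possible reason pip3 and pip behave differently is that they point at different Python environments (the traceback shows a pyenv environment), though that is only a guess here. A minimal sketch for checking which interpreter is running and which morfessor version it actually sees is below. The helper name `installed_version` is made up for illustration, and `importlib.metadata` needs Python 3.8+; on Python 3.6 the `importlib_metadata` backport would be needed instead.

```python
import sys
from importlib import metadata  # standard library from Python 3.8 onward


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent.

    Hypothetical helper for illustration; not part of polyglot or morfessor.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Show the interpreter in use and the morfessor version visible to it.
print("interpreter:", sys.executable)
print("morfessor:  ", installed_version("morfessor"))
```

If the reported version is older than 2.0.2a1, or None, the upgrade probably went into a different environment; running `python -m pip install 'morfessor>=2.0.2a1'` with the same interpreter is one way to rule that out.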