aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com

Word.morphemes / morfessor UnicodeError #66

Closed by melissaboiko 8 years ago

melissaboiko commented 8 years ago

From the tutorial:

#!/usr/bin/env python3
import polyglot
from polyglot.text import Text, Word
word = Text("Preprocessing is an essential step.").words[0]
print(word.morphemes)

When I try to run it (after calling polyglot.downloader.downloader.download('morph2.en')), I get:

Traceback (most recent call last):
  File "./test.py", line 5, in <module>
    print(word.morphemes)
  File "/usr/local/lib/python3.4/dist-packages/polyglot/decorators.py", line 20, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/usr/local/lib/python3.4/dist-packages/polyglot/text.py", line 286, in morphemes
    words, score = self.morpheme_analyzer.viterbi_segment(self.string)
  File "/usr/local/lib/python3.4/dist-packages/polyglot/decorators.py", line 20, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/usr/local/lib/python3.4/dist-packages/polyglot/text.py", line 282, in morpheme_analyzer
    return load_morfessor_model(lang=self.language)
  File "/usr/local/lib/python3.4/dist-packages/polyglot/decorators.py", line 30, in memoizer
    cache[key] = obj(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/polyglot/load.py", line 142, in load_morfessor_model
    model = io.read_any_model(tmp_file_.name)
  File "/usr/local/lib/python3.4/dist-packages/morfessor/io.py", line 203, in read_any_model
    model.load_segmentations(self.read_segmentation_file(file_name))
  File "/usr/local/lib/python3.4/dist-packages/morfessor/baseline.py", line 487, in load_segmentations
    for count, segmentation in segmentations:
  File "/usr/local/lib/python3.4/dist-packages/morfessor/io.py", line 53, in read_segmentation_file
    for line in self._read_text_file(file_name):
  File "/usr/local/lib/python3.4/dist-packages/morfessor/io.py", line 240, in _read_text_file
    encoding = self._find_encoding(file_name)
  File "/usr/local/lib/python3.4/dist-packages/morfessor/io.py", line 320, in _find_encoding
    raise UnicodeError("Can not determine encoding of input files")
UnicodeError: Can not determine encoding of input files

Versions:

$ python3 --version
Python 3.4.2

$ pip3 show polyglot | grep Version
Version: 16.07.04

$ pip3 show morfessor | grep Version
Version: 2.0.1

melissaboiko commented 8 years ago

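So it seems that pip3 install morfessor won't pick the latest version, presumably because 2.0.2a1 is an alpha pre-release and pip skips pre-releases unless asked for them. Running pip3 install 'morfessor>=2.0.2a1' (as per the polyglot(1) warning) solved the issue.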
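
For anyone hitting the same traceback, a quick way to confirm the cause is to compare the installed morfessor version against the 2.0.2a1 minimum mentioned above. The following is only a minimal sketch: the helper names (version_key, morfessor_is_too_old, installed_morfessor_version) are made up for illustration and are not part of polyglot or morfessor, the version parser handles only plain X.Y.Z and X.Y.Z{a,b,rc}N strings rather than the full PEP 440 grammar, and importlib.metadata assumes Python 3.8 or newer (newer than the 3.4.2 in the report).

```python
import re
from importlib import metadata

# Minimum morfessor version named in the comment above.
MINIMUM = "2.0.2a1"

# Pre-release tags sort before the final release of the same number.
_PRE_ORDER = {"a": 0, "b": 1, "rc": 2, None: 3}


def version_key(version):
    """Turn 'X.Y.Z' or 'X.Y.Z{a|b|rc}N' into a sortable tuple.

    Simplified on purpose: dev/post/local versions are not supported.
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:(a|b|rc)(\d+))?", version)
    if match is None:
        raise ValueError("unsupported version string: %r" % version)
    release = tuple(int(part) for part in match.group(1).split("."))
    # Pad short releases so that '2.0' compares equal to '2.0.0'.
    release += (0,) * (3 - len(release))
    pre_tag = match.group(2)
    pre_num = int(match.group(3)) if match.group(3) else 0
    return (release, _PRE_ORDER[pre_tag], pre_num)


def morfessor_is_too_old(installed):
    """Return True if `installed` sorts before the 2.0.2a1 minimum."""
    return version_key(installed) < version_key(MINIMUM)


def installed_morfessor_version():
    """Return the installed morfessor version string, or None if absent."""
    try:
        return metadata.version("morfessor")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_morfessor_version()
    if found is None:
        print("morfessor is not installed")
    elif morfessor_is_too_old(found):
        print("morfessor %s is older than %s; try: "
              "pip3 install 'morfessor>=%s'" % (found, MINIMUM, MINIMUM))
    else:
        print("morfessor %s should be recent enough" % found)
```

With the versions from the report, morfessor_is_too_old("2.0.1") returns True, which matches the failure seen here.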

aboSamoor commented 8 years ago

Thanks for the tip.