barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 438: ordinal not in range(128) #21

Closed marchezinixd closed 6 years ago

marchezinixd commented 6 years ago

Hello, I'm having an issue when trying to use SpellChecker with other languages. English works fine, but all the others throw the following error.

from spellchecker import SpellChecker
spell = SpellChecker('es')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 438: ordinal not in range(128)

Any idea what's going on and how I can solve it?

barrust commented 6 years ago

That is likely a Python version issue. SpellChecker supports Python 3. Can you confirm that you are using Python 3?

python --version

marchezinixd commented 6 years ago

I'm using Python 3.5.2. The whole trace is this:


UnicodeDecodeError                        Traceback (most recent call last)
in ()
----> 1 spell = SpellChecker('es')

/usr/local/lib/python3.5/dist-packages/spellchecker/spellchecker.py in __init__(self, language, local_dictionary, distance)
     38                        'exist!').format(language)
     39                 raise ValueError(msg)
---> 40             self._word_frequency.load_dictionary(full_filename)
     41
     42     def __contains__(self, key):

/usr/local/lib/python3.5/dist-packages/spellchecker/spellchecker.py in load_dictionary(self, filename)
    287         try:
    288             with gzip.open(filename, 'rt') as fobj:
--> 289                 data = fobj.read().lower()
    290         except OSError:
    291             with open(filename, 'r') as fobj:

/usr/lib/python3.5/encodings/ascii.py in decode(self, input, final)
     24 class IncrementalDecoder(codecs.IncrementalDecoder):
     25     def decode(self, input, final=False):
---> 26         return codecs.ascii_decode(input, self.errors)[0]
     27
     28 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 464: ordinal not in range(128)

barrust commented 6 years ago

Strange, which version of spellchecker are you using?

import spellchecker as spell

print(spell.__version__)

marchezinixd commented 6 years ago

Version: 0.1.5

barrust commented 6 years ago

Strange, it should work on version 0.1.5, as all my tests pass on the same Python version. I have some changes in the works to force the gzip.open calls to use UTF-8 by passing the encoding explicitly.
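
For readers following along, here is a minimal sketch of what passing the encoding explicitly to gzip.open can look like. It is not the library's actual patch: the helper name load_text, the temporary file name, and the sample content are all made up for illustration. It also shows that decoding the same file as ASCII fails, because "ñ" is stored in UTF-8 as the bytes 0xc3 0xb1.

```python
import gzip
import os
import tempfile


def load_text(filename, encoding="utf-8"):
    """Read a gzip-compressed text file with an explicit encoding.

    Hypothetical helper for illustration; not part of pyspellchecker.
    """
    with gzip.open(filename, "rt", encoding=encoding) as fobj:
        return fobj.read().lower()


with tempfile.TemporaryDirectory() as tmpdir:
    # Made-up file name and content, standing in for a non-English dictionary.
    path = os.path.join(tmpdir, "sample.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fobj:
        fobj.write('{"Niño": 3}')

    # Explicit UTF-8 decoding works regardless of the locale settings.
    text = load_text(path)
    print(text)

    # Decoding the same bytes as ASCII fails on the first byte of "ñ" (0xc3).
    try:
        load_text(path, encoding="ascii")
        ascii_failed = False
    except UnicodeDecodeError as exc:
        ascii_failed = True
        print("ASCII decode failed:", exc)
```

When no encoding argument is given, text-mode reads fall back to the locale's preferred encoding, which is why the same code can pass on one machine and fail on another.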

marchezinixd commented 6 years ago

The problem was that the encoding Python 3 used to encode and decode was set to ASCII. I changed the encoding variables Python 3 picks up, and everything worked fine. Thanks for the support. We can close this.

For anyone with the same problem:
https://stackoverflow.com/questions/44344458/why-does-locale-getpreferredencoding-return-ansi-x3-4-1968-instead-of-utf-8
https://perlgeek.de/en/article/set-up-a-clean-utf8-environment
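
A quick way to check whether an environment is affected is to print the encodings Python reports. This is only a diagnostic sketch, and the function name describe_encodings is made up for illustration. If the "preferred" value comes back as ANSI_X3.4-1968 (another name for ASCII) rather than UTF-8, text files opened without an explicit encoding will be decoded as ASCII.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for this environment.

    Hypothetical helper for illustration only.
    """
    return {
        # Used by open() and gzip.open() in text mode when no encoding is given.
        "preferred": locale.getpreferredencoding(False),
        # Default for str.encode() / bytes.decode(); always utf-8 on Python 3.
        "default": sys.getdefaultencoding(),
        # Used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


encodings = describe_encodings()
for name, value in encodings.items():
    print("{}: {}".format(name, value))
```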

barrust commented 6 years ago

That is great information! Thanks!