barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License

Problem to set 'pt' language #48

Closed: lrthorita closed this issue 5 years ago

lrthorita commented 5 years ago

SpellChecker works for every supported language except Portuguese ('pt'). When I call spell = SpellChecker('pt'), I get the following error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 587200: character maps to <undefined>

I've also tried loading the dictionary explicitly instead:

spell = SpellChecker(language=None)
spell.word_frequency.load_dictionary('path/to/pt.json.gz', encoding=u'utf-8')

The same error occurs.

I'm using Python 3.6.8 on Windows 10.
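The error message is consistent with UTF-8 bytes being decoded by a Windows code page. A minimal sketch of that failure follows, assuming the default encoding on the affected machine was cp1252 (the thread does not state this). The helper try_decode is hypothetical and exists only for illustration.

```python
# Sketch: reproduce the 'charmap' failure without Windows or the real dictionary.
# "Á" (U+00C1) is encoded in UTF-8 as the two bytes 0xC3 0x81.
raw = "Á".encode("utf-8")


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or the error message on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: {}".format(exc)


utf8_result = try_decode(raw, "utf-8")     # decodes cleanly
cp1252_result = try_decode(raw, "cp1252")  # byte 0x81 has no mapping in cp1252 (assumed default)

print(utf8_result)
print(cp1252_result)
```

If the assumption holds, any Portuguese word whose UTF-8 encoding contains byte 0x81 would trigger the error as soon as the file is read with the platform default encoding.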

lrthorita commented 5 years ago

I found the problem. In the load_file function in spellchecker/utils.py, gzip.open is called without the encoding argument.

@barrust, please consider changing the corresponding line of the load_file function to:

with gzip.open(filename, mode="rt", encoding=encoding) as fobj:
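The suggested change can be tried in isolation with a small round trip. This is only a sketch for Python 3: the load_file below is a hypothetical stand-in written for this example, not the library's actual function, and the sample words and file name are made up.

```python
import gzip
import json
import os
import tempfile


def load_file(filename, encoding="utf-8"):
    """Hypothetical stand-in for the loader discussed above, not the library's code."""
    # Passing encoding explicitly avoids falling back to the platform default.
    with gzip.open(filename, mode="rt", encoding=encoding) as fobj:
        return fobj.read()


# Build a small gzip-compressed JSON word-frequency file to read back.
words = {"água": 3, "coração": 5}
path = os.path.join(tempfile.mkdtemp(), "pt_sample.json.gz")
with gzip.open(path, mode="wt", encoding="utf-8") as fobj:
    json.dump(words, fobj, ensure_ascii=False)

loaded = json.loads(load_file(path))
print(loaded)
```

With the encoding passed through, the read no longer depends on the operating system's default code page.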

barrust commented 5 years ago

Unfortunately, I am not seeing this issue on my Mac or Linux machines in either Python 3 or Python 2.7, so this may be a Windows-specific issue.

I am looking at passing the encoding explicitly to gzip.open, but that does not work in Python 2.7.
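One possible way around the Python 2.7 limitation is to read the gzip file in binary mode and decode the bytes afterwards, which avoids the encoding argument altogether. This is a sketch of that idea only; the thread does not say what the eventual fix did, and load_file_compat, the sample text, and the file name are hypothetical.

```python
import gzip
import json
import os
import tempfile


def load_file_compat(filename, encoding="utf-8"):
    """Hypothetical loader: read raw bytes, then decode them explicitly."""
    # Binary mode takes no encoding argument, so the same call works on 2.7 and 3.
    with gzip.open(filename, mode="rb") as fobj:
        return fobj.read().decode(encoding)


# Write a tiny UTF-8 JSON payload containing a non-ASCII key ("água").
text = u'{"\u00e1gua": 3}'
path = os.path.join(tempfile.mkdtemp(), "pt_sample.json.gz")
with gzip.open(path, mode="wb") as fobj:
    fobj.write(text.encode("utf-8"))

loaded = json.loads(load_file_compat(path))
print(loaded)
```

The trade-off is that the whole file is held in memory as bytes before decoding, which is usually acceptable for a dictionary file of this kind.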

Could you provide the stack trace? That may help me find a workaround.

Thanks!

barrust commented 5 years ago

There is a PR that should resolve your issue. Can you test it? It is on the hotfix/gzip-encoding branch.

lrthorita commented 5 years ago

Hi @barrust! Thanks for answering.

Indeed, it seems this problem occurs only on Windows. While debugging, I changed the line as I suggested above, and it worked.

I'll test the branch.

lrthorita commented 5 years ago

@barrust, I tested the branch you asked about. It works.