SimGus / Chatette

A powerful dataset generator for Rasa NLU, inspired by Chatito
MIT License

UnicodeDecodeError if use Russian language #50

Closed TatianaParshina closed 4 years ago

TatianaParshina commented 4 years ago

Hi,

I tried to create intents with Russian examples, but I get the following error:

Traceback (most recent call last):
  File "...\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "...\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "...\lib\site-packages\chatette\__main__.py", line 111, in <module>
    main()
  File "...\lib\site-packages\chatette\__main__.py", line 22, in main
    facade.run()
  File "...\lib\site-packages\chatette\facade.py", line 90, in run
    self.run_parsing()
  File "...\lib\site-packages\chatette\facade.py", line 95, in run_parsing
    self.parser.parse_file(self.master_file_path)
  File "...\lib\site-packages\chatette\parsing\parser.py", line 92, in parse_file
    line = self.input_file_manager.read_line()
  File "...\lib\site-packages\chatette\parsing\input_file_manager.py", line 163, in read_line
    line = self._current_file.readline()
  File "...\lib\site-packages\chatette\parsing\line_count_file_wrapper.py", line 28, in readline
    return self.f.readline()
  File "...\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 25: character maps to <undefined>

Intent with Russian phrase:

%[intent_no]
  у меня нет никаких вопросов
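
The traceback is consistent with a UTF-8 file being decoded with cp1252, the codec named in its last frames (Python's `open()` falls back to the locale's preferred encoding when none is given). The sketch below is only an illustration under that assumption, not Chatette's own code: the sample text mirrors the intent above, and `decode_or_error` is a hypothetical helper.

```python
# Assumption: the template file was saved as UTF-8 and looked like this
# (LF line ending, two-space indent, as in the report).
SAMPLE = "%[intent_no]\n  у меня нет никаких вопросов\n"
raw = SAMPLE.encode("utf-8")


def decode_or_error(data, encoding):
    """Hypothetical helper: return the decoded text, or the error raised."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        return err


# cp1252 has no character assigned to byte 0x8f, which is the second
# UTF-8 byte of the letter "я", so decoding stops there.
result_cp1252 = decode_or_error(raw, "cp1252")
print(type(result_cp1252).__name__, "at byte", result_cp1252.start)

# Decoding the same bytes as UTF-8 round-trips cleanly.
result_utf8 = decode_or_error(raw, "utf-8")
print("utf-8 round trip OK:", result_utf8 == SAMPLE)
```

With this sample, the failing byte sits at the same offset as in the reported error, which supports (but does not prove) the assumption about how the file was encoded.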

Versions:

Python 3.7.4
chatette==1.6.2
Windows 10

TatianaParshina commented 4 years ago

Closing the issue. This solution helped: https://stackoverflow.com/a/57134096/10171338

SimGus commented 4 years ago

Hello,

Encoding issues of this kind do happen often on Windows. I'm happy you could find a way to fix it.

For the record, an easy way to check or change the encoding of a file under Windows is to use the text editor Notepad++. To support every writing system (and emojis), Chatette expects input files to be encoded in UTF-8 (an 8-bit encoding of Unicode).
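
For anyone who prefers a script to an editor, here is a minimal sketch of converting a file to UTF-8 in Python. The function name `reencode_to_utf8` and the default source encoding `cp1251` (a common legacy Cyrillic code page) are assumptions made for illustration; use whatever encoding the file is actually in.

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding="cp1251"):
    """Rewrite `path` in place as UTF-8.

    Assumes the file is currently encoded in `source_encoding`;
    a wrong guess may raise UnicodeDecodeError or produce garbled text.
    """
    file_path = Path(path)
    # Work on raw bytes so that line endings are left untouched.
    text = file_path.read_bytes().decode(source_encoding)
    file_path.write_bytes(text.encode("utf-8"))
```

It is safer to keep a backup of the original file before running something like this, since the rewrite happens in place.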

Have a good day/afternoon/evening!