daniel-kukiela / nmt-chatbot

NMT Chatbot
GNU General Public License v3.0

Ignores decoding errors when reading files #160

Closed mxgordon closed 2 months ago

mxgordon commented 4 years ago

When I tried to use my own dataset (from the Cornell movie dialogues), prepare_data.py failed with:

Traceback (most recent call last):
  File "prepare_data.py", line 563, in <module>
    prepare()
  File "prepare_data.py", line 79, in prepare
    number_of_records = min(amount, sum(1 for _ in open_function(source_file_name, 'rt', encoding='utf-8', **additioan_params)))
  File "prepare_data.py", line 79, in <genexpr>
    number_of_records = min(amount, sum(1 for _ in open_function(source_file_name, 'rt', encoding='utf-8', **additioan_params)))
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 6816: invalid start byte

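So this change makes the file reads ignore decoding errors like this one. A few invalid bytes in a dataset no longer abort data preparation, which makes the program easier to use with custom data.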
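For illustration, here is a minimal sketch of the idea, assuming the fix amounts to passing `errors='ignore'` to Python's built-in `open`. The helper `count_lines` and the sample file are hypothetical and are not taken from `prepare_data.py`.

```python
import os
import tempfile


def count_lines(path, encoding="utf-8", errors="ignore"):
    """Count lines in a text file, dropping bytes that cannot be decoded.

    Hypothetical helper for illustration; not part of prepare_data.py.
    """
    with open(path, "rt", encoding=encoding, errors=errors) as handle:
        return sum(1 for _ in handle)


# Hypothetical sample: two lines, the first containing the stray byte 0xad,
# which is not a valid UTF-8 start byte.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"caf\xad latte\nsecond line\n")
    sample_path = tmp.name

# With errors="ignore" the invalid byte is skipped and counting succeeds.
line_count = count_lines(sample_path)

# Read the first line back to see what the decoder kept.
with open(sample_path, "rt", encoding="utf-8", errors="ignore") as handle:
    first_line = handle.readline()

# With the default strict handling, the same file raises UnicodeDecodeError.
try:
    count_lines(sample_path, errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

os.remove(sample_path)
```

Note that `errors='ignore'` silently drops the offending bytes, so some characters may go missing from the training text. If that matters, `errors='replace'` substitutes U+FFFD instead, which keeps the damage visible.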