Helsinki-NLP / Opus-MT

Open neural machine translation models and web services
MIT License

UnicodeDecodeError for multiple models #100

Open nikit-srivastava opened 2 months ago

nikit-srivastava commented 2 months ago

Hello,

I am getting the following UnicodeDecodeError:

  File "/usr/src/app/server.py", line 188, in <module>
    application = make_app(args)
  File "/usr/src/app/server.py", line 166, in make_app
    worker_pool = initialize_workers(services)
  File "/usr/src/app/server.py", line 147, in initialize_workers
    worker_pool[lang_pair] = TranslatorInterface(
  File "/usr/src/app/server.py", line 17, in __init__
    self.contentprocessor = ContentProcessor(
  File "/usr/src/app/content_processor.py", line 18, in __init__
    self.bpe_source = BPE(BPEcodes)
  File "/usr/src/app/apply_bpe.py", line 37, in __init__
    firstline = codes.readline()
  File "/usr/local/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 54: invalid start byte

for the following models:

"it-en" : "https://object.pouta.csc.fi/OPUS-MT-models/it-en/opus-2019-12-18.zip" # SentencePiece
"ja-en" : "https://object.pouta.csc.fi/OPUS-MT-models/ja-en/opus-2019-12-18.zip" # SentencePiece
"id-en" : "https://object.pouta.csc.fi/OPUS-MT-models/id-en/opus-2019-12-18.zip" # SentencePiece
"bn-en" : "https://object.pouta.csc.fi/OPUS-MT-models/bn-en/opus-2020-02-11.zip" # SentencePiece
"et-en" : "https://object.pouta.csc.fi/OPUS-MT-models/et-en/opus-2019-12-18.zip" # SentencePiece
"lv-en" : "https://object.pouta.csc.fi/OPUS-MT-models/lv-en/opus-2019-12-18.zip" # SentencePiece
"th-en" : "https://object.pouta.csc.fi/OPUS-MT-models/th-en/opus-2020-01-16.zip" # SentencePiece
"uk-en" : "https://object.pouta.csc.fi/OPUS-MT-models/uk-en/opus-2020-01-16.zip" # SentencePiece

For all of them except "lv-en", the error goes away when I switch to the BPE model. However, the SentencePiece models are the ones with better translation performance according to the shared metrics.
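The traceback shows apply_bpe.py reading the codes file as UTF-8 text. SentencePiece model files are binary, so one possible cause (an assumption on my side, not confirmed) is that a SentencePiece file is being handed to the BPE loader. A minimal sketch to check whether a given file is UTF-8 text at all; the function name looks_like_utf8_text and the example path are hypothetical and not part of Opus-MT:

```python
import codecs


def looks_like_utf8_text(path, probe_bytes=4096):
    """Return True if the first `probe_bytes` bytes of `path` decode as UTF-8.

    Hypothetical helper for diagnosis only; it is not part of Opus-MT.
    """
    with open(path, "rb") as handle:
        data = handle.read(probe_bytes)
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the probe
        # boundary, but still raises on bytes that are never valid UTF-8.
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    import sys

    # Usage: python check_codes.py <file> [<file> ...]
    for model_file in sys.argv[1:]:
        kind = "UTF-8 text" if looks_like_utf8_text(model_file) else "binary / not UTF-8"
        print(f"{model_file}: {kind}")
```

Running this on the files unpacked from one of the zips above (for example a hypothetical path such as it-en/source.spm) should show which files the BPE loader can read as text and which it cannot.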

Please let me know if I am doing something wrong.